I just rebased my branch in #4026 against develop and it still failed with the same error.
I see the problem. The 7.2.0 files are in master, but somehow they are currently missing from dev.
@ZUUL42, you can just merge 'master' to 'develop'. That is accepted git practice.
@trilinos/framework
From looking at the recent PRs, it looks like about 8 PRs are currently blocked by this bad Trilinos_pullrequest_gcc_7.2.0 build.
I added AT: RETEST to a few ATDM-related PRs to see if this autotester is working now.
@trilinos/framework
Looks like the PR build Trilinos_pullrequest_gcc_4.9.3_SERIAL is busted. It has taken down two PR testing iterations so far: https://github.com/trilinos/Trilinos/pull/4040#issuecomment-446773756 and https://github.com/trilinos/Trilinos/pull/4031#issuecomment-446789394. It is a configure failure, as shown here for example:
Processing enabled TPL: BLAS (enabled explicitly, disable with -DTPL_ENABLE_BLAS=OFF)
-- BLAS_LIBRARY_NAMES='blas blas_win32'
-- Searching for libs in BLAS_LIBRARY_DIRS=''
-- Searching for a lib in the set "blas blas_win32":
-- Searching for lib 'blas' ...
-- Searching for lib 'blas_win32' ...
-- NOTE: Did not find a lib in the lib set "blas blas_win32" for the TPL 'BLAS'!
-- ERROR: Could not find the libraries for the TPL 'BLAS'!
-- TIP: If the TPL 'BLAS' is on your system then you can set:
-DBLAS_LIBRARY_DIRS='<dir0>;<dir1>;...'
to point to the directories where these libraries may be found.
Or, just set:
-DTPL_BLAS_LIBRARIES='<path-to-libs0>;<path-to-libs1>;...'
to point to the full paths for the libraries which will
bypass any search for libraries and these libraries will be used without
question in the build. (But this will result in a build-time error
if not all of the necessary symbols are found.)
-- ERROR: Failed finding all of the parts of TPL 'BLAS' (see above), Aborting!
-- NOTE: The find module file for this failed TPL 'BLAS' is:
/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/Trilinos/cmake/tribits/common_tpls/FindTPLBLAS.cmake
which is pointed to in the file:
/scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/Trilinos/TPLsList.cmake
TIP: Even though the TPL 'BLAS' was explicitly enabled in input,
it can be disabled with:
-DTPL_ENABLE_BLAS=OFF
which will disable it and will recursively disable all of the
downstream packages that have required dependencies on it.
When you reconfigure, just grep the cmake stdout for 'BLAS'
and then follow the disables that occur as a result to see what impact
this TPL disable has on the configuration of Trilinos.
CMake Error at cmake/tribits/core/package_arch/TribitsProcessEnabledTpl.cmake:144 (MESSAGE):
ERROR: TPL_BLAS_NOT_FOUND=TRUE, aborting!
Call Stack (most recent call first):
cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:1711 (TRIBITS_PROCESS_ENABLED_TPL)
cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:202 (TRIBITS_PROCESS_ENABLED_TPLS)
cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
CMakeLists.txt:90 (TRIBITS_PROJECT)
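(For reference, the TIP above maps onto concrete configure settings like the fragment below. This is only an illustrative ctest -S sketch; the /usr/lib64 path is a placeholder, and the real problem here is presumably the node's environment rather than the option values themselves.)

```cmake
# Illustrative only: pass explicit BLAS settings through ctest_configure()
# so TriBITS does not have to search for the library.  The path below is a
# placeholder, not the actual layout on the PR test nodes.
set(EXTRA_CONFIGURE_OPTIONS
  "-DTPL_ENABLE_BLAS=ON"
  "-DBLAS_LIBRARY_DIRS=/usr/lib64"                # where libblas.so lives
  # or bypass the search entirely with full paths:
  # "-DTPL_BLAS_LIBRARIES=/usr/lib64/libblas.so"
  )

ctest_configure(
  BUILD   "${CTEST_BINARY_DIRECTORY}"
  SOURCE  "${CTEST_SOURCE_DIRECTORY}"
  OPTIONS "${EXTRA_CONFIGURE_OPTIONS}"
  RETURN_VALUE configure_rv
  )
```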
I'm seeing some warnings in the PR configure process:
CMake Warning:
Value of Trilinos_ENABLE_TESTS contained a newline; truncating
For example, see here.
Value of Trilinos_ENABLE_TESTS contained a newline; truncating
I have never seen that before. Would need to see the ctest -S driver code to see what that is about.
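One guess, without seeing the driver: this kind of warning can show up when the value handed to -DTrilinos_ENABLE_TESTS picked up a stray trailing newline somewhere (for example, from an environment variable that was set from command output). A defensive ctest -S driver could strip it; a minimal sketch, with a made-up env var name:

```cmake
# Hypothetical guard in a ctest -S driver: strip stray whitespace/newlines
# from a value read from the environment before handing it to configure.
set(enable_tests_value "$ENV{TRILINOS_ENABLE_TESTS}")   # assumed env var name
string(STRIP "${enable_tests_value}" enable_tests_value)

list(APPEND CONFIGURE_OPTIONS
  "-DTrilinos_ENABLE_TESTS=${enable_tests_value}")
```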
@bartlettroscoe - the last errors are all on one machine - it looks like Will realized that and took it back out of our pool. Not sure what the newline complaint is about though.
FYI: Just got a really strange PR autotester crash just now in https://github.com/trilinos/Trilinos/pull/4064#issuecomment-447956256 showing:
Status Flag 'Pull Request AutoTester' - Failure: Timed out waiting for job Trilinos_pullrequest_intel_17.0.1 to start: Total Wait = 603
- Other jobs have been previously started - We must stop them...
What does that mean?
@trilinos/framework
More crashes in https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448443534 and https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448448274. Both times, the Trilinos_pullrequest_gcc_7.2.0 build shows:
Checking out Revision 0d37651428ceee3028b577be97a9b87309bff684 (refs/remotes/origin/develop)
> git config core.sparsecheckout # timeout=10
> git checkout -f 0d37651428ceee3028b577be97a9b87309bff684
FATAL: Could not checkout 0d37651428ceee3028b577be97a9b87309bff684
hudson.plugins.git.GitException: Command "git checkout -f 0d37651428ceee3028b577be97a9b87309bff684" returned status code 128:
stdout:
stderr: fatal: unable to write new index file
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2002)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$800(CliGitAPIImpl.java:72)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2301)
@trilinos/framework,
And it crashed again: https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448452704. This time the build Trilinos_pullrequest_gcc_7.2.0 showed the error:
fatal: sha1 file '.git/objects/pack/tmp_pack_SSP6Df' write error: No space left on device
fatal: index-pack failed
Looks like the disk is filled up.
Do you (or Jenkins) have any system to warn you when a Jenkins slave's disk usage is getting too high?
FYI: Just had a PR testing iteration fail, as shown in https://github.com/trilinos/Trilinos/pull/4332#issuecomment-461130243, due to timeouts of some MueLu tests, shown in:
Are the nodes getting overloaded in the PR testing?
I don't think so. I was just looking because of something else I noticed, and we still have some open space for jobs (about 80 processors). Plus, this is not how Jenkins shows overloading; it tends to just queue the jobs instead. I think we have some network issues (as a WAG). I will ask a few questions.
FYI: PR #4336 testing iteration shown in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-461234824 failed due to a single Intel compiler license problem in the super expensive Intel 17 build (it took 5h 40min to build). Ouch.
@bartlettroscoe Is there a way to mark a PR test as "optional" like if it fails really quick can we just ignore it?
@mhoemmen asked:
@bartlettroscoe Is there a way to mark a PR test as "optional" like if it fails really quick can we just ignore it?
That is up to the @trilinos/framework team (which I am not a part of).
What the PR tester should do is be made smart enough to see that if someone does an AT: RETEST and there are no changes to the topic branch, then it should leave the builds that already passed alone and just run the builds that failed again. And if it was really smart, if it saw that just tests failed (like in the case above), it would only rerun the tests. And if it was really really smart, it would only rerun the tests that failed. But all of that adds complexity to the PR tester.
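For the "only rerun the tests that failed" case, CTest already records the previous failures in the build tree, so a driver that reused the build directory between attempts could do something like the rough sketch below (simplified; real test names would need regex escaping, and the PR tester currently starts from a clean workspace):

```cmake
# Rough sketch: rerun only the tests that failed in the previous ctest run.
# CTest writes the failures to Testing/Temporary/LastTestsFailed.log as
# lines of the form "<number>:<test name>".
set(last_failed_log
  "${CTEST_BINARY_DIRECTORY}/Testing/Temporary/LastTestsFailed.log")

if(EXISTS "${last_failed_log}")
  file(STRINGS "${last_failed_log}" failed_lines)
  set(failed_tests "")
  foreach(line IN LISTS failed_lines)
    string(REGEX REPLACE "^[0-9]+:" "" test_name "${line}")
    list(APPEND failed_tests "${test_name}")
  endforeach()
  string(REPLACE ";" "|" failed_regex "${failed_tests}")
  # Only run the previously failing tests (regex alternation of their names).
  ctest_test(BUILD "${CTEST_BINARY_DIRECTORY}" INCLUDE "${failed_regex}")
else()
  ctest_test(BUILD "${CTEST_BINARY_DIRECTORY}")
endif()
```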
FYI: PR test iteration https://github.com/trilinos/Trilinos/pull/4332#issuecomment-461262726 also failed due to Intel compiler license problems. Got email that SNL's corporate license server is down, so the Intel build will fail in every PR test iteration until they get it fixed :-(
But all of that adds complexity to the PR tester.
But it should be done.
I think it's time to redesign this to remove complexity. This is functionality we need, but the current design is already more complex than it should be. We have talked about this some, but I'm not sure when it will happen.
@trilinos/framework, given that the Intel PR build takes upwards of 6 hours to run and the fact that we fairly often suffer build failures due to problems communicating with the license server, I wonder if the payoff of having the Intel build is worth the cost. It would be good if we could go back over the last 6 months of PR testing history and see how many real code or test failures the Intel PR build caught that one of the other GCC builds did not also catch. Given the way the PR builds are named, it would be hard to do this analysis (because it is hard to match up the different PR builds that are part of the same PR testing iteration).
Now if the PR tester was more robust so that it only reran the builds that failed, this would be mitigated some and we could tolerate some of these flaky builds better.
Just some thoughts ...
In terms of catching build issues, the CUDA build is more valuable than the Intel build.
FYI: The testing iteration in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-461564312 showed a git update failure for the build Trilinos_pullrequest_gcc_4.8.4 showing:
> git checkout -f e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9
FATAL: Could not checkout e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9
hudson.plugins.git.GitException: Command "git checkout -f e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9" returned status code 128:
stdout:
stderr: fatal: unable to write new index file
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2002)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$800(CliGitAPIImpl.java:72)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2301)
What does "unable to write new index file" mean? Does this mean that the Jenkins slave ran out of disk space? That is what is suggested by:
Yes - I am working on this.
@trilinos/framework,
FYI: Looking at this query and this query, it would seem that every PR testing iteration that runs on the node tr-test-0.novalocal results in internal compiler errors when trying to build MueLu with the GCC 4.8.4 build. In the last two days, this has impacted (at a minimum) the PRs #4320, #4328, #4336, #4342, #4351, #4354, and #4355.
Can someone please remove the Jenkins slave tr-test-0.novalocal until this problem can be resolved? We have been waiting to merge #4336 and fix the SPARC 'master' + Trilinos 'develop' builds since 2/6/2019. We are getting zero testing of SPARC 'master' + Trilinos 'develop' due to the inability to merge PR #4336.
FYI: In PR #4336, we might have a new record for the number of false PR testing attempts at 6 tries and 6 failures (none of which relate to the actual changes on the branch). I am trying again, hoping that we might get lucky this time and not have the slave tr-test-0.novalocal being used.
@trilinos/framework,
FYI: Note that this query shows that configure, build, or test failures (timeouts) are shown in every PR iteration run on the node tr-test-0.novalocal since 2/6/2019 (except where no packages were enabled). This shows this node needs to be removed from PR testing.
FYI: The PR tester ran two more times and failed two more times, as shown in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-462089069 and https://github.com/trilinos/Trilinos/pull/4336#issuecomment-462108121, making it 0 for 8. I have put AT: WIP on #4336 for now since there is no sense in beating a dead horse. Once the PR system is fixed, I will turn this back on.
I am moving ATDM Trilinos testing to an 'atdm-develop-nightly' branch (see TRIL-260). One consequence of this is that I can directly merge this topic branch to 'atdm-develop-nightly' and fix the SPARC 'master' + Trilinos 'develop' builds, while this PR sits in limbo. But once this PR can be merged to 'develop', then 'develop' will merge to 'atdm-develop-nightly' just fine. That is kind of a nice workflow for extreme cases like this, actually.
FYI: It took 5 1/2 days and 10 PR testing iterations to get a simple revert PR to merge (#4336). But now it is merged. (But I had already taken steps to get the SPARC 'master' + Trilinos 'develop' testing back online.)
@trilinos/framework @william76 @jwillenbring @ZUUL42
The PR tester is reporting g++ compiler internal errors for #4357. The PR adds code to Zoltan2, but the compiler internal errors occur when compiling MueLu files.
Is there something I can do to resolve this problem and get the PR merged? Using the instructions to reproduce PR errors, I was able to successfully build Trilinos on a Linux workstation; the MueLu files reporting errors in CDash above compiled without problem in my attempt to reproduce.
According to Google, these errors arise from insufficient memory; most recommendations are to reduce concurrency in the build.
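FWIW, in a ctest -S driver those recommendations translate into a couple of knobs; a hedged sketch is below (the -j and PARALLEL_LEVEL values are placeholders, not what the PR scripts actually use):

```cmake
# Illustrative only: throttle build and test parallelism on nodes that are
# running out of memory.  The values below are placeholders.
set(CTEST_BUILD_FLAGS "-j4")   # passed to the native build tool by ctest_build()

ctest_build(BUILD "${CTEST_BINARY_DIRECTORY}" NUMBER_ERRORS build_errors)
ctest_test(BUILD "${CTEST_BINARY_DIRECTORY}" PARALLEL_LEVEL 4)
```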
I see an issue similar to what @kddevin reported: https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=4553706.
Is there something unique to the test machine novalocal?
The CUDA PR tester is showing errors in a ROL test that appear to be unrelated to my commits. See here. Never mind, I was looking at the wrong PR!
@trilinos/framework,
Just saw in https://github.com/trilinos/Trilinos/pull/4629#issuecomment-476982448 what looks like a random build error for the GCC 4.8.4 build, shown here:
collect2: error: ld terminated with signal 9 [Killed]
That looks like the "linking MueLu ran out of memory" situation that we've seen sometimes.
FYI: Looks like more random crashes and internal compiler errors in https://github.com/trilinos/Trilinos/pull/4744#issuecomment-477567876. The build on CDash shown here shows build errors like:
collect2: error: ld terminated with signal 9 [Killed]
and
g++: internal compiler error: Killed (program cc1plus)
That PR #4744 is just removing some warnings for unused variables, so it is hard to see how that could cause errors like this.
@trilinos/framework
The PR testing iteration https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477625924 just failed due to the gcc_7.2.0 build not being able to submit to CDash, even though the configure, build, and tests all passed, showing:
Starting configure step.
...
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Configure.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
configure submit error = -1
Configure suceeded.
Starting build step.
...
Build succeeded.
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Build.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
build submit error = -1
Starting testing step.
Tests succeeded.
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Test.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
test submit error = -1
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Upload.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
File upload submit error = -1
Single configure/build/test failed. The error code was: 255
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
I guess this means this branch is safe to merge, but now we have to set AT: RETEST to repeat all of the builds again and hope this does not occur again. If this fails again, I will need to directly merge the branch in #4750 to the atdm-nightly branch in order to fix the SPARC 'master' + Trilinos 'develop' testing process.
Can the PR tester be set up so that if the configure, build, and tests pass but the upload to CDash fails, we still pass that PR build? I know that is not ideal, but considering the instability of these PR machines, perhaps it is worth considering?
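For example (a rough sketch, not the actual PR driver), a ctest -S script can capture the submit result separately from the configure/build/test results and only fail the job on the latter:

```cmake
# Rough sketch: fail the job on configure/build/test problems, but only warn
# when the CDash submission itself fails (e.g. a 503 from the CDash server).
# (Only the last submit result is checked here, to keep the sketch short.)
ctest_configure(BUILD "${CTEST_BINARY_DIRECTORY}" RETURN_VALUE config_rv)
ctest_submit(PARTS Configure RETURN_VALUE submit_rv)

ctest_build(BUILD "${CTEST_BINARY_DIRECTORY}" NUMBER_ERRORS build_errors)
ctest_submit(PARTS Build RETURN_VALUE submit_rv)

ctest_test(BUILD "${CTEST_BINARY_DIRECTORY}" RETURN_VALUE test_rv)
ctest_submit(PARTS Test RETURN_VALUE submit_rv)

if(NOT config_rv EQUAL 0 OR build_errors GREATER 0 OR NOT test_rv EQUAL 0)
  message(FATAL_ERROR "Configure, build, or test failed; fail the PR build")
elseif(NOT submit_rv EQUAL 0)
  message(WARNING "CDash submit failed, but configure/build/test all passed")
endif()
```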
@trilinos/framework,
More CDash submit failures shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477722394, this time for the gcc_4.9.3 and gcc_7.2.0 builds. But the output is reporting that the configure, build, and tests all passed.
What is going on with all the CDash submit failures in PR builds? We don't seem to be seeing these in the ATDM Trilinos builds lately (on a very diverse set of machines).
More CDash submit failures shown in #4750 (comment), this time for the gcc_4.9.3 and gcc_7.2.0 builds. But the output is reporting that the configure, build, and tests all passed.
Same thing bit me yesterday too. Took 3 tries to get a PR through and the failed cdash submits were from the same machines.
Okay, so in the latest iteration shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477796722, the results submitted to CDash, but the gcc_7.2.0 build crashed with the random build error shown here:
g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
and the gcc_4.8.4 build crashed with the random build error shown here:
collect2: error: ld terminated with signal 9 [Killed]
My guess is that the machines these are running on are overloaded and are running out of RAM and are crashing the compiler.
@trilinos/framework team, can the parallel build levels be made more conservative to avoid these types of build errors?
FYI: On the 4th try, all the PR builds passed and submitted to CDash, as shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477865898. But the merge did not happen before the 'atdm-nightly' branch got updated last night at 10 PM Mountain Time, so if I had not manually merged this branch to 'atdm-nightly', we would have had to wait an extra day to get SPARC fixed.
@trilinos/framework, @trilinos/muelu
More random build and test failures shown in https://github.com/trilinos/Trilinos/pull/4772#issuecomment-478399274 and https://github.com/trilinos/Trilinos/pull/4772#issuecomment-478443579, with results on CDash here.
Looks like the gcc_4.8.4 build failed to link libmuelu.so both times with the error:
collect2: error: ld terminated with signal 9 [Killed]
And we saw a random test timeout for Ifpack2_Jac_sm_belos_MPI_1 for the gcc_7.2.0 build here.
Please help
CC: @mhoemmen
@trilinos/framework
More CDash upload failures crashing the PR testing iteration https://github.com/trilinos/Trilinos/pull/4791#issuecomment-479273970.
For example:
Build succeeded.
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Build.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
build submit error = -1
Starting testing step.
Tests succeeded.
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Test.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
test submit error = -1
Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Upload.xml
Error message was: The requested URL returned error: 503 Service Unavailable
Problems when submitting via HTTP
File upload submit error = -1
Single configure/build/test failed. The error code was: 255
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE
NOTE: The ATDM Trilinos CDash builds have not seen CDash upload failures in a long time (at least none that failed to submit to CDash). What is going on with these Trilinos PR builds having CDash upload failures?
@trilinos/framework
More CDash submit failures in https://github.com/trilinos/Trilinos/pull/4791#issuecomment-479359968.
Help!
What parameters are you using to ctest_submit @bartlettroscoe? I have currently got a retry of 5 and a delay of 3. I will have an hour or so this afternoon to look at this.
Paul
@prwolfe asked:
What parameters are you using to ctest_submit @bartlettroscoe? I have currently got a retry of 5 and a delay of 3. I will have an hour or so this afternoon to look at this.
Why are we not seeing retries in the STDOUT from ctest -S?
Not sure - it's in the code. Do you have an example of what that should look like?
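For reference, the retry settings being discussed map onto the ctest_submit() call itself, roughly like this (illustrative only; I can't say offhand what the retry attempts are supposed to look like in the ctest -S STDOUT):

```cmake
# Retry the CDash submission up to 5 times, waiting 3 seconds between tries.
ctest_submit(
  PARTS        Configure
  RETRY_COUNT  5
  RETRY_DELAY  3
  RETURN_VALUE submit_rv
  )

if(NOT submit_rv EQUAL 0)
  message(WARNING "Submit to CDash failed after retries (rv=${submit_rv})")
endif()
```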
@trilinos/framework,
Looking at the PR builds on CDash today shown here, it looks like there might be something broken. The recent PR builds run in the last 2 hours have names:
Site | Build Name | Config Err | Config Warn | Config Time | Build Err | Build Warn | Build Time | Test Not Run | Test Fail | Test Pass | Test Time | Start Time | Labels
---|---|---|---|---|---|---|---|---|---|---|---|---|---
tr-test-3.novalocal | PR-Werror_Pamgen-test-Trilinos_pullrequest_gcc_7.2.0-1200 | 0 | 0 | 4m 44s | 0 | 0 | 0s | 1 hour ago | (53 labels) | ||||
tr-test-1.novalocal | PR-Werror_Zoltan-test-Trilinos_pullrequest_gcc_7.2.0-1199 | 0 | 0 | 4m 28s | 0 | 0 | 0s | 1 hour ago | (53 labels) | ||||
ascic113 | PR-Werror_Triutils-test-Trilinos_pullrequest_gcc_7.2.0-1198 | 0 | 2 | 3m 45s | 0 | 0 | 0s | 1 hour ago | (53 labels) | ||||
ascic115 | PR-Werror_Tpetra-test-Trilinos_pullrequest_gcc_7.2.0-1197 | 0 | 2 | 3m 52s | 0 | 0 | 0s | 1 hour ago | (53 labels) | ||||
ascic158 | PR-Werror_ROL-test-Trilinos_pullrequest_gcc_7.2.0-1196 | 0 | 2 | 3m 4s | 0 | 0 | 0s | 1 hour ago | (53 labels) | ||||
ascic158 | PR-develop-test-Trilinos_pullrequest_gcc_7.2.0-1195 | 0 | 3 | 54s | 0 | 0 | 0s | 0 | 0 | 0 | 0s | 1 hour ago | (53 labels) |
tr-test-2.novalocal | PR-Werror_Isorropia-test-Trilinos_pullrequest_gcc_7.2.0-1194 | 0 | 0 | 4m 20s | 1 | 50 | 1h 51m 12s | 2 hours ago | (53 labels) | ||||
ascic114 | PR-Werror_FEI-test-Trilinos_pullrequest_gcc_7.2.0-1193 | 0 | 2 | 3m 44s | 0 | 0 | 0s | 2 hours ago | (53 labels) | ||||
tr-test-4.novalocal | PR-Werror_Epetra-test-Trilinos_pullrequest_gcc_7.2.0-1192 | 0 | 0 | 2m 59s | 12 | 50 | 1h 44m 12s | 2139 | 0 | 821 | 2m 54s | 2 hours ago | (53 labels) |
ascic166 | PR-Werror_Belos-test-Trilinos_pullrequest_gcc_7.2.0-1191 | 0 | 0 | 2m 33s | 11 | 50 | 1h 32m 11s | 1197 | 0 | 1763 | 6m 18s | 2 hours ago | (53 labels) |
ascic166 | PR-Werror_AztecOO-test-Trilinos_pullrequest_gcc_7.2.0-1190 | 0 | 0 | 2m 35s | 3 | 50 | 1h 32m 19s | 1322 | 0 | 1638 | 5m 59s | 2 hours ago | (53 labels) |
tr-test-0.novalocal | PR-Werror_Anasazi-test-Trilinos_pullrequest_gcc_7.2.0-1189 | 0 | 0 | 2m 48s | 9 | 50 | 1h 52m 13s | 2 hours ago | (53 labels) | ||||
ascic158 | PR-Werror_Amesos-test-Trilinos_pullrequest_gcc_7.2.0-1188 | 0 | 2 | 2m 45s | 0 | 0 | 0s | 2 hours ago | (53 labels) |
Where did the PR number go? Is the PR system broken?
@trilinos/framework, @trilinos/fei
The test FEI_fei_ubase_MPI_3 just randomly failed in my PR #4859, shown in https://github.com/trilinos/Trilinos/pull/4859#issuecomment-481990770, and took out the iteration. I know this is a random failure because it passed just fine in the next PR testing iteration and I did not change anything in my PR branch, as shown here.
It looks like this is not the first time this occurred (see https://github.com/trilinos/Trilinos/issues/1395#issuecomment-306329630). Going back over all of the PR testing history in the last 6 months, as shown in this query, we can see that this test has failed in 9 PR iterations, and it occurred in all of the PR builds:
Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
---|---|---|---|---|---|---|---|---|
tr-test-0.novalocal | PR-4859-test-Trilinos_pullrequest_gcc_4.8.4-3194 | FEI_fei_ubase_MPI_3 | Failed | 680ms | 2s 40ms | Completed (Failed) | 2019-04-11T01:59:26 UTC | 3 |
tr-test-1.novalocal | PR-4649-test-Trilinos_pullrequest_gcc_7.2.0-905 | FEI_fei_ubase_MPI_3 | Failed | 530ms | 1s 590ms | Completed (Failed) | 2019-03-22T15:52:04 UTC | 3 |
tr-test-0.novalocal | PR-4659-test-Trilinos_pullrequest_gcc_4.8.4-2897 | FEI_fei_ubase_MPI_3 | Failed | 690ms | 2s 70ms | Completed (Failed) | 2019-03-20T10:22:37 UTC | 3 |
tr-test-0.novalocal | PR-47-test-Trilinos_pullrequest_intel_17.0.1-40 | FEI_fei_ubase_MPI_3 | Failed | 250ms | 750ms | Completed (Failed) | 2019-02-25T14:26:08 UTC | 3 |
tr-test-0.novalocal | PR-4328-test-Trilinos_pullrequest_gcc_4.8.4-2410 | FEI_fei_ubase_MPI_3 | Failed | 1s 540ms | 4s 620ms | Completed (Failed) | 2019-02-08T18:54:50 UTC | 3 |
tr-test-0.novalocal | PR-4338-test-Trilinos_pullrequest_gcc_4.8.4-2389 | FEI_fei_ubase_MPI_3 | Failed | 1s | 3s | Completed (Failed) | 2019-02-06T20:46:42 UTC | 3 |
tr-test-1.novalocal | PR-7-test-Trilinos_pullrequest_intel_17.0.1-4 | FEI_fei_ubase_MPI_3 | Failed | 340ms | 1s 20ms | Completed (Failed) | 2018-12-18T15:51:20 UTC | 3 |
tr-test-1.novalocal | PR-7-test-Trilinos_pullrequest_intel_17.0.1-3 | FEI_fei_ubase_MPI_3 | Failed | 370ms | 1s 110ms | Completed (Failed) | 2018-12-13T16:30:05 UTC | 3 |
sisu.sandia.gov | PR-1000-test-Trilinos_pullrequest_gcc_4.9.3-9999 | FEI_fei_ubase_MPI_3 | Failed | 750ms | 2s 250ms | Completed (Failed) | 2018-11-29T20:41:06 UTC | 3 |
In all of these recent cases, the output looks like:
...
Total Time: 0.0106 sec
Summary: total = 54, run = 54, passed = 54, failed = 0
End Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name tr-test-0.novalocal and rank 1!
skipping test of fei::DirichletBCManager::finalizeBCEqn, which only runs on 1 proc.
test Eqns_unit.feiInitSlave only runs on 2 procs. returning.
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name tr-test-0.novalocal and rank 2!
skipping test of fei::DirichletBCManager::finalizeBCEqn, which only runs on 1 proc.
test Eqns_unit.feiInitSlave only runs on 2 procs. returning.
Result: TEST PASSED
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21781,1],1]
Exit code: 1
--------------------------------------------------------------------------
I suspect that if you look at all of those PRs, they will not have anything to do with FEI.
Can we go ahead and disable this test in the PR builds? Randomly failing 9 times in the last 6 months does not sound that bad but this adds up and I am sure there are other random test failures bringing down PR test iterations as well. (If the PR tester would just re-run PR builds that failed, this would be less of an issue.)
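If it helps, one low-tech way to take just this one test out of a build without touching the test itself is to mark it DISABLED at configure time; below is a hedged sketch using the standard CTest test property (TriBITS may also provide its own per-test disable cache options, which the framework team would know better than I do):

```cmake
# Hypothetical fragment (e.g. guarded by a PR-only option in the project's
# CMake): mark the flaky test as DISABLED so it is still configured and
# reported, but never actually run.
if(TEST FEI_fei_ubase_MPI_3)
  set_tests_properties(FEI_fei_ubase_MPI_3 PROPERTIES DISABLED TRUE)
endif()
```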
@bartlettroscoe builds GCC 7.2.0 #1188-1200 were me creating test builds for the remaining packages that have issues with -Werror set. I labeled them all with "Werror_[package name]".
@ZUUL42 said:
@bartlettroscoe builds GCC 7.2.0 #1188-1200 were me creating test builds for the remaining packages that have issues with -Werror set. I labeled them all with "Werror_[package name]".
In a case like this it may be better to post these to the "Experimental" CDash Track/Group than to send them to the "Pull Request" group. That would help to avoid confusion.
In a case like this it may be better to post these to the "Experimental" CDash Track/Group than to send them to the "Pull Request" group. That would help to avoid confusion.
I can do that. I've just been leaving PULLREQUEST_CDASH_TRACK as the default. Still picking up a few details here and there.
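For reference, the group a build lands in on CDash is just whatever the ctest -S driver passes to ctest_start(); a sketch of posting a one-off build to "Experimental" might look like the following (illustrative only; exactly how PULLREQUEST_CDASH_TRACK feeds into the PR driver is a detail I am not certain of):

```cmake
# Illustrative only: send a one-off build to the "Experimental" group on
# CDash instead of the "Pull Request" group the PR driver normally uses.
# (Newer CMake spells the keyword GROUP; older versions use TRACK.)
ctest_start(Experimental TRACK "Experimental")
# ... ctest_configure()/ctest_build()/ctest_test()/ctest_submit() as usual ...
```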
@trilinos/framework
Description
Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.
This Story is to log these failures and keep track of them in order to provide some statistics that can inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.
PR Builds Showing Random Failures
Below are a few examples of the stability problems (but not all of them).