
Trilinos auto PR tester stability issues #3276

Closed bartlettroscoe closed 1 year ago

bartlettroscoe commented 6 years ago

@trilinos/framework

Description

Over the last few weeks and months, the Trilinos auto PR tester has seen several cases where one or more PR builds for a given PR testing iteration failed to produce results on CDash or showed build or test failures that were not related to the changes on that particular PR.

This Story is to log these failures and keep track of them in order to provide some statistics that can inform how to address them. This should replace making comments in individual PRs that exhibit these types of problems, like #3260 and #3213.

PR Builds Showing Random Failures

Below are a few examples of the stability problems (these are not all of them).

| PR ID | Num PR builds to reach passing | First test trigger | Start first test | Passing test | Merge PR |
|-------|--------------------------------|--------------------|------------------|--------------|----------|
| #3258 | 2 | 8/8/2018 2:35 PM ET | 8/8/2018 2:44 PM | 8/8/2018 9:15 PM ET | Not merged |
| #3260 | 4 | 8/8/2018 5:22 PM ET | 8/8/2018 6:31 PM ET | 8/10/2018 4:13 AM ET | 8/10/2018 8:25 AM |
| #3213 | 3 | 7/31/2018 4:30 PM ET | 7/31/2018 4:57 PM ET | 8/1/2018 9:48 AM ET | 8/1/2018 9:53 AM ET |
| #3098 | 4 | 7/12/2018 12:52 PM ET | 7/12/2018 1:07 PM ET | 7/13/2018 11:12 PM ET | 7/14/2018 10:59 PM ET |
| #3369 | 6 | 8/29/2018 9:08 AM ET | 8/29/2018 9:16 AM ET | 8/31/2018 6:09 AM ET | 8/31/2018 8:33 AM ET |
rppawlo commented 5 years ago

I just rebased my branch in #4026 against develop and it still failed with the same error.

ZUUL42 commented 5 years ago

I see the problem. The 7.2.0 files are in master, but somehow they are currently missing from dev.

bartlettroscoe commented 5 years ago

I see the problem. The 7.2.0 files are in master, but somehow they are currently missing from dev.

@ZUUL42, you can just merge 'master' to 'develop'. That is accepted git practice.
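
Something like the following is all it should take (a minimal sketch, assuming 'origin' points at the main Trilinos repository; the merge could also be pushed to a branch and put up as a PR instead of pushed directly):

```
# Minimal sketch: merge 'master' into 'develop' to recover the missing 7.2.0
# files (assumes 'origin' is the main Trilinos remote).
git fetch origin
git checkout develop
git merge origin/master    # creates a merge commit carrying the missing files
git push origin develop
```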

bartlettroscoe commented 5 years ago

@trilinos/framework

From looking at the recent PRs, it looks like about 8 PRs are currently blocked by this bad Trilinos_pullrequest_gcc_7.2.0 build.

I added AT: RETEST to a few ATDM-related PRs to see if this autotester is working now.

bartlettroscoe commented 5 years ago

@trilinos/framework

Looks like the PR build Trilinos_pullrequest_gcc_4.9.3_SERIAL is busted. It has taken down two PR testing iterations so far: https://github.com/trilinos/Trilinos/pull/4040#issuecomment-446773756 and https://github.com/trilinos/Trilinos/pull/4031#issuecomment-446789394. It is a configure failure, as shown here, for example:

Processing enabled TPL: BLAS (enabled explicitly, disable with -DTPL_ENABLE_BLAS=OFF)
-- BLAS_LIBRARY_NAMES='blas blas_win32'
-- Searching for libs in BLAS_LIBRARY_DIRS=''
-- Searching for a lib in the set "blas blas_win32":
--   Searching for lib 'blas' ...
--   Searching for lib 'blas_win32' ...
-- NOTE: Did not find a lib in the lib set "blas blas_win32" for the TPL 'BLAS'!
-- ERROR: Could not find the libraries for the TPL 'BLAS'!
-- TIP: If the TPL 'BLAS' is on your system then you can set:
     -DBLAS_LIBRARY_DIRS='<dir0>;<dir1>;...'
   to point to the directories where these libraries may be found.
   Or, just set:
     -DTPL_BLAS_LIBRARIES='<path-to-libs0>;<path-to-libs1>;...'
   to point to the full paths for the libraries which will
   bypass any search for libraries and these libraries will be used without
   question in the build.  (But this will result in a build-time error
   if not all of the necessary symbols are found.)
-- ERROR: Failed finding all of the parts of TPL 'BLAS' (see above), Aborting!

-- NOTE: The find module file for this failed TPL 'BLAS' is:
     /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/Trilinos/cmake/tribits/common_tpls/FindTPLBLAS.cmake
   which is pointed to in the file:
     /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/Trilinos/TPLsList.cmake

TIP: Even though the TPL 'BLAS' was explicitly enabled in input,
it can be disabled with:
  -DTPL_ENABLE_BLAS=OFF
which will disable it and will recursively disable all of the
downstream packages that have required dependencies on it.
When you reconfigure, just grep the cmake stdout for 'BLAS'
and then follow the disables that occur as a result to see what impact
this TPL disable has on the configuration of Trilinos.

CMake Error at cmake/tribits/core/package_arch/TribitsProcessEnabledTpl.cmake:144 (MESSAGE):
  ERROR: TPL_BLAS_NOT_FOUND=TRUE, aborting!
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:1711 (TRIBITS_PROCESS_ENABLED_TPL)
  cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:202 (TRIBITS_PROCESS_ENABLED_TPLS)
  cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
  CMakeLists.txt:90 (TRIBITS_PROJECT)
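
For what it is worth, the TIP in that output maps to a configure option like the one below; the library path is purely hypothetical, since the real problem on these PR nodes is presumably the build environment on that one machine rather than a missing option:

```
# Sketch only: point the TriBITS BLAS TPL at an explicit library, per the TIP
# in the configure output above.  The path here is hypothetical.
cmake \
  -D TPL_ENABLE_BLAS=ON \
  -D TPL_BLAS_LIBRARIES="/usr/lib64/libblas.so" \
  /path/to/Trilinos
```
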
jhux2 commented 5 years ago

I'm seeing some warnings in the PR configure process:

CMake Warning:
  Value of Trilinos_ENABLE_TESTS contained a newline; truncating

For example, see here.

bartlettroscoe commented 5 years ago

Value of Trilinos_ENABLE_TESTS contained a newline; truncating

I have never seen that before. Would need to see the ctest -S driver code to see what that is about.
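
My guess at how that can happen: CMake prints that exact warning when a -D cache value contains a literal newline, so the driver may be assembling the option from a shell variable that picked one up. A purely speculative illustration (not the actual driver code):

```
# Speculative illustration only -- not the actual PR driver code.
# An embedded newline in a -D value makes CMake warn
# "Value of Trilinos_ENABLE_TESTS contained a newline; truncating".
OPTS=$'Trilinos_ENABLE_TESTS=ON\n-DTrilinos_ENABLE_MueLu=ON'   # hypothetical value
cmake -D "${OPTS}" /path/to/Trilinos
```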

prwolfe commented 5 years ago

@bartlettroscoe - the last errors are all on one machine. It looks like Will realized that and took it back out of our pool. Not sure what the newline complaint is about, though.

bartlettroscoe commented 5 years ago

FYI: Just got a really strange PR autotester crash just now in https://github.com/trilinos/Trilinos/pull/4064#issuecomment-447956256, showing:

Status Flag 'Pull Request AutoTester' - Failure: Timed out waiting for job Trilinos_pullrequest_intel_17.0.1 to start: Total Wait = 603

  • Other jobs have been previously started - We must stop them...

What does that mean?

bartlettroscoe commented 5 years ago

@trilinos/framework

More crashes in https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448443534 and https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448448274. Both times the Trilinos_pullrequest_gcc_7.2.0 build shows:

Checking out Revision 0d37651428ceee3028b577be97a9b87309bff684 (refs/remotes/origin/develop)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 0d37651428ceee3028b577be97a9b87309bff684
FATAL: Could not checkout 0d37651428ceee3028b577be97a9b87309bff684
hudson.plugins.git.GitException: Command "git checkout -f 0d37651428ceee3028b577be97a9b87309bff684" returned status code 128:
stdout: 
stderr: fatal: unable to write new index file
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2002)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$800(CliGitAPIImpl.java:72)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2301)
bartlettroscoe commented 5 years ago

@trilinos/framework,

And it crashed again: https://github.com/trilinos/Trilinos/pull/4079#issuecomment-448452704. This time the Trilinos_pullrequest_gcc_7.2.0 build showed the error:

fatal: sha1 file '.git/objects/pack/tmp_pack_SSP6Df' write error: No space left on device
fatal: index-pack failed

Looks like the disk has filled up.

Do you (or Jenkins) have any system to warn you when a Jenkins slave's disk usage is getting too high?
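
If not, even a simple cron-able check on each agent would catch this before the git checkouts start failing; a minimal sketch (the mount point, threshold, and notification address are placeholders, not the actual setup):

```
# Minimal sketch of a disk-usage alarm for a build agent.  Mount point,
# threshold, and mail address are placeholders.
THRESHOLD=90
USAGE=$(df --output=pcent /scratch | tail -1 | tr -dc '0-9')
if [ "${USAGE}" -ge "${THRESHOLD}" ]; then
  echo "WARNING: /scratch is ${USAGE}% full on $(hostname)" \
    | mail -s "Jenkins agent disk space warning" trilinos-framework@example.gov
fi
```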

bartlettroscoe commented 5 years ago

FYI: Just had a PR testing iteration fail, as shown in https://github.com/trilinos/Trilinos/pull/4332#issuecomment-461130243, due to timeouts of some MueLu tests.

Are the nodes getting overloaded in the PR testing?

prwolfe commented 5 years ago

I don't think so. I was just looking because of something else I noticed, and we still have some open space for jobs (about 80 processors). Plus, this is not how Jenkins shows overloading; it tends to just queue the jobs instead. I think we have some network issues (as a WAG). I will ask a few questions.

bartlettroscoe commented 5 years ago

FYI: PR #4336 testing iteration shown in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-461234824 failed due to a single Intel compiler license problem in the super expensive Intel 17 build (which took 5h 40min to build). Ouch.

mhoemmen commented 5 years ago

@bartlettroscoe Is there a way to mark a PR test as "optional" like if it fails really quick can we just ignore it?

bartlettroscoe commented 5 years ago

@mhoemmen asked:

@bartlettroscoe Is there a way to mark a PR test as "optional" like if it fails really quick can we just ignore it?

That is up to the @trilinos/framework team (which I am not a part of).

What the PR tester should do is be made smart enough to see that if someone does an AT: RETEST and there are no changes to the topic branch, then it should leave the builds that already passed alone and just rerun the builds that failed. If it were really smart and it saw that only tests failed (like in the case above), it would only rerun the tests. And if it were really, really smart, it would only rerun the tests that failed. But all of that adds complexity to the PR tester.
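
For that last case, plain CTest already provides the underlying mechanism; a minimal sketch of what a smarter retest could build on (this is not what the PR tester does today, and the build directory path is hypothetical):

```
# Sketch: CTest can rerun just the tests that failed in the previous run.
# This is only the underlying mechanism a smarter retest could build on.
cd /path/to/existing/pr/build   # hypothetical existing PR build directory
ctest --rerun-failed --output-on-failure
```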

bartlettroscoe commented 5 years ago

FYI: PR test iteration https://github.com/trilinos/Trilinos/pull/4332#issuecomment-461262726 also failed due to Intel compiler license problems. Got email that SNL's corporate license server is down, so the Intel build will fail in every PR test iteration until they get it fixed :-(

jmgate commented 5 years ago

But all of that adds complexity to the PR tester.

But it should be done.

prwolfe commented 5 years ago

I think it's time to redesign this to remove complexity. This is functionality we need, but the current design is already more complex than it should be. We have talked about this some, but I'm less sure when it will happen.

bartlettroscoe commented 5 years ago

@trilinos/framework, given that the Intel PR build takes upwards of 6 hours to run and the fact that we fairly often suffer build failures due to problems communicating with the license server, I wonder if the payoff of having the Intel build is worth the cost? It would be good if we could go back over the last 6 months of PR testing history and see how many real code or test failures the Intel PR build caught that one of the other GCC builds did not also catch. Given the way the PR builds are named, it would be hard to do this analysis (because it is hard to match up the different PR builds that are part of the same PR testing iteration).

Now, if the PR tester were more robust and only reran the builds that failed, this would be mitigated somewhat and we could tolerate these flaky builds better.

Just some thoughts ...

mhoemmen commented 5 years ago

In terms of catching build issues, the CUDA build is more valuable than the Intel build.

bartlettroscoe commented 5 years ago

FYI: The testing iteration in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-461564312 showed a git update failure for the Trilinos_pullrequest_gcc_4.8.4 build:

 > git checkout -f e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9
FATAL: Could not checkout e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9
hudson.plugins.git.GitException: Command "git checkout -f e476b66bcdd0fb2efa6018a4d5ff10f5bb02d2b9" returned status code 128:
stdout: 
stderr: fatal: unable to write new index file
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2002)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$800(CliGitAPIImpl.java:72)
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl$9.execute(CliGitAPIImpl.java:2301)

What does "unable to write new index file" mean? Does this mean that the Jenkins slave ran out of disk space? That is what is suggested by:

prwolfe commented 5 years ago

Yes - I am working on this.

bartlettroscoe commented 5 years ago

@trilinos/framework,

FYI: Looking at this query and this query, it would seem that every PR testing iteration that runs on the node tr-test-0.novalocal results in internal compiler errors when trying to build MueLu with the GCC 4.8.4 build. In the last two days, this has impacted (at a minimum) the PRs #4320, #4328, #4336, #4342, #4351, #4354, and #4355.

Can someone please remove the Jenkins slave tr-test-0.novalocal until this problem can be resolved? We have been waiting to merge #4336 and fix the SPARC 'master' + Trilinos 'develop' builds since 2/6/2019. We are getting zero testing of SPARC 'master' + Trilinos 'develop' due to the inability to merge PR #4336.

bartlettroscoe commented 5 years ago

FYI: In PR #4336, we might have a new record for the number of failed PR testing attempts: 6 tries and 6 failures (none of which relate to the actual changes on the branch). I am trying again, hoping that we might get lucky this time and not have the slave tr-test-0.novalocal being used.

bartlettroscoe commented 5 years ago

@trilinos/framework,

FYI: Note that this query shows configure, build, or test failures (timeouts) in every PR iteration run on the node tr-test-0.novalocal since 2/6/2019 (except where no packages were enabled). This shows that this node needs to be removed from PR testing.

bartlettroscoe commented 5 years ago

FYI: The PR tester ran two more times and failed two more times, as shown in https://github.com/trilinos/Trilinos/pull/4336#issuecomment-462089069 and https://github.com/trilinos/Trilinos/pull/4336#issuecomment-462108121, making it 0 for 8. I have put AT: WIP on #4336 for now since there is no sense in beating a dead horse. Once the PR system is fixed, I will turn this back on.

I am moving ATDM Trilinos testing to an 'atdm-develop-nightly' branch (see TRIL-260). One consequence of this is that I can directly merge this topic branch to 'atdm-develop-nightly' and fix the SPARC 'master' + Trilinos 'develop' builds while this PR sits in limbo. But once this PR can be merged to 'develop', then 'develop' will merge to 'atdm-develop-nightly' just fine. That is kind of a nice workflow for extreme cases like this, actually.

bartlettroscoe commented 5 years ago

FYI: It took 5 1/2 days and 10 PR testing iterations to get a simple revert PR (#4336) to merge. But now it is merged. (I had already taken steps to get the SPARC 'master' + Trilinos 'develop' testing back online.)

kddevin commented 5 years ago

@trilinos/framework @william76 @jwillenbring @ZUUL42

The PR tester is reporting g++ compiler internal errors for #4357. The PR adds code to Zoltan2, but the compiler internal errors occur when compiling MueLu files.

https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&parentid=4543744&filtercount=2&showfilters=1&field1=buildstarttime&compare1=84&value1=NOW&filtercombine=and

Is there something I can do to resolve this problem and get the PR merged? Using the instructions to reproduce PR errors, I was able to successfully build Trilinos on a linux workstation; the MueLu files reporting errors in CDASH above compiled without problem in my attempt to reproduce.

According to Google, these errors arise from insufficient memory; most recommendations are to reduce concurrency in the build.
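
If it helps anyone reproducing this locally, those recommendations amount to lowering the build (and test) concurrency so fewer cc1plus and ld processes compete for memory; the -j values below are arbitrary examples, not the PR tester's settings:

```
# Sketch: limit parallel compile/link jobs when reproducing the PR build locally;
# "internal compiler error: Killed" usually means the OOM killer hit cc1plus or ld.
make -j 4
# and, for the test step, limit parallel tests as well:
ctest -j 4
```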

jhux2 commented 5 years ago

I see an issue similar to what @kddevin reported: https://testing-vm.sandia.gov/cdash/viewBuildError.php?buildid=4553706.

Is there something unique to the test machine novalocal?

[screenshot: pr-fail]

jhux2 commented 5 years ago

The CUDA PR tester is showing errors in a ROL test that appear to be unrelated to my commits. See here. Never mind, I was looking at the wrong PR!

bartlettroscoe commented 5 years ago

@trilinos/framework,

Just saw in https://github.com/trilinos/Trilinos/pull/4629#issuecomment-476982448 what looks like a random build error for the GCC 4.8.4 build, shown here:

collect2: error: ld terminated with signal 9 [Killed]
mhoemmen commented 5 years ago

That looks like the "linking MueLu ran out of memory" situation that we've seen sometimes.

bartlettroscoe commented 5 years ago

FYI: Looks like more random crashes and internal compiler errors in https://github.com/trilinos/Trilinos/pull/4744#issuecomment-477567876. The build on CDash shown here shows build errors like:

collect2: error: ld terminated with signal 9 [Killed]

and

g++: internal compiler error: Killed (program cc1plus)

That PR #4744 is just removing some warnings for unused variables, so it is hard to see how that could cause errors like this.

bartlettroscoe commented 5 years ago

@trilinos/framework

The PR testing iteration https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477625924 just failed due to the gcc_7.2.0 build not being able to submit to CDash, even though the configure, build, and tests all passed, showing:

Starting configure step.
...
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Configure.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
configure submit error = -1
Configure suceeded.
Starting build step.
...
Build succeeded.
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Build.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
build submit error = -1
Starting testing step.
Tests succeeded.
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Test.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
test submit error = -1
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_7.2.0/pull_request_test/Testing/20190328-1105/Upload.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
File upload submit error = -1
Single configure/build/test failed. The error code was: 255
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE

I guess this means this branch is safe to merge, but now we have to set AT: RETEST to repeat all of the builds again and hope this does not occur again. If this fails again I will need to directly merge the branch in #4750 to the atdm-nightly branch in order to fix the SPARC 'master' + Trilinos 'develop' testing process.

Can the PR tester be set up so that if the configure, build, and tests pass but the upload to CDash fails, we still pass that PR build? I know that is not ideal, but considering the instability of these PR machines, perhaps it is worth considering?

bartlettroscoe commented 5 years ago

@trilinos/framework,

More CDash submit failures shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477722394, this time for the gcc_4.9.3 and gcc_7.2.0 builds. But the output is reporting that the configure, build, and tests all passed.

What is going on with all the CDash submit failures in PR builds? We don't seem to be seeing these in the ATDM Trilinos builds lately (on a very diverse set of machines).

rppawlo commented 5 years ago

More CDash submit failures shown in #4750 (comment), this time for the gcc_4.9.3 and gcc_7.2.0 builds. But the output is reporting that the configure, build, and tests all passed.

Same thing bit me yesterday too. Took 3 tries to get a PR through, and the failed CDash submits were from the same machines.

bartlettroscoe commented 5 years ago

Okay, so in the latest iteration shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477796722 the results submitted to CDash, but the gcc_7.2.0 build crashed with the random build error shown here:

g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.

and the gcc_4.8.4 build crashed with a random build error shown here:

collect2: error: ld terminated with signal 9 [Killed]

My guess is that the machines these are running on are overloaded and are running out of RAM and are crashing the compiler.

@trilinos/framework team, can the parallel build levels be made more conservative to avoid these types of build errors?

bartlettroscoe commented 5 years ago

FYI: On the 4th try, all the PR builds passed and submitted to CDash as shown in https://github.com/trilinos/Trilinos/pull/4750#issuecomment-477865898. But the merge did not happen before the 'atdm-nightly' branch got updated last night at 10 PM Mountain Time, so if I had not manually merged this branch to 'atdm-nightly', we would have had to wait an extra day to get SPARC fixed.

bartlettroscoe commented 5 years ago

@trilinos/framework, @trilinos/muelu

More random build and test failures in https://github.com/trilinos/Trilinos/pull/4772#issuecomment-478399274 and https://github.com/trilinos/Trilinos/pull/4772#issuecomment-478443579, shown on CDash here.

Looks like the gcc_4.8.4 build failed to link libmuelu.so both times with the error:

collect2: error: ld terminated with signal 9 [Killed]

And we saw a random test timeout for Ifpack2_Jac_sm_belos_MPI_1 for the gcc_7.2.0 build here.

Please help

bartlettroscoe commented 5 years ago

CC: @mhoemmen

@trilinos/framework

More CDash upload failures crashing the PR testing iteration https://github.com/trilinos/Trilinos/pull/4791#issuecomment-479273970.

For example:

Build succeeded.
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Build.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
build submit error = -1
Starting testing step.
Tests succeeded.
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Test.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
test submit error = -1
   Error when uploading file: /scratch/trilinos/workspace/trilinos-folder/Trilinos_pullrequest_gcc_4.9.3_SERIAL/pull_request_test/Testing/20190402-2140/Upload.xml
   Error message was: The requested URL returned error: 503 Service Unavailable
   Problems when submitting via HTTP
File upload submit error = -1
Single configure/build/test failed. The error code was: 255
Build step 'Execute shell' marked build as failure
Archiving artifacts
Finished: FAILURE

NOTE: The ATDM Trilinos CDash builds have not seen CDash upload failures in a long time (at least none that failed to submit to CDash). What is going on with these Trilinos PR builds having CDash upload failures?

bartlettroscoe commented 5 years ago

@trilinos/framework

More CDash submit failures in https://github.com/trilinos/Trilinos/pull/4791#issuecomment-479359968.

Help!

prwolfe commented 5 years ago

What parameters are you using to ctest_submit @bartlettroscoe? I have currently got a retry of 5 and a delay of 3. I will have an hour or so this afternoon to look at this.

Paul

bartlettroscoe commented 5 years ago

@prwolfe asked:

What parameters are you using to ctest_submit @bartlettroscoe? I have currently got a retry of 5 and a delay of 3. I will have an hour or so this afternoon to look at this.

Why are we not seeing retries in the STDOUT from ctest -S?

prwolfe commented 5 years ago

Not sure - it's in the code. Do you have an example of what that should look like?

bartlettroscoe commented 5 years ago

@trilinos/framework,

Looking at the PR builds on CDash today, shown here, it looks like something might be broken. The recent PR builds run in the last 2 hours have these names:

| Site | Build Name | Configure | Build | Test | Submitted |
|------|------------|-----------|-------|------|-----------|
| tr-test-3.novalocal | PR-Werror_Pamgen-test-Trilinos_pullrequest_gcc_7.2.0-1200 | 0 0 4m 44s | 0 0 0s | | 1 hour ago (53 labels) |
| tr-test-1.novalocal | PR-Werror_Zoltan-test-Trilinos_pullrequest_gcc_7.2.0-1199 | 0 0 4m 28s | 0 0 0s | | 1 hour ago (53 labels) |
| ascic113 | PR-Werror_Triutils-test-Trilinos_pullrequest_gcc_7.2.0-1198 | 0 2 3m 45s | 0 0 0s | | 1 hour ago (53 labels) |
| ascic115 | PR-Werror_Tpetra-test-Trilinos_pullrequest_gcc_7.2.0-1197 | 0 2 3m 52s | 0 0 0s | | 1 hour ago (53 labels) |
| ascic158 | PR-Werror_ROL-test-Trilinos_pullrequest_gcc_7.2.0-1196 | 0 2 3m 4s | 0 0 0s | | 1 hour ago (53 labels) |
| ascic158 | PR-develop-test-Trilinos_pullrequest_gcc_7.2.0-1195 | 0 3 54s | 0 0 0s | 0 0 0 0s | 1 hour ago (53 labels) |
| tr-test-2.novalocal | PR-Werror_Isorropia-test-Trilinos_pullrequest_gcc_7.2.0-1194 | 0 0 4m 20s | 1 50 1h 51m 12s | | 2 hours ago (53 labels) |
| ascic114 | PR-Werror_FEI-test-Trilinos_pullrequest_gcc_7.2.0-1193 | 0 2 3m 44s | 0 0 0s | | 2 hours ago (53 labels) |
| tr-test-4.novalocal | PR-Werror_Epetra-test-Trilinos_pullrequest_gcc_7.2.0-1192 | 0 0 2m 59s | 12 50 1h 44m 12s | 2139 0 821 2m 54s | 2 hours ago (53 labels) |
| ascic166 | PR-Werror_Belos-test-Trilinos_pullrequest_gcc_7.2.0-1191 | 0 0 2m 33s | 11 50 1h 32m 11s | 1197 0 1763 6m 18s | 2 hours ago (53 labels) |
| ascic166 | PR-Werror_AztecOO-test-Trilinos_pullrequest_gcc_7.2.0-1190 | 0 0 2m 35s | 3 50 1h 32m 19s | 1322 0 1638 5m 59s | 2 hours ago (53 labels) |
| tr-test-0.novalocal | PR-Werror_Anasazi-test-Trilinos_pullrequest_gcc_7.2.0-1189 | 0 0 2m 48s | 9 50 1h 52m 13s | | 2 hours ago (53 labels) |
| ascic158 | PR-Werror_Amesos-test-Trilinos_pullrequest_gcc_7.2.0-1188 | 0 2 2m 45s | 0 0 0s | | 2 hours ago (53 labels) |

Where did the PR number go? Is the PR system broken?

bartlettroscoe commented 5 years ago

@trilinos/framework, @trilinos/fei

The test FEI_fei_ubase_MPI_3 just randomly failed in my PR #4859, as shown in https://github.com/trilinos/Trilinos/pull/4859#issuecomment-481990770, and took out the iteration. I know this is a random failure because it passed just fine in the next PR testing iteration and I did not change anything in my PR branch, as shown here.

It looks like this is not the first time this has occurred (see https://github.com/trilinos/Trilinos/issues/1395#issuecomment-306329630). Going back over all of the PR testing history in the last 6 months, as shown in this query, we can see that this test has failed in 9 PR iterations, and the failures occurred across all of the PR builds:

| Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
|------|------------|-----------|--------|------|-----------|---------|------------|------------|
| tr-test-0.novalocal | PR-4859-test-Trilinos_pullrequest_gcc_4.8.4-3194 | FEI_fei_ubase_MPI_3 | Failed | 680ms | 2s 40ms | Completed (Failed) | 2019-04-11T01:59:26 UTC | 3 |
| tr-test-1.novalocal | PR-4649-test-Trilinos_pullrequest_gcc_7.2.0-905 | FEI_fei_ubase_MPI_3 | Failed | 530ms | 1s 590ms | Completed (Failed) | 2019-03-22T15:52:04 UTC | 3 |
| tr-test-0.novalocal | PR-4659-test-Trilinos_pullrequest_gcc_4.8.4-2897 | FEI_fei_ubase_MPI_3 | Failed | 690ms | 2s 70ms | Completed (Failed) | 2019-03-20T10:22:37 UTC | 3 |
| tr-test-0.novalocal | PR-47-test-Trilinos_pullrequest_intel_17.0.1-40 | FEI_fei_ubase_MPI_3 | Failed | 250ms | 750ms | Completed (Failed) | 2019-02-25T14:26:08 UTC | 3 |
| tr-test-0.novalocal | PR-4328-test-Trilinos_pullrequest_gcc_4.8.4-2410 | FEI_fei_ubase_MPI_3 | Failed | 1s 540ms | 4s 620ms | Completed (Failed) | 2019-02-08T18:54:50 UTC | 3 |
| tr-test-0.novalocal | PR-4338-test-Trilinos_pullrequest_gcc_4.8.4-2389 | FEI_fei_ubase_MPI_3 | Failed | 1s | 3s | Completed (Failed) | 2019-02-06T20:46:42 UTC | 3 |
| tr-test-1.novalocal | PR-7-test-Trilinos_pullrequest_intel_17.0.1-4 | FEI_fei_ubase_MPI_3 | Failed | 340ms | 1s 20ms | Completed (Failed) | 2018-12-18T15:51:20 UTC | 3 |
| tr-test-1.novalocal | PR-7-test-Trilinos_pullrequest_intel_17.0.1-3 | FEI_fei_ubase_MPI_3 | Failed | 370ms | 1s 110ms | Completed (Failed) | 2018-12-13T16:30:05 UTC | 3 |
| sisu.sandia.gov | PR-1000-test-Trilinos_pullrequest_gcc_4.9.3-9999 | FEI_fei_ubase_MPI_3 | Failed | 750ms | 2s 250ms | Completed (Failed) | 2018-11-29T20:41:06 UTC | 3 |

In all of these recent cases, the output looks like:

...
Total Time: 0.0106 sec

Summary: total = 54, run = 54, passed = 54, failed = 0

End Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name tr-test-0.novalocal and rank 1!
skipping test of fei::DirichletBCManager::finalizeBCEqn, which only runs on 1 proc.
test Eqns_unit.feiInitSlave only runs on 2 procs. returning.
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name tr-test-0.novalocal and rank 2!
skipping test of fei::DirichletBCManager::finalizeBCEqn, which only runs on 1 proc.
test Eqns_unit.feiInitSlave only runs on 2 procs. returning.
Result: TEST PASSED
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21781,1],1]
  Exit code:    1
--------------------------------------------------------------------------

I suspect that if you look at all of those PRs, they will not have anything to do with FEI.

Can we go ahead and disable this test in the PR builds? Randomly failing 9 times in the last 6 months does not sound that bad, but it adds up, and I am sure there are other random test failures bringing down PR test iterations as well. (If the PR tester would just re-run the PR builds that failed, this would be less of an issue.)
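
For example, either of the following would keep it out of a PR run; the TriBITS per-test disable option is my assumption about how the PR driver would be told to do this:

```
# Option 1 (test time): exclude the test when CTest runs.
ctest -E 'FEI_fei_ubase_MPI_3'

# Option 2 (configure time): use the TriBITS per-test disable cache option,
# assuming the PR driver can pass extra -D options through.
cmake -D FEI_fei_ubase_MPI_3_DISABLE=ON /path/to/Trilinos
```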

ZUUL42 commented 5 years ago

@bartlettroscoe, the GCC 7.2.0 builds #1188-1200 were me creating test builds for the remaining packages that have issues with -Werror set. I labeled them all with "Werror_[package name]".

bartlettroscoe commented 5 years ago

@ZUUL42 said:

@bartlettroscoe, the GCC 7.2.0 builds #1188-1200 were me creating test builds for the remaining packages that have issues with -Werror set. I labeled them all with "Werror_[package name]".

In a case like this it may be better to post these to the "Experimental" CDash Track/Group than to send them to the "Pull Request" group. That would help to avoid confusion.

ZUUL42 commented 5 years ago

In a case like this it may be better to post these to the "Experimental" CDash Track/Group than to send them to the "Pull Request" group. That would help to avoid confusion.

I can do that. I've just been leaving PULLREQUEST_CDASH_TRACK as the default. Still picking up a few details here and there.
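
For future reference, a sketch of what that would look like for one of these manual runs, assuming the driver scripts pick PULLREQUEST_CDASH_TRACK up from the environment (it may instead be a -D option; I have not checked):

```
# Sketch only: send a manual/experimental run to the "Experimental" CDash group
# instead of "Pull Request".  How PULLREQUEST_CDASH_TRACK is actually consumed
# (environment variable vs. -D option) depends on the PR driver scripts.
export PULLREQUEST_CDASH_TRACK=Experimental
```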