trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Address basic stability of Trilinos 'develop' branch short-term #1304

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 7 years ago

Related to: #1362

Next Action Status:

The auto PR testing process (#1155) is deployed and is working fairly well to stabilize 'develop' (at least as well as or better than the checkin-test-sems.sh script did). Further improvements will be worked in other issues.

Description:

This story is to discuss and decide how to address stability problems of the Trilinos 'develop' branch in the short term. I know there is a long-term plan to use a PR model (see #1155) but since there are no updates or ETA on that, we need to address stability issues faster than that.

There have recently been a good number of stability problems with the Trilinos 'develop' branch, even in the basic CI build linked to from:

and the "Clean" builds shown here:

The "Clean" builds have never been clean in the entire history of the track.

Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and they are still broken as I type this).

We need a strategy to improve stability right now. I have been helping people set up to use the checkin-test-sems.sh script to test and push their changes. I would estimate that a large percentage of the failures (and 100% of the CI failures) seen on CDash would be avoided by consistent use of the checkin-test-sems.sh script.
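
For reference, a typical invocation of that workflow looks roughly like the sketch below (the TRILINOS_SRC variable and the exact options are illustrative, not the official documentation; see the Trilinos developer docs for the real setup):

# Run from a dedicated build/test directory; TRILINOS_SRC is a hypothetical
# variable pointing at the Trilinos source tree.
TRILINOS_SRC=/path/to/Trilinos

# --do-all roughly means: pull, configure, build, and test the packages
# affected by the local commits; --push only pushes if everything passes.
$TRILINOS_SRC/checkin-test-sems.sh --do-all --push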

CC: @trilinos/framework

bartlettroscoe commented 6 years ago

The CI build running on ceerws1113 associated with the checkin-test-sems.sh script is now clean, but the other CI build and all of the other SEACAS builds are still showing these failures. The @trilinos/framework team will need to decide how to address these.

bartlettroscoe commented 6 years ago

Adding to the log of CI failures ...

Just noticed that the CI build was briefly broken (a Panzer build failure) for the iteration that started 2017-12-07 23:19:00 UTC, shown at:

This was due to the direct commit and push to the develop branch:

Thu Dec  7 10:43:24 MST 2017

commit 60123c8fb27fefea8dca350b675162bbdab7d2c7
Author:     Edward G. Phillips <egphill@sandia.gov>
AuthorDate: Wed Dec 6 15:10:45 2017 -0700
Commit:     Edward G. Phillips <egphill@sandia.gov>
CommitDate: Thu Dec 7 10:42:30 2017 -0700

    Panzer: Added construction of discrete gradient to mini-em

Commits pushed:
60123c8 Panzer: Added construction of discrete gradient to mini-em

This was then fixed in the very next CI iteration, started at 2017-12-08 03:36:00 UTC:

with the push:

Thu Dec  7 14:56:06 MST 2017

commit f5d0dcae2d065e3cd32ba22b8db1170dd58a3083
Author:     Edward G. Phillips <egphill@sandia.gov>
AuthorDate: Thu Dec 7 14:54:11 2017 -0700
Commit:     Edward G. Phillips <egphill@sandia.gov>
CommitDate: Thu Dec 7 14:54:11 2017 -0700

    Panzer: fixing last commit. Commenting out missing refMaxwell option.

Commits pushed:
f5d0dca Panzer: fixing last commit. Commenting out missing refMaxwell option.

It does not look like the initial push used the checkin-test-sems.sh script or the new automated PR testing.

bartlettroscoe commented 6 years ago

FYI: There was a single build failure in the CI build this morning shown at:

which showed:

g++: internal compiler error: Killed (program cc1plus)
0x40b368 execute
    ../.././gcc/gcc.c:2823
0x40b6b4 do_spec_1
    ../.././gcc/gcc.c:4615
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
0x40c65e do_spec_1
    ../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
0x40c65e do_spec_1
    ../.././gcc/gcc.c:5269
0x40c3e8 do_spec_1
    ../.././gcc/gcc.c:5374
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
0x40c65e do_spec_1
    ../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
0x40c65e do_spec_1
    ../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
0x40c65e do_spec_1
    ../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
    ../.././gcc/gcc.c:5872
0x40e1bb handle_braces
    ../.././gcc/gcc.c:5786
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.

This looks to be some type of fluke. I am not sure what happened, but I don't expect that we will see this again.
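
For the record, "internal compiler error: Killed (program cc1plus)" usually means the compiler process was killed externally, most often by the kernel OOM killer when too many parallel compile jobs exhaust memory. If it recurs, a quick check on the build machine would be something like (illustrative):

# Check whether the kernel OOM killer terminated the compiler:
dmesg | grep -i -E 'out of memory|killed process'

If the OOM killer shows up there, lowering the build parallelism (e.g. a smaller make -j value) is the usual workaround.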

bartlettroscoe commented 6 years ago

FYI: The CI build on ceerws1113 did not run correctly yesterday morning or this morning. The log file on the machine showed it hanging on the initial configure of the first package, Gtest. I ran another CMake configure on that machine and it hangs right after printing:

-- Found Doxygen: /usr/bin/doxygen (found version "1.6.1") 

I don't know what CMake is doing that causes it to hang after this, but I assume it is a problem with the system. I killed the CI server both days. I will kill it again tomorrow if it does not start up correctly Monday morning.

ibaned commented 6 years ago

@bartlettroscoe something very similar happened before, and it boiled down to a badly mounted root directory that CMake tried to search while looking for something else. I was able to narrow this down by running strace cmake $CONFIG_ARGS.
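
For reference, a minimal version of that diagnosis (with $CONFIG_ARGS standing in for the usual configure arguments) might look like:

# Trace the hanging configure to a log file and watch where it gets stuck:
strace -f -o cmake-strace.log cmake $CONFIG_ARGS &
tail -f cmake-strace.log

A hang on a bad NFS/automount typically shows the last syscall blocked on a path under the broken mount point (e.g. a stat() or open() that never returns).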

bartlettroscoe commented 6 years ago

Following up from my last comment, the CI build shown at:

started and ran yesterday and today. However, yesterday it only displayed 45 packages when it should have displayed 53 packages. It is running again today but has not finished the initial CI iteration yet.

bartlettroscoe commented 6 years ago

FYI: The current CI iteration at:

just showed a failing test FEI_fei_ubase_MPI_4. This appears to be due to jumbled multi-process output and not due to any changes in the code (see #2094). This type of random failure is worth mentioning because of the PR testing process in #1155, but I suspect we will not see this error again for a while.

bartlettroscoe commented 6 years ago

FYI: Just got word that the PR autotester system is down and will likely not get fixed until Monday. Therefore, people will need to hold off on pushes, or use the checkin-test-sems.sh script to push to Trilinos safely, until this is fixed.

bartlettroscoe commented 6 years ago

FYI: A change to the ETI files broke the Panzer rebuild yesterday, as shown in the CI iteration:

and repeated in the next CI iteration. But as @rppawlo informed us offline yesterday, this was cleaned up automatically when the build directory was blown away with the first fresh CI iteration this morning, as shown at:

But now there is a Tempus test failure that needs to be addressed. I will address that in a separate issue.

ikalash commented 6 years ago

@ccober6 and I are aware of the failing test and Curt is working on a fix.

ccober6 commented 6 years ago

Yeah, my fix is running on ceerws1113 right now. Hopefully it will be done in a few minutes.

Curt

ccober6 commented 6 years ago

The Tempus issue should now be fixed.

bartlettroscoe commented 6 years ago

FYI: @rppawlo just informed me that another set of changes to the Panzer ETI files was pushed that requires deleting the build directory. We see that this has impacted the most recent CI build at:

I will restart the CI server on ceerws1113 and that should take care of the problem. The @trilinos/framework team might want to do the same thing with the CI server that they run. This might also impact the auto PR testing (but I am not sure if that is back online yet).
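
"Doing the same thing" here amounts to wiping the stale build tree so the next CI iteration starts from a fresh configure; a rough sketch (the build-directory path is hypothetical):

# Remove the stale build tree and let the restarted CI server repopulate it:
BUILD_DIR=/path/to/SEMSCIBuild/BUILD   # hypothetical path
rm -rf "$BUILD_DIR" && mkdir -p "$BUILD_DIR"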

bartlettroscoe commented 6 years ago

FYI: Looks like it happened again. Two CI builds are running on top of each other on ceerws1113 and causing random configure and build failures in the last two CI iterations with build stamps 20180114-1100-Continuous and 20180115-1100-Continuous. Two jobs are shown running at the same time on the machine:

ps -AF | grep trilinos_ci_server.out
rabartl  28494 28493  0 26527  1172   1 04:00 ?        00:00:00 /bin/sh -c cd /scratch/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
rabartl  42175 21021  0 25834   956   1 08:41 pts/4    00:00:00 grep trilinos_ci_server.out
rabartl  45160 45156  0 26527  1172  22 Jan14 ?        00:00:00 /bin/sh -c cd /scratch/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out

I will kill the jobs and restart the CI server, but it is time for a solution so that this does not happen again.
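
One possible fix is to have the cron wrapper serialize runs with a lock file, so a second instance simply refuses to start; a sketch using flock(1) follows (the lock-file path is hypothetical, and the driver path is taken from the ps output above):

#!/bin/sh
# Hypothetical guard so two CI server instances cannot run on top of each other.
LOCKFILE=/scratch/rabartl/Trilinos.base/SEMSCIBuild/ci_server.lock
if ! flock --nonblock "$LOCKFILE" \
     ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh
then
  echo "CI server lock not acquired (or driver failed); is another instance still running?"
fi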

ibaned commented 6 years ago

@bartlettroscoe I think this may have to do with an /ml_cap mounting issue that I observed earlier, which seems to have happened again at least yesterday on CEE machines. This causes CMake to hang indefinitely while searching for Doxygen, so it may not be caused by simultaneous builds.

bartlettroscoe commented 6 years ago

FYI: The CI build had been broken continuously since 2/15/2018 (the checkin-test-sems.sh script was not allowed to test downstream packages, see #2254), but I did not notice because the CDash emails were not going out (see #2255). I backed out that commit today and got one passing CI iteration, but then another PR branch was merged that broke the CI build again (see #2264). That breakage was related to PR #2171, but we can't see what was tested and what passed. I will fix the standard CI build right now; see #2264 for more details.

bartlettroscoe commented 6 years ago

FYI: Those using the checkin-test-sems.sh script to push will now be in good shape, but the automated PR testing of any branch merged against 'develop' should fail (if it involves changes to MueLu or upstream packages). See:

bartlettroscoe commented 6 years ago

FYI: An untested push to Trilinos just broke the Stokhos library build (see #2315) and therefore breaks everyone using checkin-test-sems.sh and will break all of the auto PR builds. I am running the checkin-test-sems.sh script right now to push a backout of this commit.

bartlettroscoe commented 6 years ago

FYI: I pushed the revert commit for #2315 to fix the CI build. The next CI iteration should be clean of this failure.

bartlettroscoe commented 6 years ago

And the next CI build was clean. All good now. See https://github.com/trilinos/Trilinos/issues/2315#issuecomment-369776161.

bartlettroscoe commented 6 years ago

FYI: There was a global configure failure in the CI build that was caused by a simplification I made in commit dd67e68340b725584ed927c0add529997515ff98 (see #2378). I forgot to update the ctest -S driver script to explicitly pass in SEMSDevEnv.cmake. I have fixed this and pushed it and am testing locally now. Once that local testing is complete, I will restart the CI server.

As part of this, I might consider moving this CI server to a jenkins-srn machine. But I worry that might destabilize the CI server, given the overloading problems we have seen on the Jenkins machines, so that might not be a good idea. I am not sure we can trust a CI server to these Jenkins build farms the way they are currently set up and being used.

bartlettroscoe commented 6 years ago

FYI: I sent out the following email about the failures due to this transition in #2378.


From: Bartlett, Roscoe A
Sent: Wednesday, March 14, 2018 8:56 AM
To: trilinos-developers@trilinos.org
Subject: Ignore CDash failures from two builds this morning

Hello Trilinos Developers,

Please disregard the CDash configure failure emails coming from the builds Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_CI and Trilinos-atdm-sems-gcc-7-2-0. This was from a global configure problem for these two builds that caused every package configure to fail.

The problem should be solved now.

-Ross

bartlettroscoe commented 6 years ago

FYI: The problems have been resolved and the CI build is running again as shown at:

Sorry for everyone getting spammed by CDash this morning. Hopefully this will not lead to people putting in email filters for CDash error emails :-(

bartlettroscoe commented 6 years ago

FYI: The CI build was fixed yesterday as noted above, but the first CI iteration today showed the hanging test Zoltan2_OrderingScotch_MPI_4. This test is being disabled for this CI build and should not be a problem in the future (see #2397, #2131). But this shows that the auto PR build needs to enable Scotch as well (see #2065).
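
For reference, a generic way to keep a known-hanging test out of a CTest-driven build while it is being fixed is to exclude it by name; the per-test disable shown second is an assumption about the TriBITS mechanism, not necessarily what this CI driver does:

# Exclude the test by regex when invoking ctest directly:
ctest -E Zoltan2_OrderingScotch_MPI_4

# Or disable just that test at configure time (assuming the standard TriBITS
# <fullTestName>_DISABLE cache variable):
cmake -D Zoltan2_OrderingScotch_MPI_4_DISABLE=ON ...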

bartlettroscoe commented 6 years ago

FYI: The CI build was clean in the next CI iteration as shown at:

bartlettroscoe commented 6 years ago

The new and improving auto PR testing process is doing a good basic job of stabilizing the 'develop' branch of Trilinos (as well as or better than usage of the checkin-test-sems.sh script did). It can be improved (by adding a CUDA build and tweaking some other builds as described in #2317), but I think this is pretty good.

We don't need this issue anymore.

@trilinos/framework team, thanks for getting that auto PR testing process set up and continuously improving it!

Closing this issue as complete.