The CI build running on ceerws1113 associated with the checkin-test-sems.sh script is now clean, but the other CI build and all of the other SEACAS builds are still showing these failures. The @trilinos/framework team will need to decide how to address these.
Adding to the log of CI failures ...
Just noticed that the CI build was briefly broken (a Panzer build failure) for the iteration that started 2017-12-07 23:19:00 UTC, shown at:
This was due to the direct commit and push to the 'develop' branch:
Thu Dec 7 10:43:24 MST 2017
commit 60123c8fb27fefea8dca350b675162bbdab7d2c7
Author: Edward G. Phillips <egphill@sandia.gov>
AuthorDate: Wed Dec 6 15:10:45 2017 -0700
Commit: Edward G. Phillips <egphill@sandia.gov>
CommitDate: Thu Dec 7 10:42:30 2017 -0700
Panzer: Added construction of discrete gradient to mini-em
Commits pushed:
60123c8 Panzer: Added construction of discrete gradient to mini-em
This was then fixed in the very next CI iteration, which started at 2017-12-08 03:36:00 UTC:
with the push:
Thu Dec 7 14:56:06 MST 2017
commit f5d0dcae2d065e3cd32ba22b8db1170dd58a3083
Author: Edward G. Phillips <egphill@sandia.gov>
AuthorDate: Thu Dec 7 14:54:11 2017 -0700
Commit: Edward G. Phillips <egphill@sandia.gov>
CommitDate: Thu Dec 7 14:54:11 2017 -0700
Panzer: fixing last commit. Commenting out missing refMaxwell option.
Commits pushed:
f5d0dca Panzer: fixing last commit. Commenting out missing refMaxwell option.
It does not look like the initial push used the checkin-test-sems.sh script or the new automated PR testing.
FYI: There was a single build failure in the CI build this morning shown at:
which showed:
g++: internal compiler error: Killed (program cc1plus)
0x40b368 execute
../.././gcc/gcc.c:2823
0x40b6b4 do_spec_1
../.././gcc/gcc.c:4615
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
0x40c65e do_spec_1
../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
0x40c65e do_spec_1
../.././gcc/gcc.c:5269
0x40c3e8 do_spec_1
../.././gcc/gcc.c:5374
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
0x40c65e do_spec_1
../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
0x40c65e do_spec_1
../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
0x40c65e do_spec_1
../.././gcc/gcc.c:5269
0x40e1bb process_brace_body
../.././gcc/gcc.c:5872
0x40e1bb handle_braces
../.././gcc/gcc.c:5786
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
This looks like some kind of fluke. I am not sure what happened, but I don't suspect that we will see this again.
FYI: The CI build on ceerws1113 did not run correctly yesterday morning or this morning. The log file on the machine showed it hanging on the initial configure of the first package, Gtest. I ran another CMake configure on that machine and it hung right after printing:
-- Found Doxygen: /usr/bin/doxygen (found version "1.6.1")
I don't know what CMake is doing that is causing it to hang after this but I would assume it is a problem with the system. I killed the CI server both days. I will kill it again tomorrow if it does not start up correctly Monday morning.
@bartlettroscoe something very similar happened before, and it boiled down to a badly mounted root directory that CMake tried to search while looking for something else. I was able to narrow this down by running strace cmake $CONFIG_ARGS.
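For reference, here is a minimal sketch of that kind of diagnosis, assuming a scratch build directory and whatever configure arguments the build normally uses (both are placeholders here):

```bash
# Hypothetical illustration: run the hanging configure under strace to see
# which file or mount point the cmake process is blocked on. The build
# directory and $CONFIG_ARGS are placeholders for the real setup.
cd /scratch/$USER/HANG_TEST_BUILD
strace -f -o cmake-trace.log -e trace=stat,open,openat,access \
    cmake $CONFIG_ARGS ../Trilinos

# In another shell, watch the end of the trace while the configure is hung:
tail -f cmake-trace.log
# If the last call is a stat()/open() on an NFS path (e.g. a bad /ml_cap or
# root mount) that never returns, that mount is the likely culprit.
```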
Following up from my last comment, the CI build shown at:
started and ran yesterday and today. However, yesterday it only displayed 45 packages when it should have displayed 53. It is running again today but has not finished the initial CI iteration yet.
FYI: The current CI iteration at:
just showed a failing test, FEI_fei_ubase_MPI_4. This appears to be due to jumbled multi-process output and not due to any changes in the code (see #2094). This type of random failure is worth mentioning because of the PR testing process in #1155, but I suspect we will not see this error again for a while.
FYI: Just got word that the PR autotester system is down and will not likely get fixed until Monday. Therefore, people will need to hold off on pushes or use the checkin-test-sems.sh script to safely push to Trilinos until this can be fixed.
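For anyone who has not used it before, a rough sketch of that workflow follows; it assumes checkin-test-sems.sh forwards the standard checkin-test.py options (--enable-packages, --do-all, --push) and uses placeholder paths and an example package name:

```bash
# Rough sketch of pushing safely with the checkin-test workflow while the PR
# autotester is down. Paths and the package list are illustrative only.
cd /scratch/$USER/Trilinos.base/CHECKIN   # placeholder checkin build directory

# --enable-packages: only the packages you actually changed (example: Panzer)
# --do-all:          configure, build, and test the default CI builds
# --push:            push to 'develop' only if everything passes
../Trilinos/checkin-test-sems.sh \
  --enable-packages=Panzer \
  --do-all \
  --push
```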
FYI: A change to the ETI files broke the Panzer rebuild yesterday, as shown in the CI iteration:
and repeated in the next CI iteration. But as @rppawlo informed us offline yesterday, this was cleaned up automatically when the build directory was blown away with the first fresh CI iteration this morning, as shown at:
But now there is a Tempus test failure that needs to be addressed. I will address that in a separate issue.
@ccober6 and I are aware of the failing test and Curt is working on a fix.
Yeah, my fix is running on ceerws1113 right now. Hopefully it will be done in a few minutes.
Curt
The Tempus issue should now be fixed.
FYI: @rppawlo just informed me that another set of changes to the Panzer ETI files was pushed that requires deleting the build directory. We see that this has impacted the most recent CI build at:
I will restart the CI server on ceerws1113 and that should take care of the problem. The @trilinos/framework team might want to do the same thing with the CI server that they run. This might also impact the auto PR testing (but I am not sure if that is back online yet).
FYI: Looks like it happened again. Two CI builds are running on top of each other on ceerws1113 and causing random configure and build failures in the last two CI iterations with build stamps 20180114-1100-Continuous and 20180115-1100-Continuous. Two jobs are shown running at the same time on the machine:
ps -AF | grep trilinos_ci_server.out
rabartl 28494 28493 0 26527 1172 1 04:00 ? 00:00:00 /bin/sh -c cd /scratch/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
rabartl 42175 21021 0 25834 956 1 08:41 pts/4 00:00:00 grep trilinos_ci_server.out
rabartl 45160 45156 0 26527 1172 22 Jan14 ? 00:00:00 /bin/sh -c cd /scratch/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
I will kill the jobs and restart the CI server, but it is time for a solution so that this does not happen again.
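One possible guard (just a sketch, not the actual driver) would be to wrap the launch in a non-blocking flock so that a second invocation exits immediately if a previous run still holds the lock; the lock file path below is illustrative:

```bash
#!/bin/bash
# Sketch of preventing two CI server invocations from running on top of each
# other: take a non-blocking flock on a lock file and bail out if a previous
# run is still holding it. Paths are placeholders, not the real driver.
LOCKFILE=/scratch/rabartl/Trilinos.base/SEMSCIBuild/ci_server.lock

exec 9>"$LOCKFILE"
if ! flock -n 9 ; then
  echo "Another CI server run is still active; exiting." >&2
  exit 1
fi

# ... launch the existing CI driver here, e.g.:
# ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
```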
@bartlettroscoe I think this may have to do with an /ml_cap mounting issue that I observed earlier, which seems to have happened again at least yesterday on CEE machines. This causes CMake to hang indefinitely while searching for Doxygen. It may not be caused by simultaneous builds.
FYI: The CI build had been broken continuously since 2/15/2018 (the checkin-test-sems.sh script was not allowed to test downstream packages, see #2254), but I did not notice because the CDash emails were not going out (see #2255). I backed out that commit today and got one passing CI build iteration, but then another PR branch was merged that broke the CI build again (see #2264). That breakage was related to PR #2171, but we can't see what was tested and what passed. I will fix the standard CI build right now; see #2264 for more details.
FYI: Those using the checkin-test-sems.sh script to push will now be in good shape, but the automated PR testing of every branch merged against 'develop' should fail (if it involves changes to MueLu or upstream packages). See:
FYI: An untested push to Trilinos just broke the Stokhos library build (see #2315), which breaks everyone using checkin-test-sems.sh and will break all of the auto PR builds. I am running the checkin-test-sems.sh script right now to push a backout of this commit.
FYI: I pushed the revert commit for #2315 to fix the CI build. The next CI iteration should be clean of this failure.
And the next CI build was clean. All good now. See https://github.com/trilinos/Trilinos/issues/2315#issuecomment-369776161.
FYI: There was a global configure failure in the CI build that was caused by a simplification I made in commit dd67e68340b725584ed927c0add529997515ff98 (see #2378). I forgot to update the ctest -S driver script to explicitly pass in SEMSDevEnv.cmake. I have fixed this and pushed it and am testing locally now. Once that local testing is complete, I will restart the CI server.
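For context, a local configure can pull in SEMSDevEnv.cmake the same way the CI builds are supposed to, via the TriBITS configure-options-file mechanism. This is only an illustration; the relative path to SEMSDevEnv.cmake and the build directory are assumptions:

```bash
# Illustration only: a local configure that picks up SEMSDevEnv.cmake through
# the TriBITS <Project>_CONFIGURE_OPTIONS_FILE cache option. The relative path
# to SEMSDevEnv.cmake is an assumption; check the Trilinos source tree for its
# actual location.
cd /scratch/$USER/TRILINOS_CI_BUILD   # placeholder build directory
cmake \
  -D Trilinos_CONFIGURE_OPTIONS_FILE:FILEPATH=cmake/std/sems/SEMSDevEnv.cmake \
  -D Trilinos_ENABLE_TESTS:BOOL=ON \
  ../Trilinos
```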
As part of this I might consider moving this CI server to a jenkins-srn machine. But I worry that might destabilize the CI server, given the overloading problems we have seen on the Jenkins machines, so that might not be a good idea. I am not sure we can trust a CI server to these Jenkins build farms the way they are currently set up and being used.
FYI: I sent out the following email about the failures due to this transition in #2378.
From: Bartlett, Roscoe A
Sent: Wednesday, March 14, 2018 8:56 AM
To: trilinos-developers@trilinos.org
Subject: Ignore CDash failures from two builds this morning
Hello Trilinos Developers,
Please disregard the CDash configure failure emails coming from the builds Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_CI and Trilinos-atdm-sems-gcc-7-2-0. This was from a global configure problem for these two builds that caused every package configure to fail.
The problem should be solved now.
-Ross
FYI: The problems have been resolved and the CI build is running again as shown at:
Sorry for everyone getting spammed by CDash this morning. Hopefully this will not lead to people putting in email filters for CDash error emails :-(
FYI: The CI build was fixed yesterday as noted above, but the first CI iteration today showed the hanging test Zoltan2_OrderingScotch_MPI_4. This test is being disabled for this CI build and should not be a problem in the future (see #2397 and #2131). But this shows that the auto PR build needs to enable Scotch as well (see #2065).
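For reference, two illustrative ways to keep a hanging test like this out of a run (the exact mechanism used in the real CI driver may differ):

```bash
# (a) Exclude the test by regex when invoking ctest directly in a build
#     directory (standard ctest option):
ctest -j16 -E 'Zoltan2_OrderingScotch_MPI_4'

# (b) Disable it at configure time; TriBITS supports per-test disable cache
#     variables of this form (treat the exact spelling as an assumption):
cmake -D Zoltan2_OrderingScotch_MPI_4_DISABLE:BOOL=ON ../Trilinos
```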
FYI: The CI build was clean in the next CI iteration as shown at:
The new and continuously improving auto PR testing process is doing a good basic job of stabilizing the 'develop' branch of Trilinos (as well as or better than the usage of the checkin-test-sems.sh script). It can be improved (by adding a CUDA build and tweaking some other builds as described in #2317), but I think this is pretty good.
We don't need this issue anymore.
@trilinos/framework team, thanks for getting that auto PR testing process set up and continuously improving it!
Closing this issue as complete.
Related to: #1362
Next Action Status:
The auto PR testing process (#1155) is deployed and is working fairly well to stabilize 'develop' (at least as well as or better than the checkin-test-sems.sh script did). Further improvements will be worked in other issues.
Description:
This story is to discuss and decide how to address stability problems of the Trilinos 'develop' branch in the short term. I know there is a long-term plan to use a PR model (see #1155) but since there are no updates or ETA on that, we need to address stability issues faster than that.
There have recently been a good number of stability problems with the Trilinos 'develop' branch, even with the basic CI build linked to from:
and the "Clean" builds shown here:
The "Clean" builds have never been clean in the entire history of the track.
Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and they are still broken as I type this).
We need a strategy to improve stability right now. I have been helping people set up to use the checkin-test-sems.sh script to test and push their changes. I would estimate that a large percentage of the failures (and 100% of the CI failures) seen on CDash would be avoided by usage of the checkin-test-sems.sh script.
CC: @trilinos/framework