We are still seeing the ROL examples crashing the build for the CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED
run on sadl30906.srn.sandia.gov
shown as recently as this morning at:
Therefore, it is time to disable ROL in this CI build. But we never see this build failure in the CI build Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_CI
run on ceerws1113
so we will not be losing any testing by making this change.
Wow, I thought that the build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED
could not really be using GCC 4.7.2; I thought that must be a misprint. But looking at the configure output at:
we see that it really is using GCC 4.7.2 as shown at:
-- CMAKE_C_COMPILER_ID='GNU'
-- CMAKE_C_COMPILER_VERSION='4.7.2'
-- CMAKE_CXX_COMPILER_ID='GNU'
-- CMAKE_CXX_COMPILER_VERSION='4.7.2'
This build should likely be turned off until it can be upgraded to GCC 4.8.4. We are no longer supporting C++11 for GCC versions less than 4.8.4 (see #1453).
Therefore, I will disable this build until the @trilinos/framework team can upgrade this build.
@trilinos/amesos2 If somebody familiar with Amesos2 could review and merge PR #1532, that would knock 3 of the 5 remaining test failures off the dashboard.
There is a new failing ROL CI test ROL_test_sol_checkAlmostSureConstraint_MPI_1
that was pushed last night shown here:
This failure was triggered by one of the commits:
5140740: Teuchos: raise Parser sub-package to PS status
Author: Dan Ibanez <daibane@sandia.gov>
Date: Thu Aug 10 09:19:26 2017 -0600
M packages/teuchos/cmake/Dependencies.cmake
3bcfc73: Merge remote branch 'intermediate-repo/develop' into develop
Author: Irina K. Tezaur <ikalash@sandia.gov>
Date: Thu Aug 10 12:35:18 2017 -0600
25e539c: Piro: adding ALBANY_BUILD ifdef logic to Piro::TempusSolver to get the right template arguments when constructing a Piro::TempusSolver object in Albany.
Author: Irina K. Tezaur <ikalash@sandia.gov>
Date: Thu Aug 10 11:34:08 2017 -0700
M packages/piro/src/Piro_TempusSolver.hpp
shown at:
It looks like this is a consequence of the failure described in #1596.
@ibaned, even though enabling TeuchosParser should not have caused this failure, it may have moved things around in memory, which could have caused this ROL test to start showing erratic behavior.
I am disabling that failing test ASAP so that this does not trip up anyone else.
Wow, my push shown below just this morning dodged the failing ROL test because ROL has no dependency on STK. Others will not be so lucky. I am in the process of running the checkin-test-sems.sh script to disable this failing ROL test for the CI build (but no other builds).
DID PUSH: Trilinos: crf450.srn.sandia.gov
Fri Aug 11 08:55:12 MDT 2017
Enabled Packages: STKUtil
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Forward Packages
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=162,notpassed=0 (41.88 min)
*** Commits for repo :
96ab45d stk_util: Add optional compilation of this file
cdc7a62 Make STKUtil dependence on SEACASAprepro_lib optional (stk-17354)
0) MPI_RELEASE_DEBUG_SHARED_PT Results:
---------------------------------------
passed: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=162,notpassed=0
Fri Aug 11 08:55:06 MDT 2017
Enabled Packages: STKUtil
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT
CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake -DTrilinos_ENABLE_STKUtil:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16
Pull: Passed (0.00 min)
Configure: Passed (0.63 min)
Build: Passed (40.21 min)
Test: Passed (1.04 min)
100% tests passed, 0 tests failed out of 162
Label Time Summary:
Panzer = 271.30 sec (132 tests)
STK = 24.75 sec (11 tests)
TrilinosCouplings = 41.60 sec (19 tests)
Total Test time (real) = 62.23 sec
Total time for MPI_RELEASE_DEBUG_SHARED_PT = 41.88 min
I just pushed a commit to disable this failing ROL test in the CI build (see https://github.com/trilinos/Trilinos/issues/1596#issuecomment-321843345). However, it will not be in time to help @ikalash, who looks to be running the checkin-test-sems.sh script right now on ceerws1113 trying to push. Her push will likely be blocked due to this failing test. But all she needs to do is run it again and it will pass, since I have pushed the commit to disable it (and I sent her an email stating that).
NOTE: I am going on vacation for the next two weeks after today and will not be back till Monday 8/28. While I am gone, can someone on the @trilinos/framework team keep an eye on this CI build and resolve issues like this (and restore 100% passing ASAP by disabling tests or backing out commits when needed)? If we get lucky, no problems will pop up while I am gone. But we barely made it two weeks since the last failure (see above). If you don't keep the CI build 100% clean at all times, things break down very quickly.
Please let's get the automated PR testing and merging system stood up (#1155)!
The CI build was clean again as of yesterday's first CI iteration:
However, it just got broken again. I will comment on that in the next comment.
The CI build was passing for all of 6 hours before it was broken again with an Intrepid2 test build failure (see #1600) shown at:
While this test build failure in Intrepid2 persists, anyone trying to use the checkin-test-sems.sh script to push changes to the following upstream packages from Intrepid2 will have their pushes stopped:
I will work to externally disable just that one test build so that it will not trip up anyone until it can be fixed.
I surgically disabled just that one failing Intrepid2 test as described at https://github.com/trilinos/Trilinos/issues/1600#issuecomment-321990301. The next CI iteration should be clean.
I also provided full instructions on how to revert the disable, fix the failure, and push using checkin-test-sems.sh to avoid another breakage of the CI build.
BTW, as discussed above, the problematic CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED
has been disabled as of 8/10 as shown at:
Now just the one single CI build shown at:
is running. Now we just need to keep it clean.
The CI build for Intrepid2 is clean again as shown at:
You can see that Intrepid2 test being disabled at:
which shows:
-- Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix EXE NOT being built due to Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_EXE_DISABLE='ON'
-- Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1: NOT added test because Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1_DISABLE='ON'!
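For reference, the two disables shown in that configure output are just CMake cache variables, so an external disable like this can be done from the configure command line without touching any CMakeLists.txt files. A minimal sketch (the -D variable names are the ones from the output above; the source-tree path and other options are placeholders):
$ cmake \
    -DIntrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_EXE_DISABLE=ON \
    -DIntrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1_DISABLE=ON \
    [other configure options] \
    /path/to/Trilinos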
And you can see a "-1" for the number of "Not Run" tests for Intrepid2 shown at:
Merged @brian-kelley's PR #1532 regarding the Amesos2 failures. This is also tracked in #1495.
@bartlettroscoe is it possible to have the checkin script set $OMP_PROC_BIND=false
when running the tests?
It took me a couple of tries to realize that having $OMP_PROC_BIND=true
was the reason most of the tests were timing out when run with MPI.
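For what it's worth, a minimal sketch of the workaround being described, assuming one sets the environment variable in the shell before invoking the script (whether checkin-test-sems.sh should set this itself is the open question here):
# Turn off OpenMP thread binding so MPI test ranks are not all pinned to
# the same core (which can make tests time out), then run as usual.
$ export OMP_PROC_BIND=false
$ ./checkin-test-sems.sh --local-do-all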
Getting back from a 2-week vacation, I am looking at the results for the CI build while I was gone at:
The good news is that the CI build appears to have been completely clean since later in the day last Friday, 8/25/2017.
The bad news is that it looks like the CI build was broken at least 4 separate times, and the last breakage lasted from 8/22/2017 to 8/25/2017. Anyone trying to use the checkin-test-sems.sh script to push to an upstream package during that time would have had their pushes stopped (I will see if there is any evidence for that).
Since CDash only records a 6 week moving window, I will document each of the failures in comments here for archival purposes and for later analysis. I will write one comment for each failure (4 comments total).
@bartlettroscoe for the record, the Kokkos 2.04.00 snapshot commit (6811bb33bdcb1633c2b9f7cb62e94a43ef057f6c), pushed using checkin-test-sems,
was blocked by the test failure documented in #1615, which was caused by the STK snapshot that arrived about a day earlier.
The first failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:
Wed Aug 16 12:37:48 MDT 2017
commit e15b6358a6aa24d080ef9840816d1c3c47df5fd8
Author: Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Tue Aug 15 15:28:37 2017 -0600
Commit: Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 12:37:02 2017 -0600
Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
From repository at sierra-git.sandia.gov:/git/sierra.base.git
At commit:
commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
Author: Greg Sjaardema <gdsjaar@sandia.gov>
Date: Mon Aug 14 10:35:14 2017 -0600
APREPRO: Fix so will compile with intel-14
Commits pushed:
e15b635 Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
0125af1 Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d
and persisted for a little over 7 hours from "Wed Aug 16 12:37:48 MDT 2017" till "Wed Aug 16 17:55:21 MDT 2017". That push did not appear to use the checkin-test-sems.sh script.
As noted above, this breakage stopped at least one push to Trilinos. It is unknown if this stopped any other pushes (we would have to ask since there is no archiving of failed invocations of checkin-test-sems.sh).
The following pushes occurred during that period:
commit 3ddc1f116745766ec4c6a138e0e269c1fc863ac0
Merge: e15b635 9407f03
Author: Irina K. Tezaur <ikalash@sandia.gov>
AuthorDate: Wed Aug 16 14:55:18 2017 -0600
Commit: Irina K. Tezaur <ikalash@sandia.gov>
CommitDate: Wed Aug 16 15:36:51 2017 -0600
Merge remote branch 'intermediate-repo/develop' into develop
Build/Test Cases Summary
Enabled Packages: Piro
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=147,notpassed=0 (41.40 min)
Other local commits for this build/test group: 9407f03
Commits pushed:
3ddc1f1 Merge remote branch 'intermediate-repo/develop' into develop
9407f03 Piro: adding setObserver() method to TempusSolver class.
commit 0bf149383272f8c5562c8b97736b35a02d93990d
Author: Jonathan Hu <jhu@sandia.gov>
AuthorDate: Wed Aug 16 13:58:43 2017 -0700
Commit: Jonathan Hu <jhu@sandia.gov>
CommitDate: Wed Aug 16 15:57:14 2017 -0700
MueLu: rebase interface tests
Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=148,notpassed=0 (117.23 min)
Other local commits for this build/test group: ad8a580, ae678d9, 2d730ee
Commits pushed:
0bf1493 MueLu: rebase interface tests
ad8a580 MueLu: remove option "rowWeight"
ae678d9 MueLu: avoid partition assignment step if possible
2d730ee MueLu: fix typo in message
Luckily, MueLu and Piro don't have STK as a downstream dependency, so these usages of the checkin-test-sems.sh script were not blocked. But unfortunately, STK is downstream from Kokkos, which is what blocked the Kokkos push noted above.
Also, if anyone had tried to push to Trilinos from any of the following packages (which are all upstream dependencies of STK):
they would also have had their pushes stopped due to this failure.
DETAILS:
The first failure that occurred over the period 8/13/2017 - 8/28/2017 was first shown in the CI iteration that started "Aug 16, 2017 - 18:38 UTC":
which was the addition of a new failing test STKUnit_tests_stk_tools_unit_tests_MPI_4. The Updates.txt notes file for that CI iteration shown at:
shows the two commits:
e15b635: Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date: Tue Aug 15 15:28:37 2017 -0600
...
0125af1: Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date: Tue Aug 15 15:28:33 2017 -0600
...
That corresponds to the recorded push:
Wed Aug 16 12:37:48 MDT 2017
commit e15b6358a6aa24d080ef9840816d1c3c47df5fd8
Author: Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Tue Aug 15 15:28:37 2017 -0600
Commit: Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 12:37:02 2017 -0600
Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
From repository at sierra-git.sandia.gov:/git/sierra.base.git
At commit:
commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
Author: Greg Sjaardema <gdsjaar@sandia.gov>
Date: Mon Aug 14 10:35:14 2017 -0600
APREPRO: Fix so will compile with intel-14
Commits pushed:
e15b635 Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
0125af1 Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d
These commits and this push are part of the official integration process for changes to STK and SEACAS in SIERRA back into Trilinos (one commit for STK and one commit for SEACAS). The testing process for those commits does not use the checkin-test-sems.sh script, which allowed this broken test to get pushed. This test was disabled in a follow-up push, and the first fixed CI iteration, started at "Aug 17, 2017 - 10:00 UTC", was
which involved the commits pulled in:
which included the commit:
ddb3fe7: STK: Disable stk_tools test due to continuous failure.
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date: Wed Aug 16 17:50:59 2017 -0600
M packages/stk/stk_unit_tests/stk_tools/CMakeLists.txt
That commit was pushed as part of the push:
Wed Aug 16 17:55:21 MDT 2017
commit ddb3fe783a7d6aa8390429dae4c974e8f847079a
Author: Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Wed Aug 16 17:50:59 2017 -0600
Commit: Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 17:53:01 2017 -0600
STK: Disable stk_tools test due to continuous failure.
Issue was reported in #1615. I need to speak with the STK team
to figure out the right fix, but for now disabling.
Commits pushed:
ddb3fe7 STK: Disable stk_tools test due to continuous failure.
This means the CI build was broken for over 7 hours.
This issue was fixed back in the native SIERRA sources and was then snapshotted back to Trilinos, and the test STKUnit_tests_stk_tools_unit_tests_MPI_4 reappeared in the CI iteration started at "Aug 18, 2017 - 17:59 UTC":
The second failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:
Thu Aug 17 11:03:52 MDT 2017
commit 79def1e59538a35535afb1fb6e43bebf7d105805
Author: Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 11:01:08 2017 -0600
Commit: Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 11:03:17 2017 -0600
Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
Commits pushed:
79def1e Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
36340b7 MueLu: Hopefully fixing outstanding MueLu error on Geminga
It looks like that push did not use the checkin-test-sems.sh script, which allowed the error to get pushed.
This resulted in the CI build of Trilinos being broken continuously for only about an hour from "Thu Aug 17 11:03:52 MDT 2017" to "Thu Aug 17 12:06:46 MDT 2017". Therefore, it seems unlikely that anyone's pushes would have been stopped due to this. And only people pushing to packages upstream from Xpetra:
would have had their pushes stopped. So it is unlikely that anyone was inconvenienced by this bad push.
DETAILS:
The second failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 17, 2017 - 17:05 UTC":
and showed the failing test Xpetra_BlockedCrsMatrix_UnitTests_MPI_4. The commits pulled in this CI iteration are shown at:
which shows the commits:
930f58a: Ctest: More dorksaber warning cleanup
Author: Chris Siefert <csiefer@sandia.gov>
Date: Thu Aug 17 11:04:35 2017 -0600
M cmake/ctest/drivers/dorksaber/TrilinosCTestDriverCore.dorksaber.gcc.cmake
79def1e: Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
Author: Chris Siefert <csiefer@sandia.gov>
Date: Thu Aug 17 11:01:08 2017 -0600
M cmake/ctest/drivers/dorksaber/ctest_linux_nightly_mpi_release_tpetrakernels_experimental_dorksaber.cmake
M cmake/ctest/drivers/dorksaber/ctest_linux_nightly_serial_release_muelu_matlab_dorksaber.cmake
36340b7: MueLu: Hopefully fixing outstanding MueLu error on Geminga
Author: Chris Siefert <csiefer@sandia.gov>
Date: Thu Jul 6 14:27:05 2017 -0600
M packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp
Therefore, this failure likely corresponds to the push:
Thu Aug 17 11:03:52 MDT 2017
commit 79def1e59538a35535afb1fb6e43bebf7d105805
Author: Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 11:01:08 2017 -0600
Commit: Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 11:03:17 2017 -0600
Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
Commits pushed:
79def1e Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
36340b7 MueLu: Hopefully fixing outstanding MueLu error on Geminga
As shown in the commit log, it does not look like the checkin-test-sems.sh script was used to test and push this (which explains how this error was able to get pushed).
This was fixed pretty quickly in the very next CI iteration started at "Aug 17, 2017 - 19:49 UTC":
with that test going from failing to passing. The Updates.txt file for that CI iteration shown at:
shows the commits:
4d0b31: MueLu: clean up Aria driver
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date: Tue Aug 15 15:54:18 2017 -0600
M packages/muelu/research/tawiesn/aria/Driver.cpp
5f46414: MueLu: remove FacadeClassFactory from Crada driver routine
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date: Tue Aug 15 14:27:00 2017 -0600
M packages/muelu/research/tawiesn/crada/Driver.cpp
65b1fdc: Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"
Author: Chris Siefert <csiefer@sandia.gov>
Date: Thu Aug 17 12:05:41 2017 -0600
M packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp
So the fixing commit was likely part of the push:
Thu Aug 17 12:06:46 MDT 2017
commit 65b1fdc934709b7cbce9326904fa9652d0492eb7
Author: Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 12:05:41 2017 -0600
Commit: Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 12:05:41 2017 -0600
Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"
This reverts commit 36340b745843e3ebc801567fa086c6fb86b48c1f.
Commits pushed:
65b1fdc Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"
So this was fixed quickly by just reverting the commit.
The third CI failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:
Fri Aug 18 15:51:37 MDT 2017
commit ec85c46917ac532a676f68ce2a27b305fbfbb4f9
Merge: 5eeea40 8d29d08
Author: Mehmet Deveci <mndevec@sandia.gov>
AuthorDate: Fri Aug 18 15:51:14 2017 -0600
Commit: Mehmet Deveci <mndevec@sandia.gov>
CommitDate: Fri Aug 18 15:51:14 2017 -0600
Merge branch 'develop' of github.com:trilinos/Trilinos into develop
Commits pushed:
ec85c46 Merge branch 'develop' of github.com:trilinos/Trilinos into develop
5eeea40 Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622
which broke the two tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetra_MPI_4.
It looks like that push did not use the checkin-test-sems.sh script, which allowed the error to get pushed.
The failing tests caused by this push were later fixed by a different developer in the push:
Sat Aug 19 10:59:57 MDT 2017
commit 0ffff8da7fc16b6aa231052efee836c065a23421
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
AuthorDate: Thu Aug 17 13:14:54 2017 -0400
Commit: Andrey Prokopenko <prokopenkoav@ornl.gov>
CommitDate: Sat Aug 19 12:59:46 2017 -0400
MueLu: remove Tpetra version of Isorropia
Build/Test Cases Summary
Enabled Packages: MueLu
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=430,notpassed=0 (85.98 min)
Other local commits for this build/test group: a93e979917
Commits pushed:
0ffff8d MueLu: remove Tpetra version of Isorropia
a93e979 MueLu: updating interface tests for Ifpack2 5eeea40 changes
This resulted in the CI build of Trilinos being broken continuously for 19 hours from "Fri Aug 18 15:51:37 MDT 2017" to "Sat Aug 19 10:59:57 MDT 2017". Therefore, anyone who tried to push to MueLu or any of its upstream packages:
using the checkin-test-sems.sh script during that time period would have had their push stopped.
The only push recorded between the breaking push and the fixing push was:
Sat Aug 19 00:42:08 MDT 2017
commit ba9cd117e4ac749f36c4d41240f06512df332915
Author: Mauro Perego <mperego@sandia.gov>
AuthorDate: Fri Aug 18 18:40:35 2017 -0600
Commit: Mauro Perego <mperego@sandia.gov>
CommitDate: Fri Aug 18 19:41:09 2017 -0600
Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.
Commits pushed:
ba9cd11 Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.
fdda779 Intrepid2: allow function clone to accept an input view with rank 3
Since MueLu does not depend on Intrepid2, this push would have been allowed to go through (but as you can see, it does not look like the checkin-test-sems.sh script was used for this push either). However, it is not clear if anyone's pushes were stopped during this time period.
Also note that there was a single failure of the test PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 that appears not to have been caused or fixed by any commit. Therefore, we need to keep an eye on this test as a potentially fragile test.
DETAILS:
The third failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 18, 2017 - 22:03 UTC":
with the failing tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetra_MPI_4. The commits pulled in this CI iteration, shown at:
were:
ec85c46: Merge branch 'develop' of github.com:trilinos/Trilinos into develop
Author: Mehmet Deveci <mndevec@sandia.gov>
Date: Fri Aug 18 15:51:14 2017 -0600
5eeea40: Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622
Author: Mehmet Deveci <mndevec@sandia.gov>
Date: Fri Aug 18 15:50:53 2017 -0600
M packages/ifpack2/src/Ifpack2_Relaxation_decl.hpp
M packages/ifpack2/src/Ifpack2_Relaxation_def.hpp
8d29d08: MueLu: replace tabs by spaces
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date: Fri Aug 18 11:43:50 2017 -0600
M packages/muelu/research/max/XpetraSplitting/Test_muelu.cpp
M packages/muelu/research/max/XpetraSplitting/Test_xpetra.cpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_Level_def.hpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_MatrixSplitting.hpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_RegionAMG_decl.hpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_RegionAMG_def.hpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_RegionHandler_decl.hpp
M packages/muelu/research/max/XpetraSplitting/Xpetra_RegionHandler_def.hpp
7a37f4a: MueLu: add Belos solver to Aria Driver
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date: Fri Aug 18 11:18:38 2017 -0600
M packages/muelu/research/tawiesn/aria/Driver.cpp
From looking at the set of tests that failed and the commits pushed, it is not clear what caused the failure. Looking at the details of the failing tests at:
I am seeing similar failures. For example, I see the same failure in both tests:
Level 2
Build (MueLu::RebalanceTransferFactory)
EasyParameterListInterpreter/repartition4_np4.xml : failed
These same two MueLu tests failed again in the next CI iteration, started at "Aug 19, 2017 - 10:00 UTC":
In addition, a new Panzer test PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 failed in that CI iteration. The commits in that CI iteration, shown at:
were:
ba9cd11: Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.
Author: Mauro Perego <mperego@sandia.gov>
Date: Fri Aug 18 18:40:35 2017 -0600
M packages/intrepid2/refactor/unit-test/Orientation/Serial/CMakeLists.txt
A packages/intrepid2/refactor/unit-test/Orientation/Serial/test_orientation_TET.cpp
A packages/intrepid2/refactor/unit-test/Orientation/test_orientation_TET.hpp
fdda779: Intrepid2: allow function clone to accept an input view with rank 3
Author: Mauro Perego <mperego@sandia.gov>
Date: Fri Aug 18 18:37:38 2017 -0600
M packages/intrepid2/refactor/src/Shared/Intrepid2_RealSpaceToolsDef.hpp
Neither of these commits would seem to be responsible for this new Panzer test failure so that could be a fluke.
The CI iteration after that, starting "Aug 19, 2017 - 17:01 UTC":
was totally clean, showing the three tests MueLu_ParameterListInterpreterTpetra_MPI_1, MueLu_ParameterListInterpreterTpetra_MPI_4, and PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 moving from failing to passing.
The commits pulled in this CI iteration, shown at:
were:
0ffff8d: MueLu: remove Tpetra version of Isorropia
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
Date: Thu Aug 17 13:14:54 2017 -0400
M packages/muelu/src/Rebalancing/MueLu_IsorropiaInterface_decl.hpp
M packages/muelu/src/Rebalancing/MueLu_IsorropiaInterface_def.hpp
a93e979: MueLu: updating interface tests for Ifpack2 5eeea40 changes
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
Date: Sat Aug 19 11:25:17 2017 -0400
M packages/muelu/test/interface/Output/MLaux_tpetra.gold
M packages/muelu/test/interface/Output/MLcoarse1_tpetra.gold
M packages/muelu/test/interface/Output/MLcoarse2_tpetra.gold
...
So it looks like Andrey fixed the failing MueLu tests that were caused by earlier commit 5eeea40. The push that pushed commit 5eeea40 was:
Fri Aug 18 15:51:37 MDT 2017
commit ec85c46917ac532a676f68ce2a27b305fbfbb4f9
Merge: 5eeea40 8d29d08
Author: Mehmet Deveci <mndevec@sandia.gov>
AuthorDate: Fri Aug 18 15:51:14 2017 -0600
Commit: Mehmet Deveci <mndevec@sandia.gov>
CommitDate: Fri Aug 18 15:51:14 2017 -0600
Merge branch 'develop' of github.com:trilinos/Trilinos into develop
Commits pushed:
ec85c46 Merge branch 'develop' of github.com:trilinos/Trilinos into develop
5eeea40 Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622
As you can see, there is no sign that the checkin-test-sems.sh script was used to push these commits, which would explain the failures that occurred.
The push that fixed this was:
Sat Aug 19 10:59:57 MDT 2017
commit 0ffff8da7fc16b6aa231052efee836c065a23421
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
AuthorDate: Thu Aug 17 13:14:54 2017 -0400
Commit: Andrey Prokopenko <prokopenkoav@ornl.gov>
CommitDate: Sat Aug 19 12:59:46 2017 -0400
MueLu: remove Tpetra version of Isorropia
Build/Test Cases Summary
Enabled Packages: MueLu
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=430,notpassed=0 (85.98 min)
Other local commits for this build/test group: a93e979917
Commits pushed:
0ffff8d MueLu: remove Tpetra version of Isorropia
a93e979 MueLu: updating interface tests for Ifpack2 5eeea40 changes
Therefore, the CI build of Trilinos (with failing MueLu tests) was broken from "Fri Aug 18 15:51:37 MDT 2017" to "Sat Aug 19 10:59:57 MDT 2017", or about 19 hours.
The fourth and final CI failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:
Mon Aug 21 12:19:11 MDT 2017
commit 0c7f6312ff2fe596f672ee9b771ca989ee61afe1
Author: Matthias Mayr <mmayr@sandia.gov>
AuthorDate: Fri Jun 30 10:09:28 2017 -0700
Commit: Matthias Mayr <mmayr@sandia.gov>
CommitDate: Mon Aug 21 11:18:20 2017 -0700
Xpetra: reduce number of for-loops in concatenateMaps()
Build/Test Cases Summary
Enabled Packages: MueLu, Stokhos, Xpetra
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)
Other local commits for this build/test group: ffe7619, bd7e914, 130bcaf, cc0b9a7, c406050, e995e6c
Commits pushed:
0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()
ffe7619 Xpetra: updated doxygen documentation
bd7e914 MueLu: fixed compiler warning
130bcaf Xpetra: update documenation for bgs_apply
cc0b9a7 MueLu: Added BlockedJacobiSmoother
c406050 updated list of developers
e995e6c Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators
which broke the test MueLu_BlockedTransfer_Tpetra_MPI_4 for the post-push CI build run on ceerws1113. The checkin-test-sems.sh script looks to have been used, but only 1144 tests are shown and there should have been 1146 tests run (see below). So it is not clear if this test was passing on this person's machine, or was failing and they explicitly disabled it, or if something else happened (the push was not logged to the trilinos-checkin-tests email list so we can't see). But this test for the same version of Trilinos passed on my machine crf450, so it seems this test passed on some platforms and failed on others (i.e. a very badly behaving test).
This resulted in the CI build of Trilinos being broken continuously for the better part of 4 days from "Mon Aug 21 12:19:11 MDT 2017" to "Fri Aug 25 09:23:39 MDT 2017". Therefore, anyone who tried to push to MueLu or any of its upstream packages:
using the checkin-test-sems.sh script during that time period might have had their push stopped. There is evidence that this occurred to several people (see below).
This failure represents a difficult case in that the test appears to have passed on some machines but failed on others. But it also represents a bit of a failure of the development community that it took the better part of 4 days to address. In the meantime, it definitely impacted people's work (as evidenced below).
To see how big of an impact this had on people's productivity and get to the bottom of what happened, we would need to:
Ask Matthias Mayr <mmayr@sandia.gov> if he ever saw that test failing on his machine.
Ask Luc Berger-Vergiat <lberge@sandia.gov> if he saw this test failing on his machine.
Ask Chris Siefert <csiefer@sandia.gov>, crtrott <crtrott@sandia.gov>, and Andrey Prokopenko <prokopenkoav@ornl.gov> if they saw this test failing on their machines and see if that is why they did not use checkin-test-sems.sh to push.
If we dig into this more as a learning use case, we will create a new Trilinos GitHub issue to do so.
I will bring up how to better deal with failures like this at the next Trilinos Leaders Meeting to more quickly minimize impact on Trilinos developers and users.
DETAILS:
The fourth failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 21, 2017 - 18:20 UTC":
with the newly added failing test MueLu_BlockedTransfer_Tpetra_MPI_4. The commits pulled in this CI iteration, shown at:
were:
0c7f631: Xpetra: reduce number of for-loops in concatenateMaps()
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Fri Jun 30 10:09:28 2017 -0700
M packages/xpetra/src/Utils/Xpetra_MapUtils.hpp
ffe7619: Xpetra: updated doxygen documentation
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Fri Jun 30 09:57:52 2017 -0700
M packages/xpetra/doc/Xpetra_DoxygenDocumentation.hpp
bd7e914: MueLu: fixed compiler warning
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Fri Jun 30 09:37:08 2017 -0700
A packages/muelu/src/Utils/MueLu_UtilitiesBase_decl.hpp.orig
130bcaf: Xpetra: update documenation for bgs_apply
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Tue Jun 27 09:35:17 2017 -0700
M packages/xpetra/src/BlockedCrsMatrix/Xpetra_BlockedCrsMatrix.hpp
cc0b9a7: MueLu: Added BlockedJacobiSmoother
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Tue Jun 27 09:32:50 2017 -0700
M packages/muelu/src/CMakeLists.txt
...
A packages/stokhos/src/muelu/explicit_instantiation/MueLu_BlockedJacobiSmoother.cpp
...
c406050: updated list of developers
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Tue Jun 27 08:50:14 2017 -0700
M packages/muelu/doc/MueLu_DoxygenDocumentation.hpp
e995e6c: Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators
Author: Matthias Mayr <mmayr@sandia.gov>
Date: Tue May 9 14:08:57 2017 -0700
M packages/muelu/src/CMakeLists.txt
...
M packages/stokhos/src/muelu/explicit_instantiation/MueLu_BlockedCoarseMapFactory.cpp
...
Given that 18 new tests showed up, one would assume that the new tests were turned on by the commit "e995e6c: Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators". All of these commits were pushed at the same time in the push:
Mon Aug 21 12:19:11 MDT 2017
commit 0c7f6312ff2fe596f672ee9b771ca989ee61afe1
Author: Matthias Mayr <mmayr@sandia.gov>
AuthorDate: Fri Jun 30 10:09:28 2017 -0700
Commit: Matthias Mayr <mmayr@sandia.gov>
CommitDate: Mon Aug 21 11:18:20 2017 -0700
Xpetra: reduce number of for-loops in concatenateMaps()
Build/Test Cases Summary
Enabled Packages: MueLu, Stokhos, Xpetra
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)
Other local commits for this build/test group: ffe7619, bd7e914, 130bcaf, cc0b9a7, c406050, e995e6c
Commits pushed:
0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()
ffe7619 Xpetra: updated doxygen documentation
bd7e914 MueLu: fixed compiler warning
130bcaf Xpetra: update documenation for bgs_apply
cc0b9a7 MueLu: Added BlockedJacobiSmoother
c406050 updated list of developers
e995e6c Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators
So in this case, it looks like the checkin-test-sems.sh script was used. Yet, the test is shown as failing on this post-push CI server. Therefore, we need to dig a little deeper to see how this could happen.
The first clue that something is not quite right is that the above commit log shows only 1144 tests (all passing). But the post-push CI server shows a total of 1146 tests (1145 passing, 1 failing). So it looks like two tests were missing in the pre-push testing. How can this be? Let's dig deeper. First, note that this push does not appear to have been logged to the trilinos-checkin-tests mailman list for August shown at:
If it had been logged, it would have appeared in between the following two logged pushes:
It is possible that the argument --send-final-push-email-to was overridden to zero it out so we did not get that push logged. Otherwise, we would have seen what machine this push occurred on and found other details that might explain what happened.
Given that the push was not logged to the mail list, I will need to see if I can reproduce this CI build myself on my own machine crf450, which is not the same machine (ceerws1113) that the post-push CI server runs on.
First, I check out that exact version of Trilinos:
$ cd ~/Trilinos.base2/Trilinos/
$ git fetch
$ git checkout 0c7f631
Note: checking out '0c7f631'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b <new-branch-name>
HEAD is now at 0c7f631... Xpetra: reduce number of for-loops in concatenateMaps()
Then we run the checkin-test-sems.sh script to match what is shown above:
$ cd Trilinos.base2/CHECKIN/
$ ./checkin-test-sems.sh --enable-packages=MueLu,Stokhos,Xpetra --local-do-all
After the configure completed and while the build was running I ran ctest -N to see how many tests are reported:
$ ctest -N | grep "Total Tests"
Total Tests: 1146
So that shows 1146 tests, instead of the 1144 tests shown in the commit log for the above commit "0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()". That makes me think that perhaps the person who ran checkin-test-sems.sh somehow (either by accident or on purpose) disabled two tests.
Since I have verified that the total number of CI tests run by my local usage of the checkin-test-sems.sh script agrees with the number of post-push CI tests on CDash (1146 each), I will kill the checkin-test-sems.sh script and just run it on the MueLu test suite to see if I can reproduce this failing MueLu test:
$ checkin-test-sems.sh --no-enable-fwd-packages --enable-packages=MueLu --local-do-all
So that returned all passing tests on my machine showing:
100% tests passed, 0 tests failed out of 71
Label Time Summary:
MueLu = 266.48 sec (74 tests)
Total Test time (real) = 192.94 sec
(NOTE: Clearly there is a defect in CTest in that it claims there were 74 MueLu tests but there are only 71 total tests :-( I will check with Kitware on that.)
And looking at the MPI_RELEASE_DEBUG_SHARED_PT/ctest.out file, I see:
35/71 Test #33: MueLu_BlockedTransfer_Tpetra_MPI_4 ........................... Passed 1.71 sec
So the test ran and passed on my RHEL6 machine crf450, but it failed on the post-push CI build run on ceerws1113 shown at:
which showed the failure:
Computing Ac (block) (MueLu::BlockedRAPFactory)
MxM: A x P
p=0: *** Caught standard std::exception of type 'std::logic_error' :
/scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:
Throw number = 1
Throw test that evaluated to true: (!haveGlobalConstants_)
Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.
p=2: *** Caught standard std::exception of type 'std::logic_error' :
/scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:
Throw number = 1
Throw test that evaluated to true: (!haveGlobalConstants_)
Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.
p=3: *** Caught standard std::exception of type 'std::logic_error' :
/scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:
Throw number = 1
Throw test that evaluated to true: (!haveGlobalConstants_)
Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.
p=1: *** Caught standard std::exception of type 'std::logic_error' :
/scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:
Throw number = 1
Throw test that evaluated to true: (!haveGlobalConstants_)
Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
This is very strange. One would need to investigate further but it appears that there was some type of undefined behavior in the code that caused it to behave differently on different RHEL machines. It would take more effort to dig into this but since it seems to be fixed now, it is likely not worth the effort to do so.
In any case, this failing test was left to linger for days and it did not get fixed until the CI iteration that started at "Aug 25, 2017 - 15:27 UTC":
where this test MueLu_BlockedTransfer_Tpetra_MPI_4 went from failing to passing. The commits pulled in this CI iteration are shown at:
were:
33b1cc6: Xpetra: Because nested BloockCrsMatrices make things more 'fun'
Author: Chris Siefert <csiefer@sandia.gov>
Date: Fri Aug 25 08:15:02 2017 -0600
M packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp
4c20362: Xpetra: GlobalConstants call for blocked MMM
Author: Chris Siefert <csiefer@sandia.gov>
Date: Thu Aug 24 22:36:13 2017 -0600
M packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp
79ff5a8: MueLu: muemex.cpp warning clean-up
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date: Fri Aug 25 08:54:14 2017 -0600
M packages/muelu/matlab/bin/muemex.cpp
d97a8dd: MueLu: Last warning clean-up in TentativeP and Aggregation for OpenMP
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date: Fri Aug 25 08:48:57 2017 -0600
M packages/muelu/src/Graph/UncoupledAggregation/MueLu_AggregationPhase3Algorithm_kokkos_def.hpp
M packages/muelu/src/Transfers/Smoothed-Aggregation/MueLu_TentativePFactory_kokkos_def.hpp
M packages/muelu/test/unit_tests_kokkos/Aggregates_kokkos.cpp
From the summary it is not clear which commit fixed this failing test (but it would not be hard to figure out with a simple manual bisection; see the sketch after the push log below). But looking at the push log, it seems likely that this test was fixed by the push:
Fri Aug 25 09:23:39 MDT 2017
commit 33b1cc628e01c9f6e22f6a6fd3dd72f3402ebf9f
Author: Chris Siefert <csiefer@sandia.gov>
AuthorDate: Fri Aug 25 08:15:02 2017 -0600
Commit: Chris Siefert <csiefer@sandia.gov>
CommitDate: Fri Aug 25 09:22:35 2017 -0600
Xpetra: Because nested BloockCrsMatrices make things more 'fun'
Build/Test Cases Summary
Enabled Packages: Xpetra
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1146,notpassed=0 (66.87 min)
Other local commits for this build/test group: 4c20362
Commits pushed:
33b1cc6 Xpetra: Because nested BloockCrsMatrices make things more 'fun'
4c20362 Xpetra: GlobalConstants call for blocked MMM
So it looks like the CI build was broken continuously for the better part of 4 days from "Mon Aug 21 12:19:11 MDT 2017" to "Fri Aug 25 09:23:39 MDT 2017". This had to have impacted many attempted pushes.
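Regarding the manual bisection mentioned above, a minimal sketch of how it could be done with git's custom bisect terms (assumes git 2.7 or newer; the endpoints are the breaking commit 0c7f631 and the fixing commit 33b1cc6 named in this thread):
$ git bisect start --term-old=broken --term-new=fixed
$ git bisect broken 0c7f631   # test failed in the post-push CI at this commit
$ git bisect fixed 33b1cc6    # test passed again at this commit
# At each step git checks out a candidate commit; rebuild and rerun just the
# one test, then mark the result, e.g.:
$ ctest -R MueLu_BlockedTransfer_Tpetra_MPI_4 && git bisect fixed || git bisect broken
$ git bisect reset            # when done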
In any case, it is interesting to see how the Trilinos development community responded to this failure. First, note that this failure was reported in GitHub issue #1633 created on Aug 22 (yea for @william76!). So the issue got reported the day after the failure (so it was already failing for as much as a day with no action) and the MueLu developers were notified. But the test was not fixed until three days later, when the push noted above was performed (thanks @csiefer2!).
What is interesting is that several pushes occurred in that time period, and many of them used the checkin-test-sems.sh script and tested changes to MueLu. How did they do that? As I showed above, the test MueLu_BlockedTransfer_Tpetra_MPI_4 likely passed on some people's machines. But did it fail on other people's machines and impact their pushes?
One example we can see is the push:
Thu Aug 24 10:34:15 MDT 2017
commit bd419666ac10b8fc61304e17f534e93744a105e3
Author: Luc Berger-Vergiat <lberge@sandia.gov>
AuthorDate: Wed Aug 23 14:45:20 2017 -0600
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
CommitDate: Thu Aug 24 10:33:39 2017 -0600
MueLu: catching more kokkos header changes in tests
Build/Test Cases Summary
Enabled Packages: MueLu, Xpetra
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1145,notpassed=0 (150.74 min)
Other local commits for this build/test group: 50f78e1
Commits pushed:
bd41966 MueLu: catching more kokkos header changes in tests
50f78e1 MueLu: change header include Kokkos_CrsMatrix.hpp to KokkosSparse_CrsMatrix.hpp
That push was logged to the trilinos-checkin-tests email list at:
and that log shows:
passed: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=1145,notpassed=0
Thu Aug 24 10:32:54 MDT 2017
Enabled Packages: MueLu, Xpetra
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
Hostname: geminga.sandia.gov
Source Dir: /home/lberge/Research/checkin/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/lberge/Research/checkin/Trilinos/checkin/MPI_RELEASE_DEBUG_SHARED_PT
CMake Cache Varibles: ...
Extra CMake Options: -DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON
Make Options: -j4
CTest Options: -j4
Pull: Passed (0.00 min)
Configure: Passed (2.65 min)
Build: Passed (110.31 min)
Test: Passed (37.77 min)
100% tests passed, 0 tests failed out of 1145
See the extra CMake option -DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON? So it looks like Luc followed the instructions at:
and disabled this known failing test. That is great!
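For illustration, a usage along those lines might look something like the following (a sketch only; it assumes the --extra-cmake-options argument of checkin-test-sems.sh, which is what produces the "Extra CMake Options" line in the log above, and the enabled packages are just examples):
# Temporarily disable a known failing test locally so it does not block the
# pre-push testing, without committing any disable into the repository.
$ ./checkin-test-sems.sh \
    --enable-packages=MueLu,Xpetra \
    --extra-cmake-options="-DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON" \
    --do-all --push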
What about the other pushes during this time period? A grep of the push log file shows:
Fri Aug 25 09:23:39 MDT 2017
Commit: Chris Siefert <csiefer@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1146,notpassed=0 (66.87 min)
Fri Aug 25 09:18:24 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (23.42 min)
Thu Aug 24 20:44:06 MDT 2017
Commit: Chris Siefert <csiefer@sandia.gov>
Thu Aug 24 14:19:12 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (42.64 min)
Thu Aug 24 12:19:02 MDT 2017
Commit: Mehmet Deveci <mndevec@sandia.gov>
Thu Aug 24 10:52:46 MDT 2017
Commit: Mehmet Deveci <mndevec@sandia.gov>
Thu Aug 24 10:34:15 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1145,notpassed=0 (150.74 min)
Thu Aug 24 09:24:18 MDT 2017
Commit: Michael Wolf <mmwolf@sandia.gov>
Wed Aug 23 19:51:10 MDT 2017
Commit: Christian Robert Trott (-EXP) <crtrott@sandia.gov>
Wed Aug 23 16:29:30 MDT 2017
Commit: Tobias Wiesner <tawiesn@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=446,notpassed=0 (27.94 min)
Wed Aug 23 15:00:24 MDT 2017
Commit: Bill Spotz <wfspotz@sandia.gov>
Wed Aug 23 09:06:01 MDT 2017
Commit: Paul Wolfenbarger <prwolfe@users.noreply.github.com>
Wed Aug 23 08:31:09 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (66.08 min)
Tue Aug 22 20:49:29 MDT 2017
Commit: crtrott <crtrott@sandia.gov>
Tue Aug 22 18:15:37 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (14.65 min)
Tue Aug 22 17:42:45 MDT 2017
Commit: Matthias Mayr <mmayr@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (24.37 min)
Tue Aug 22 16:30:55 MDT 2017
Commit: Brent Perschbacher <bmpersc@sandia.gov>
Tue Aug 22 16:02:10 MDT 2017
Commit: Jason M. Gates <jmgate@sandia.gov>
Tue Aug 22 15:29:16 MDT 2017
Commit: Kara Peterson <kjpeter@sandia.gov>
Tue Aug 22 15:23:02 MDT 2017
Commit: Luc Berger-Vergiat <lberge@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (6.84 min)
Tue Aug 22 13:40:25 MDT 2017
Commit: Andrey Prokopenko <prokopenkoav@ornl.gov>
Mon Aug 21 12:19:11 MDT 2017
Commit: Matthias Mayr <mmayr@sandia.gov>
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)
The above shows the first breaking push on "Mon Aug 21 12:19:11 MDT 2017" and the final fixing push on "Fri Aug 25 09:23:39 MDT 2017". In between, there were 21 total pushes, but only 8 of these used the checkin-test-sems.sh script, and it was only used by:
Luc Berger-Vergiat <lberge@sandia.gov>
Tobias Wiesner <tawiesn@sandia.gov>
Matthias Mayr <mmayr@sandia.gov>
Several of the people who pushed during that period and who usually use the checkin-test-sems.sh script did not use it this time, which I think includes:
Chris Siefert <csiefer@sandia.gov>
crtrott <crtrott@sandia.gov>
Andrey Prokopenko <prokopenkoav@ornl.gov>
I wonder if they did not use the checkin-test-sems.sh script to push because of this failing test, and so just pushed manually? And perhaps they did not know how to selectively disable known failing tests as described at:
It looks like that was the case for @crtrott as shown in commit 72441d507073a2929003c6e53118036c2eaa0d17.
In a follow-up to the most recent CI build failure described above, I contacted Matthias and he said that he saw the test MueLu_BlockedTransfer_Tpetra_MPI_4 failing on his machine and was instructed to just disable it locally and push. (That explains the reduced number of tests reported in his push.) He did not realize the impact that pushing this failing test would have on the CI testing process and on other developers.
To help address this, I will add a new GitHub wiki page describing the proper ways to disable a given test for different use cases. I will then add a section to the checkin-test-sems.sh wiki page on the proper way to address failing tests on one's local machine before pushing, so as to avoid breaking the CI (and other) builds.
FYI: The CI build is broken due to the failing test Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1
that was just pushed (see #1703). I am pushing a disable of that test for the CI build (and only the CI build).
@bartlettroscoe: Is anyone keeping track of the tests that were disabled in the CI build? I assume that these tests will be fixed at some point. Are you the point of contact for re-enabling them? Specifically, ROL has some tests that had to be disabled. I recall one (checkAlmostSureConstraint), but I no longer recall others (if any).
@dridzal,
Is anyone keeping track of the tests that were disabled in the CI build? I assume that these tests will be fixed at some point. Are you the point of contact for re-enabling them? Specifically, ROL has some tests that had to be disabled. I recall one (checkAlmostSureConstraint), but I no longer recall others (if any).
Yes, there are explicit instructions in every GitHub issue on how the person fixing the failing test should first revert the disable commit before fixing the test locally. See https://github.com/trilinos/Trilinos/issues/1703#issuecomment-327652385.
And if you want to see the current set of disabled tests in the CI build, just look at the bottom of the file:
(That is the correct way to disable a test for only the CI build, not disabling it locally and pushing as happened recently, which resulted in a failing test showing up in the post-push CI build and for everyone else.)
Note that one of those currently listed is for a ROL test that has not yet been addressed. See #1596.
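For illustration, the entries at the bottom of that file are expected to look something like the following (a sketch only; the exact file contents are not reproduced here, and the test names are the ones mentioned in this thread):
# Disable known failing tests in the standard CI build only (revert the
# disable before fixing the test; see the referenced GitHub issues).
set(ROL_test_sol_checkAlmostSureConstraint_MPI_1_DISABLE ON CACHE BOOL "See #1596")
set(Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1_DISABLE ON CACHE BOOL "See #1703")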
The process is:
Note that once we can upgrade CMake/CTest/CDash, these disabled tests will be displayed on CDash as "Not Run"/"Disabled", so you will see them there too (but they will not trigger CDash error emails). See:
Great!
@bartlettroscoe, this may be addressed elsewhere, but is there a best practice for the case that I want to test on multiple machines before pushing (or opening a PR)? For example, with Tpetra, I want to make sure that the standard checkin tests pass on my RHEL blade (obviously), but also different builds on different architectures. In this case, I may push to a branch on my fork and then update and run tests on the several different machines from that branch. The only way (that I know of) to then report test results is to amend the last commit manually with test results and force push. Perhaps there is a better way?
@tjfulle I like your thinking :-) . Right now, one must add text manually to the commit message, explaining what tests passed where. We don't necessarily want to annotate every commit with all the information needed to replicate a test, but some brief mention that (e.g.,) it was tested with CUDA in a debug build could be nice.
@tjfulle:
is there a best practice for the case that I want to test on multiple machines before pushing (or opening a PR)? For example, with Tpetra, I want to make sure that the standard checkin tests pass on my RHEL blade (obviously), but also different builds on different architectures. In this case, I may push to a branch on my fork and then update and run tests on the several different machines from that branch. The only way (that I know of) to then report test results is to amend the last commit manually with test results and force push. Perhaps there is a better way?
@mhoemmen:
I like your thinking :-) . Right now, one must add text manually to the commit message, explaining what tests passed where. We don't necessarily want to annotate every commit with all the information needed to replicate a test, but some brief mention that (e.g.,) it was tested with CUDA in a debug build could be nice.
That would not be too hard to do with crafty usage of the checkin-test.py script. You could run the checkin-test.py script on each machine separately (e.g. with --local-do-all) for a special --extra-st-builds=<specialBuildNamei>, and then copy a subset of the <specialBuildNamei>/*.out files and all of the <specialBuildNamei>/*.success files to your CEE LAN machine. Then the checkin-test-sems.sh script could be run with the full list of --extra-st-builds=<specialBuildName1>,<specialBuildName2>,..., and it would correctly list those extra builds, amend the top commit message with all of those builds, and archive the details of those builds to the trilinos-checkin-tests email list on push. If you combine the branch-moving and remote-run approach demonstrated in remote-pull-test-push.sh with the aggregation of multiple runs of the checkin-test.py script in checkin-test-crf450-cmake-2.8.11.sh, and add some scp commands to copy files back, then you basically have it. One could even write some reusable utility scripts to help drive a process like this so that many developers could use it.
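A rough sketch of that flow, assuming two machines and a hypothetical extra build name cuda-debug (the option names are the ones mentioned above; paths and host names are placeholders):
# On the remote test machine: run only the extra build locally, no push.
$ cd ~/Trilinos.base/CHECKIN
$ ./checkin-test.py --extra-st-builds=cuda-debug --local-do-all

# Copy the result summary files back to the CEE LAN machine.
$ scp cuda-debug/*.out cuda-debug/*.success \
    my-cee-machine:~/Trilinos.base/CHECKIN/cuda-debug/

# On the CEE LAN machine: run the standard pre-push testing, listing the
# extra build so it gets amended into the top commit message and archived
# in the push email.
$ ./checkin-test-sems.sh --extra-st-builds=cuda-debug --do-all --push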
But to go any further, we should discuss this more in a separate GitHub issue (since this issue is focused on monitoring the standard CI build).
@ibaned pointed out a case where a broken test stopped a push sometime around July 17 (see https://github.com/trilinos/Trilinos/issues/1511#issuecomment-328863552). Unfortunately, CDash only keeps 6 weeks of results so we can't even see that.
@bartlettroscoe wrote:
But to go any further, we should discuss this more in a separate GitHub issue (since this issue is focused on monitoring the standard CI build).
I opened a new issue, #1725, to discuss this. Thanks!
@ibaned pointed out a case where a broken test stopped a push sometime around July 17 (see #1511 (comment)). Unfortunately, CDash only keeps 6 weeks of results so we can't even see that.
Okay, I misinterpreted that comment in #1511. The issue is that the test Teko_testdriver_tpetra_MPI_1
seems to be passing on some machines and failing on others. That is currently blocking a push for Kokkos in #1721 on @crtrott's machine.
We need to get to the bottom of why this test may not be passing on @crtrott's machine but passes on all of the others involved in Trilinos automated testing.
Note, an issue like this impacts every testing process you can possibly imagine with a heterogeneous set of machines like we have with Trilinos developers. Even with automated PR testing (i.e. #1155), people still need to be able to reproduce failing builds and tests consistently across machines if possible.
FYI: The machine that runs the standard CI build, ceerws1113, will be down from at least 4 pm MDT on 9/15/2017 to at least 6 pm MDT on 9/16/2017. Therefore, Murphy's law for software says that the CI build will be broken when it starts back up :-)
FYI: There were a lot of test failures in the CI build this morning shown at:
This appears to have been due to an env problem of some type breaking mpiexec. For example, for the failing test ThyraCore_test_std_ops_serial_MPI_1, it showed the failure:
-------------------------------------------------------------------------
Open MPI was unable to obtain the username in order to create a path
for its required temporary directories. This type of error is usually
caused by a transient failure of network-based authentication services
(e.g., LDAP or NIS failure due to network congestion), but can also be
an indication of system misconfiguration.
Please consult your system administrator about these issues and try
again.
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file util/session_dir.c at line 390
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_session_dir failed
--> Returned value Out of resource (-2) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Out of resource (-2) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file orterun.c at line 694
But these failures went away after the Thyra tests ran.
Therefore, this was not a problem with the code but likely some type of problem with the CEE LAN environment on the machine ceerws1113 that runs the standard CI build.
This is a brief comment to log some stats about this CI process using the checkin-test.py script in Trilinos. How many people used the checkin-test-sems.sh script to push this year (FY17) as compared to last year (FY16)? I will use today's date, 9/16, in each year as the boundary.
$ git shortlog -ns --grep="Build/Test Cases Summary" --after="9/16/2015" --before="9/16/2016" | wc -l
31
$ git shortlog -ns --grep="Build/Test Cases Summary" --after="9/16/2016" --before="9/16/2017" | wc -l
34
Wow, that looks like no improvement at all. So really not that many more people used the checkin-test.py script to push in FY17 than in FY16. This is surprising. But how many people enabled all downstream packages? Let's look at that:
[rabartl@crf450 Trilinos (develop)]$ git shortlog -ns --grep="Enabled all Forward Packages" --after="9/16/2015" --before="9/16/2016" | wc -l
29
[rabartl@crf450 Trilinos (develop)]$ git shortlog -ns --grep="Enabled all Forward Packages" --after="9/16/2016" --before="9/16/2017" | wc -l
31
Boy, so that is not that different either.
The real metric that I would like to get is what fraction of pushes that modified source code files in a Trilinos package used the checkin-test.py script to push. Unfortunately, I don't know of any way to generate those statistics from before I started logging pushes to Trilinos on May 26 as part of #1362. That is because GitHub does not even give you push stats, not to mention the actual push info (like I am collecting now).
And we also can't look at CDash history for any data because it only keeps 6 weeks of history. So we can't say anything about stability or improved productivity (other than developers' and customers' anecdotal statements).
So we have no way to get relevant metrics about the stability of Trilinos or improved usage of the checkin-test-sems.sh script :-( It is a basic premise of empirical software engineering that you need good metrics if you are going to know whether changes are making things better or worse. It looks like we just can't get that data for Trilinos (at least not by looking directly at Trilinos itself; perhaps customers could do better).
Hopefully all of this will be unimportant once an effective pull-request based testing and integration process is fully implemented and enforced (i.e. #1155). But even then, we need metrics to know how well that is working. What will those metrics be?
What will those metrics be?
One simple metric I can think of is number of instances where downstream apps find issues. A concrete example is to create a few extra issues labels besides "bug", namely "compile error (Trilinos)", "compile error (application)", "compile warning (Trilinos)", "compile warning (application)", "test failure (Trilinos)", "test failure (application)". Then we can collect statistics on how many issues with each label were opened in a particular period of time. I would also expand "(Trilinos)" to be either "(Trilinos/develop)" or "(Trilinos/master)". Another useful statistic would be, for each such issue, time between opening and closing. If a problem with Trilinos master is found, the issue cannot be closed until the fix reaches the master branch.
One simple metric I can think of is number of instances where downstream apps find issues. A concrete example is to create a few extra issues labels besides "bug", namely "compile error (Trilinos)", "compile error (application)", "compile warning (Trilinos)", "compile warning (application)", "test failure (Trilinos)", "test failure (application)". Then we can collect statistics on how many issues with each label were opened in a particular period of time.
Those would be great. The main challenge with that is getting people to actually remember to add the right labels. In my experience, metrics that require a bunch of people to remember to do something will be very incomplete. If possible, it is much better if we can collect metrics that don't require any specific action by anyone (other than the work to set up automated metrics extraction and archiving). It would be best if we could directly monitor the customer applications' integration processes with Trilinos and record how frequently they are broken and for how long. But even that would be hard to interpret because "broken" means different things depending on the integration model a customer app has chosen (e.g. directly pulling from 'develop' like EMPIRE developers currently do, or keeping a separate repo clone and only updating Trilinos if everything passes for the APP, like SPARC or SIERRA).
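For what it is worth, the extraction side of the label proposal above could be largely automated. Here is a minimal sketch, using the hypothetical label names proposed above and the public GitHub search API, of counting how many issues with a given label were opened in a given window (so the only manual step left is applying the labels):

# Count trilinos/Trilinos issues created in a time window that carry a given
# label. The label name is one of the hypothetical ones proposed above; the
# "total_count" field of the JSON reply is the count we want.
LABEL="test failure (Trilinos)"
SINCE=2016-09-16
UNTIL=2017-09-16
curl -s -G "https://api.github.com/search/issues" \
  --data-urlencode "q=repo:trilinos/Trilinos label:\"${LABEL}\" created:${SINCE}..${UNTIL}" \
  | grep '"total_count"'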
@bartlettroscoe Your remote check-in test script has been fantastic for me! I love the "fire and forget" feature. I use it for nearly every commit -- the only exceptions have been "emergency" pushes to fix a known build issue.
I would consider adding labels part of triaging a new issue. I think it's good hygiene for Trilinos developers to go through issues now and then, and add labels as appropriate.
it is much better if we can collect metrics that don't require any specific action by anyone (other than the work to set up automated metrics extraction and archiving)
While I agree this is true in the very long run, I think the work to set up automated extraction is quite daunting and unlikely to reach the level of completeness that we can achieve manually unless a lot of dedicated funding is poured into it.
hard to interpret because "broken" means different things depending on the integration model a customer app has chosen
This is also part of the reason extraction is so hard: each application has such different infrastructure and practices for testing. What all applications have in common is that they have to get in touch with Trilinos developers to fix any problem, and I think most of that already flows through GitHub. An additional benefit of the manual approach is that we will catch reports from users outside our organization, which are still significant compared to internal reports.
The main challenge with that is getting people to actually remember to add the right labels.
While there will be a misstep or two inevitably, the Kokkos team has had good success adding labels that have very strict meaning and play a part in an automated workflow (our InDevelop label indicates a fix has been pushed to the develop branch, and all such issues are automatically closed when develop is merged to master).
While there will be a misstep or two inevitably, the Kokkos team has had good success adding labels that have very strict meaning and play a part in an automated workflow (our InDevelop label indicates a fix has been pushed to the develop branch, and all such issues are automatically closed when develop is merged to master).
Okay, I am convinced. Let's create a separate GitHub issue called something like "Define labels and rules for Trilinos application issues and metrics" and then we can discuss it more there and bring in Mike H., Jim W., and other interested individuals. But before we define any more labels, I think we need to better organize them along the lines of #1619.
The merge of PR #1563 caused a compile error in the standard SEMS build used for checkin, which blocked the checkin of #1767. The details are logged as issue #1772.
@ibaned,
The merge of PR #1563 caused a compile error in the standard SEMS build used for checkin, which blocked the checkin of #1767. The details are logged as issue #1772.
Thanks for catching this so fast! You reported this more than 2.5 hours before the post-push CI build showed this failure (because it was already processing an earlier push that was pretty expensive to rebuild).
FYI: I reverted the bad merge commit referenced in #1772 and it looks like the standard CI build is clean again this morning (and hopefully it got reverted in time not to blow up the various Nightly builds). I provided instructions in #1772 on how to go about fixing this and then merging, testing, and pushing again (this time using the checkin-test-sems.sh script to avoid breaking the standard CI build).
FYI: None of the CI builds are showing up today. See #1880.
FYI: There have been random-looking build failures showing up on the CI build running on ceerws1113 starting Friday night, shown at:
These look like disk write failures. I have disabled the CI build on ceerws1113 until I can determine what is happening. (I am running df -h
but it is hanging.)
FYI:
This morning df -h
is not hanging anymore and it shows that /scratch has 634G of free space. I have restarted the CI server on ceerws1113 and it is posting to CDash at:
We will see what happens from here. If I see any of the same types of system failures, I will kill the CI server and investigate further.
FYI: The CI test suites for Tpetra and Xpetra are currently broken as described in #1929. Therefore, if you are pushing changes to Tpetra or Xpetra or to packages upstream from these, these failures will block your push using the checkin-test-sems.sh script. (Or you can locally disable these tests as described at https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing#disable_already_failing; see the sketch below.) I will try to back out these commits today if they are not fixed in the next hour or so.
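For reference, a minimal sketch of what that local disable can look like, assuming the TriBITS <fullTestName>_DISABLE cache option is the mechanism the wiki page describes (the test name below is just a placeholder; see the wiki page for the authoritative instructions):

# In an existing, already-configured build directory, turn off a known-failing
# test by its full name (placeholder name shown) and run the test suite as usual.
cd $BUILD_DIR
cmake -DTpetra_SomeAlreadyFailingUnitTest_MPI_4_DISABLE:BOOL=ON .
ctest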
@bartlettroscoe Please feel free to revert the commits if they are hindering progress. Thanks!
FYI: The CI server on ceerws1113 is showing catastrophic failures again this morning (see #1932). I have killed the CI server and will investigate more carefully this time (see #1932 for details).
FYI: I reverted the not-ready-for-prime-time commits described in #1929 and I manually restarted the CI server on ceerws1113 (and will watch carefully in #1932 to see that it shuts down tonight). We should hopefully see a 100% clean CI build again (and pushes should not be stopped right now either).
FYI: Both of the CI builds were broken last night (two failing SEACAS tests) due to a push last night (see details in #2039). I am in the process of disabling these two tests for the CI build. But this should not impact people's pushes with checkin-test-sems.sh or with the automated PR testing unless they trigger the enable of the SEACAS tests.
Related to: #1362
Next Action Status:
The auto PR testing process (#1155) is deployed and is working fairly well to stabilize 'develop' (at least as well as, or better than, the checkin-test-sems.sh script did). Further improvements will be worked on in other issues.
Description:
This story is to discuss and decide how to address stability problems of the Trilinos 'develop' branch in the short term. I know there is a long-term plan to use a PR model (see #1155) but since there are no updates or ETA on that, we need to address stability issues faster than that.
Recently there have been a good number of stability problems with the Trilinos 'develop' branch, even with the basic CI build linked to from:
and the "Clean" builds shown here:
The "Clean" builds have never been clean in the entire history of the track.
Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and they are still broken as I type this).
We need a strategy to improve stability right now. I have been helping people get set up to use the checkin-test-sems.sh script to test and push their changes (basic usage is sketched below). I would estimate that a large percentage of the failures (and 100% of the CI failures) seen on CDash would be avoided by usage of the checkin-test-sems.sh script.
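For anyone who has not tried it yet, the basic pattern being recommended looks roughly like the following sketch, assuming the script is symlinked into a local checkin build directory (the standard checkin-test.py options --do-all and --push drive the pull/configure/build/test cycle and then push only if everything passes):

# Hypothetical local checkin build directory with checkin-test-sems.sh symlinked in.
cd $CHECKIN_BUILD_DIR
# Pull, configure, build, and test the standard CI build, then push on success.
./checkin-test-sems.sh --do-all --push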
CC: @trilinos/framework