trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Address basic stability of Trilinos 'develop' branch short-term #1304

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 7 years ago

Related to: #1362

Next Action Status:

The auto PR testing process (#1155) is deployed and is working fairly well to stabilize 'develop' (at least as well as, or better than, the checkin-test-sems.sh script did). Further improvements will be worked on in other issues.

Description:

This story is to discuss and decide how to address stability problems of the Trilinos 'develop' branch in the short term. I know there is a long-term plan to use a PR model (see #1155), but since there are no updates or an ETA for that, we need to address stability issues sooner.

There have recently been a good number of stability problems on the Trilinos 'develop' branch, even with the basic CI build linked to from:

and the "Clean" builds shown here:

The "Clean" builds have never been clean in the entire history of the track.

Some very recent examples of failures causing this are described in #1290 and #1301. These have broken the standard CI build and the "Clean" builds continuously since May 4 (and it is still broken as I type this).

We need a strategy to improve stability right now. I have been helping people set up to use the checkin-test-sems.sh script to test and push their changes. I would estimate that a large percentage of the failures (and 100% of the CI failures) seen on CDash would have been avoided by use of the checkin-test-sems.sh script.
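For reference, a minimal sketch of the kind of invocation involved (the directory and package list are placeholders, and this assumes the standard checkin-test.py --do-all/--push options that the script wraps):

$ cd <your-Trilinos-CHECKIN-build-directory>/
$ ./checkin-test-sems.sh --enable-packages=<PackagesYouModified> --do-all --push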

CC: @trilinos/framework

bartlettroscoe commented 7 years ago

We are still seeing the ROL examples crashing the build for the CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED run on sadl30906.srn.sandia.gov, shown as recently as this morning at:

Therefore, it is time to disable ROL in this CI build. But we never see this build failure in the CI build Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_CI run on ceerws1113, so we will not be losing any testing by making this change.

Wow, I thought that the build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED could not really be using GCC 4.7.2 and that it must be a misprint. But looking at the configure output at:

we see that it really is using GCC 4.7.2 as shown at:

 -- CMAKE_C_COMPILER_ID='GNU'
-- CMAKE_C_COMPILER_VERSION='4.7.2'
-- CMAKE_CXX_COMPILER_ID='GNU'
-- CMAKE_CXX_COMPILER_VERSION='4.7.2'

This build should likely be turned off until it can be upgraded to GCC 4.8.4. We are no longer supporting C++11 for GCC versions less than 4.8.4 (see #1453).

Therefore, I will disable this build until the @trilinos/framework team can upgrade this build.

brian-kelley commented 7 years ago

@trilinos/amesos2 If somebody familiar with Amesos2 could review and merge PR #1532, that would knock 3 of the 5 remaining test failures off the dashboard.

bartlettroscoe commented 7 years ago

There is a new failing ROL CI test, ROL_test_sol_checkAlmostSureConstraint_MPI_1, from changes that were pushed last night, shown here:

This failure was triggered by one of the commits:

5140740:  Teuchos: raise Parser sub-package to PS status
Author: Dan Ibanez <daibane@sandia.gov>
Date:   Thu Aug 10 09:19:26 2017 -0600

M   packages/teuchos/cmake/Dependencies.cmake

3bcfc73:  Merge remote branch 'intermediate-repo/develop' into develop
Author: Irina K. Tezaur <ikalash@sandia.gov>
Date:   Thu Aug 10 12:35:18 2017 -0600

25e539c:  Piro: adding ALBANY_BUILD ifdef logic to Piro::TempusSolver to get the right template arguments when constructing a Piro::TempusSolver object in Albany.
Author: Irina K. Tezaur <ikalash@sandia.gov>
Date:   Thu Aug 10 11:34:08 2017 -0700

M   packages/piro/src/Piro_TempusSolver.hpp

shown at:

It looks like this is a consequence of the failure described in #1596.

@ibaned, even though enabling TeuchosParser should not have caused this failure, it may be moving things around in memory in a way that caused this ROL test to start showing erratic behavior.

I am disabling that failing test ASAP so that this does not trip up anyone else.

bartlettroscoe commented 7 years ago

Wow, my push shown below just this morning dodged the failing ROL test because ROL has no dependency on STK. Others will not be so lucky. I am in the process of running the checkin-test-sems.sh script to disable this failing ROL test for the CI build (but no other builds).

DID PUSH: Trilinos: crf450.srn.sandia.gov

Fri Aug 11 08:55:12 MDT 2017

Enabled Packages: STKUtil
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=162,notpassed=0 (41.88 min)

*** Commits for repo :
  96ab45d stk_util: Add optional compilation of this file
  cdc7a62 Make STKUtil dependence on SEACASAprepro_lib optional (stk-17354)

0) MPI_RELEASE_DEBUG_SHARED_PT Results:
---------------------------------------

  passed: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=162,notpassed=0

  Fri Aug 11 08:55:06 MDT 2017

  Enabled Packages: STKUtil
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Forward Packages
  Hostname: crf450.srn.sandia.gov
  Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT

  CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake -DTrilinos_ENABLE_STKUtil:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
  Make Options: -j16
  CTest Options: -j16 

  Pull: Passed (0.00 min)
  Configure: Passed (0.63 min)
  Build: Passed (40.21 min)
  Test: Passed (1.04 min)

  100% tests passed, 0 tests failed out of 162

  Label Time Summary:
  Panzer               = 271.30 sec (132 tests)
  STK                  =  24.75 sec (11 tests)
  TrilinosCouplings    =  41.60 sec (19 tests)

  Total Test time (real) =  62.23 sec

  Total time for MPI_RELEASE_DEBUG_SHARED_PT = 41.88 min

bartlettroscoe commented 7 years ago

I just pushed a commit to disable this failing ROL test in the CI build (see https://github.com/trilinos/Trilinos/issues/1596#issuecomment-321843345). However, it will not be in time to help @ikalash, who looks to be running the checkin-test-sems.sh script right now on ceerws1113 trying to push. Her push will likely be blocked due to this failing test. But all she needs to do is run it again and it will pass, since I have pushed the commit to disable it (and I sent her an email stating that).

NOTE: I am going on vacation for the next two weeks after today and will not be back till Monday 8/28. While I am gone, can someone on the @trilinos/framework team keep an eye on this CI build and resolve issues like this (and restore 100% passing ASAP by disabling tests or backing out commits when needed)? If we get lucky, no problems will pop up while I am gone. But we barely made it two weeks since the last failure (see above). If you don't keep the CI build 100% clean at all times, things break down very quickly.

Please let's get the automated PR testing and merging system stood up (#1155)!

bartlettroscoe commented 7 years ago

The CI build was clean again as of yesterday's first CI iteration:

However, it just got broken again. I will comment on that in the next comment.

bartlettroscoe commented 7 years ago

The CI build was passing for all of 6 hours before it was broken again with an Intrepid2 test build failure (see #1600) shown at:

While this test build failure in Intrepid2 persists, anyone trying to use the checkin-test-sems.sh script to push changes to any of the following packages upstream from Intrepid2 will have their pushes stopped:

I will work to externally disable just that one test build so that it will not trip up anyone until it can be fixed.

bartlettroscoe commented 7 years ago

I surgically disabled just that one failing Intrepid2 test as described at https://github.com/trilinos/Trilinos/issues/1600#issuecomment-321990301. The next CI iteration should be clean.

I also provided full instructions on how to revert the disable, fix the failure, and push using checkin-test-sems.sh to avoid another breakage of the CI build.

bartlettroscoe commented 7 years ago

BTW, as discussed above, the problematic CI build Linux-GCC-4.7.2-CONTINUOUS_MPI_OPT_DEV_SHARED has been disabled as of 8/10 as shown at:

Now just the one single CI build shown at:

is running. Now we just need to keep it clean.

bartlettroscoe commented 7 years ago

The CI build for Intrepid2 is clean again as shown at:

You can see that Intrepid2 test being disabled at:

which shows:

-- Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix EXE NOT being built due to Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_EXE_DISABLE='ON'
-- Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1: NOT added test because Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1_DISABLE='ON'!
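For reference, a minimal sketch of what that kind of external, CI-build-only disable can look like (the settings file name here is my assumption, taken from the Trilinos_CONFIGURE_OPTIONS_FILE list shown earlier in this thread; the variable names are the ones printed above):

$ cat >> cmake/std/BasicCiTestingSettings.cmake << 'EOF'
# Temporarily disable failing Intrepid2 test in the CI build only (see #1600)
SET(Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_EXE_DISABLE ON CACHE BOOL "")
SET(Intrepid2_refactor_unit-test_Orientation_Serial_Test_OrientationToolsCoeffMatrix_MPI_1_DISABLE ON CACHE BOOL "")
EOF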

And you can see a "-1" for the number of "Not Run" tests for Intrepid2 shown at:

ndellingwood commented 7 years ago

Merged @brian-kelley's PR #1532 regarding the Amesos2 failures. This is also tracked in #1495.

lucbv commented 7 years ago

@bartlettroscoe is it possible to have the checkin script set $OMP_PROC_BIND=false when running the tests? It took me a couple of tries to realize that having $OMP_PROC_BIND=true was the reason most of the tests were timing out when run with MPI.
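A minimal sketch of the manual workaround in the meantime (assuming a bash-like shell):

$ export OMP_PROC_BIND=false   # turn off OpenMP thread binding before running the tests
$ ctest -j16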

ibaned commented 7 years ago

1625

bartlettroscoe commented 7 years ago

Getting back from a 2-week vacation, I am looking at the results for the CI build while I was gone at:

The good news is that the CI build appears to have been completely clean since late in the day last Friday, 8/25/2017.

The bad news is that it looks like the CI build was broken at least 4 separate times and the last breakage lasted from 8/22/2017 to 8/25/2017. Anyone trying to use the checkin-test-sems.sh script to push to an upstream package during that time would have had their pushes stopped (I will see if there is any evidence for that).

Since CDash only records a 6-week moving window, I will document each of the failures in comments here for archival purposes and for later analysis. I will write one comment for each failure (4 comments total).

ibaned commented 7 years ago

@bartlettroscoe for the record, the Kokkos 2.04.00 snapshot commit (6811bb33bdcb1633c2b9f7cb62e94a43ef057f6c) using checkin-test-sems was blocked by the test failure documented in #1615 which was caused by the STK snapshot that arrived about a day earlier.

bartlettroscoe commented 7 years ago

The first failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:

Wed Aug 16 12:37:48 MDT 2017

commit e15b6358a6aa24d080ef9840816d1c3c47df5fd8
Author:     Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Tue Aug 15 15:28:37 2017 -0600
Commit:     Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 12:37:02 2017 -0600

    Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc

    From repository at sierra-git.sandia.gov:/git/sierra.base.git

    At commit:
    commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
    Author: Greg Sjaardema <gdsjaar@sandia.gov>
    Date:   Mon Aug 14 10:35:14 2017 -0600

        APREPRO: Fix so will compile with intel-14

Commits pushed:
e15b635 Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
0125af1 Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d

and persisted for a little over 7 hours, from "Wed Aug 16 12:37:48 MDT 2017" till "Wed Aug 16 17:55:21 MDT 2017". That push did not appear to use the checkin-test-sems.sh script.

As noted above, this breakage stopped at least one push to Trilinos. It is unknown if this stopped any other pushes (we would have to ask since there is no archiving of failed invocations of checkin-test-sems.sh).

The following pushes occurred during that period:

commit 3ddc1f116745766ec4c6a138e0e269c1fc863ac0
Merge: e15b635 9407f03
Author:     Irina K. Tezaur <ikalash@sandia.gov>
AuthorDate: Wed Aug 16 14:55:18 2017 -0600
Commit:     Irina K. Tezaur <ikalash@sandia.gov>
CommitDate: Wed Aug 16 15:36:51 2017 -0600

    Merge remote branch 'intermediate-repo/develop' into develop

    Build/Test Cases Summary
    Enabled Packages: Piro
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=147,notpassed=0 (41.40 min)
    Other local commits for this build/test group: 9407f03

Commits pushed:
3ddc1f1 Merge remote branch 'intermediate-repo/develop' into develop
9407f03 Piro: adding setObserver() method to TempusSolver class.
commit 0bf149383272f8c5562c8b97736b35a02d93990d
Author:     Jonathan Hu <jhu@sandia.gov>
AuthorDate: Wed Aug 16 13:58:43 2017 -0700
Commit:     Jonathan Hu <jhu@sandia.gov>
CommitDate: Wed Aug 16 15:57:14 2017 -0700

    MueLu: rebase interface tests

    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=148,notpassed=0 (117.23 min)
    Other local commits for this build/test group: ad8a580, ae678d9, 2d730ee

Commits pushed:
0bf1493 MueLu: rebase interface tests
ad8a580 MueLu: remove option "rowWeight"
ae678d9 MueLu: avoid partition assignment step if possible
2d730ee MueLu: fix typo in message

Luckily, MueLu and Piro don't have STK as a downstream dependency, so these usages of the checkin-test-sems.sh script were not blocked. But unfortunately, STK is downstream from Kokkos, which blocked the Kokkos push noted above.

Also, if anyone would have tried to push to Trilinos from any of the following packages (which are all upstream dependencies of STK):

these would also have had their pushes stopped due to this failure.


DETAILS:

The first failure that occurred over the period 8/13/2017 - 8/28/2017 was first shown in the CI iteration started "Aug 16, 2017 - 18:38 UTC":

which was the addition of a new failing test STKUnit_tests_stk_tools_unit_tests_MPI_4. The Updates.txt notes file for that CI iteration shown at:

shows the two commits:

e15b635:  Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date:   Tue Aug 15 15:28:37 2017 -0600

...

0125af1:  Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date:   Tue Aug 15 15:28:33 2017 -0600

...

That corresponds to the recorded push:

Wed Aug 16 12:37:48 MDT 2017

commit e15b6358a6aa24d080ef9840816d1c3c47df5fd8
Author:     Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Tue Aug 15 15:28:37 2017 -0600
Commit:     Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 12:37:02 2017 -0600

    Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc

    From repository at sierra-git.sandia.gov:/git/sierra.base.git

    At commit:
    commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
    Author: Greg Sjaardema <gdsjaar@sandia.gov>
    Date:   Mon Aug 14 10:35:14 2017 -0600

        APREPRO: Fix so will compile with intel-14

Commits pushed:
e15b635 Snapshot of sierra.base.git from commit 888d2b5321002c5ca3e595ad8ccf14a9c9b4addc
0125af1 Snapshot of sierra.base.git from commit d3329e7df739281ba6d5955d965ff2dcf6e3864d

These commits and push are part of the official integration process for changes to STK and SEACAS in SIERRA back to Trilinos (one commit for STK and one commit for SEACAS). The testing process for those commits does not use the checkin-test-sems.sh script, which allowed this broken test to get pushed. This test was disabled in a follow-up push, and the first fixed CI iteration, started at "Aug 17, 2017 - 10:00 UTC", was

which involved the commits pulled in:

which included the commit:

ddb3fe7:  STK: Disable stk_tools test due to continuous failure.
Author: Brent Perschbacher <bmpersc@sandia.gov>
Date:   Wed Aug 16 17:50:59 2017 -0600

M   packages/stk/stk_unit_tests/stk_tools/CMakeLists.txt

That commit was pushed as part of the push:

Wed Aug 16 17:55:21 MDT 2017

commit ddb3fe783a7d6aa8390429dae4c974e8f847079a
Author:     Brent Perschbacher <bmpersc@sandia.gov>
AuthorDate: Wed Aug 16 17:50:59 2017 -0600
Commit:     Brent Perschbacher <bmpersc@sandia.gov>
CommitDate: Wed Aug 16 17:53:01 2017 -0600

    STK: Disable stk_tools test due to continuous failure.

    Issue was reported in #1615. I need to speak with the STK team
    to figure out the right fix, but for now disabling.

Commits pushed:
ddb3fe7 STK: Disable stk_tools test due to continuous failure.

This means the CI build was broken for over 7 hours.

This issue was fixed back in the native SIERRA sources and then snapshotted back to Trilinos, and the test STKUnit_tests_stk_tools_unit_tests_MPI_4 reappeared in the CI iteration started at "Aug 18, 2017 - 17:59 UTC":

bartlettroscoe commented 7 years ago

The second failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:

Thu Aug 17 11:03:52 MDT 2017

commit 79def1e59538a35535afb1fb6e43bebf7d105805
Author:     Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 11:01:08 2017 -0600
Commit:     Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 11:03:17 2017 -0600

    Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning

Commits pushed:
79def1e Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
36340b7 MueLu: Hopefully fixing outstanding MueLu error on Geminga

It looks like that push did not use the checkin-test-sems.sh script, which allowed the error to get pushed.

This resulted in the CI build of Trilinos being broken continuously for only about an hour, from "Thu Aug 17 11:03:52 MDT 2017" to "Thu Aug 17 12:06:46 MDT 2017". Therefore, it seems unlikely that anyone's pushes would have been stopped due to this. And only people pushing to packages upstream from Xpetra:

would have had their push stopped. So it is unlikely that anyone was inconvenienced by this bad push.


DETAILS:

The second failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 17, 2017 - 17:05 UTC":

and showed the failing test Xpetra_BlockedCrsMatrix_UnitTests_MPI_4. The commits pulled in this CI iteration are shown at:

which shows the commits:

930f58a:  Ctest: More dorksaber warning cleanup
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 17 11:04:35 2017 -0600

M   cmake/ctest/drivers/dorksaber/TrilinosCTestDriverCore.dorksaber.gcc.cmake

79def1e:  Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 17 11:01:08 2017 -0600

M   cmake/ctest/drivers/dorksaber/ctest_linux_nightly_mpi_release_tpetrakernels_experimental_dorksaber.cmake
M   cmake/ctest/drivers/dorksaber/ctest_linux_nightly_serial_release_muelu_matlab_dorksaber.cmake

36340b7:  MueLu: Hopefully fixing outstanding MueLu error on Geminga
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Jul 6 14:27:05 2017 -0600

M   packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp

Therefore, this failure likely corresponds to the push:

Thu Aug 17 11:03:52 MDT 2017

commit 79def1e59538a35535afb1fb6e43bebf7d105805
Author:     Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 11:01:08 2017 -0600
Commit:     Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 11:03:17 2017 -0600

    Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning

Commits pushed:
79def1e Ctest: Adding -fno-var-tracking to dorksaber builds to remove weird gcc warning
36340b7 MueLu: Hopefully fixing outstanding MueLu error on Geminga

As shown in the commit log, it does not look like the checkin-test-sems.sh script was used to test and push this (which explains how this error was able to get pushed).

This was fixed pretty quickly in the very next CI iteration started at "Aug 17, 2017 - 19:49 UTC":

with that test going from failing to passing. The Updates.txt file for that CI iteration shown at:

shows the commits:

4d0b31:  MueLu: clean up Aria driver
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date:   Tue Aug 15 15:54:18 2017 -0600

M   packages/muelu/research/tawiesn/aria/Driver.cpp

5f46414:  MueLu: remove FacadeClassFactory from Crada driver routine
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date:   Tue Aug 15 14:27:00 2017 -0600

M   packages/muelu/research/tawiesn/crada/Driver.cpp

65b1fdc:  Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 17 12:05:41 2017 -0600

M   packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp

So the fixing commit was likely part of the push:

Thu Aug 17 12:06:46 MDT 2017

commit 65b1fdc934709b7cbce9326904fa9652d0492eb7
Author:     Chris Siefert <csiefer@sandia.gov>
AuthorDate: Thu Aug 17 12:05:41 2017 -0600
Commit:     Chris Siefert <csiefer@sandia.gov>
CommitDate: Thu Aug 17 12:05:41 2017 -0600

    Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"

    This reverts commit 36340b745843e3ebc801567fa086c6fb86b48c1f.

Commits pushed:
65b1fdc Revert "MueLu: Hopefully fixing outstanding MueLu error on Geminga"

So this was fixed quickly by just reverting the commit.

bartlettroscoe commented 7 years ago

The third CI failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:

Fri Aug 18 15:51:37 MDT 2017

commit ec85c46917ac532a676f68ce2a27b305fbfbb4f9
Merge: 5eeea40 8d29d08
Author:     Mehmet Deveci <mndevec@sandia.gov>
AuthorDate: Fri Aug 18 15:51:14 2017 -0600
Commit:     Mehmet Deveci <mndevec@sandia.gov>
CommitDate: Fri Aug 18 15:51:14 2017 -0600

    Merge branch 'develop' of github.com:trilinos/Trilinos into develop

Commits pushed:
ec85c46 Merge branch 'develop' of github.com:trilinos/Trilinos into develop
5eeea40 Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622

which broke the two tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetra_MPI_4.

It looks like that push did not use the checkin-test-sems.sh script, which allowed the error to get pushed.

The failing tests caused by this push were later fixed by a different developer in the push:

Sat Aug 19 10:59:57 MDT 2017

commit 0ffff8da7fc16b6aa231052efee836c065a23421
Author:     Andrey Prokopenko <prokopenkoav@ornl.gov>
AuthorDate: Thu Aug 17 13:14:54 2017 -0400
Commit:     Andrey Prokopenko <prokopenkoav@ornl.gov>
CommitDate: Sat Aug 19 12:59:46 2017 -0400

    MueLu: remove Tpetra version of Isorropia

    Build/Test Cases Summary
    Enabled Packages: MueLu
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=430,notpassed=0 (85.98 min)
    Other local commits for this build/test group: a93e979917

Commits pushed:
0ffff8d MueLu: remove Tpetra version of Isorropia
a93e979 MueLu: updating interface tests for Ifpack2 5eeea40 changes

This resulted in the CI build of Trilinos being broken continuously for 19 hours, from "Fri Aug 18 15:51:37 MDT 2017" to "Sat Aug 19 10:59:57 MDT 2017". Therefore, anyone who would have tried to push to MueLu or any of its upstream packages:

using the checkin-test-sems.sh script during that time period would have had their push stopped.

The only push recorded between the breaking and fixing pushes was:

Sat Aug 19 00:42:08 MDT 2017

commit ba9cd117e4ac749f36c4d41240f06512df332915
Author:     Mauro Perego <mperego@sandia.gov>
AuthorDate: Fri Aug 18 18:40:35 2017 -0600
Commit:     Mauro Perego <mperego@sandia.gov>
CommitDate: Fri Aug 18 19:41:09 2017 -0600

    Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.

Commits pushed:
ba9cd11 Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.
fdda779 Intrepid2: allow function clone to accept an input view with rank 3

Since MueLu does not depend on Intrepid2, this push would have been allowed to go through (but, as you can see, it does not look like the checkin-test-sems.sh script was used for this push either). However, it is not clear if anyone's pushes were stopped during this time period.

Also note that there was a single failure of the test PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 that appears to have not been caused or fixed by any commit. Therefore, we need to keep an eye on this as a potentially fragile test.


DETAILS:

The third failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 18, 2017 - 22:03 UTC":

with the failing tests MueLu_ParameterListInterpreterTpetra_MPI_1 and MueLu_ParameterListInterpreterTpetra_MPI_4. The commits pulled in this CI iteration, shown at:

were:

ec85c46:  Merge branch 'develop' of github.com:trilinos/Trilinos into develop
Author: Mehmet Deveci <mndevec@sandia.gov>
Date:   Fri Aug 18 15:51:14 2017 -0600

5eeea40:  Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622
Author: Mehmet Deveci <mndevec@sandia.gov>
Date:   Fri Aug 18 15:50:53 2017 -0600

M   packages/ifpack2/src/Ifpack2_Relaxation_decl.hpp
M   packages/ifpack2/src/Ifpack2_Relaxation_def.hpp

8d29d08:  MueLu: replace tabs by spaces
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date:   Fri Aug 18 11:43:50 2017 -0600

M   packages/muelu/research/max/XpetraSplitting/Test_muelu.cpp
M   packages/muelu/research/max/XpetraSplitting/Test_xpetra.cpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_Level_def.hpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_MatrixSplitting.hpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_RegionAMG_decl.hpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_RegionAMG_def.hpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_RegionHandler_decl.hpp
M   packages/muelu/research/max/XpetraSplitting/Xpetra_RegionHandler_def.hpp

7a37f4a:  MueLu: add Belos solver to Aria Driver
Author: Tobias Wiesner <tawiesn@sandia.gov>
Date:   Fri Aug 18 11:18:38 2017 -0600

M   packages/muelu/research/tawiesn/aria/Driver.cpp

From looking at the set of tests that failed and the commits pushed, it is not clear what caused the failure. Looking at the details of the failing tests at:

I am seeing similar failures. For example, I see the same failure in both tests:

 Level 2
  Build (MueLu::RebalanceTransferFactory)
EasyParameterListInterpreter/repartition4_np4.xml : failed

These same two MueLu tests failed in the next CI iteration, started at "Aug 19, 2017 - 10:00 UTC":

In addition, a new Panzer test PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 failed in that CI iteration. The commits for that CI iteration, shown at:

were:

ba9cd11:  Intrepid2: Added unit-test for testing otrientation tools for Tet. Still working on it. Tested HGRAD and HDIV. HDIV works only for low order basis functions.
Author: Mauro Perego <mperego@sandia.gov>
Date:   Fri Aug 18 18:40:35 2017 -0600

M   packages/intrepid2/refactor/unit-test/Orientation/Serial/CMakeLists.txt
A   packages/intrepid2/refactor/unit-test/Orientation/Serial/test_orientation_TET.cpp
A   packages/intrepid2/refactor/unit-test/Orientation/test_orientation_TET.hpp

fdda779:  Intrepid2: allow function clone to accept an input view with rank 3
Author: Mauro Perego <mperego@sandia.gov>
Date:   Fri Aug 18 18:37:38 2017 -0600

M   packages/intrepid2/refactor/src/Shared/Intrepid2_RealSpaceToolsDef.hpp

Neither of these commits would seem to be responsible for this new Panzer test failure so that could be a fluke.

The CI build after that, starting at "Aug 19, 2017 - 17:01 UTC":

was totally clean, showing the three tests MueLu_ParameterListInterpreterTpetra_MPI_1, MueLu_ParameterListInterpreterTpetra_MPI_4, and PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Order-1 moving from failing to passing.

The commits pulled in this CI iteration, shown at:

were:

0ffff8d:  MueLu: remove Tpetra version of Isorropia
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
Date:   Thu Aug 17 13:14:54 2017 -0400

M   packages/muelu/src/Rebalancing/MueLu_IsorropiaInterface_decl.hpp
M   packages/muelu/src/Rebalancing/MueLu_IsorropiaInterface_def.hpp

a93e979:  MueLu: updating interface tests for Ifpack2 5eeea40 changes
Author: Andrey Prokopenko <prokopenkoav@ornl.gov>
Date:   Sat Aug 19 11:25:17 2017 -0400

M   packages/muelu/test/interface/Output/MLaux_tpetra.gold
M   packages/muelu/test/interface/Output/MLcoarse1_tpetra.gold
M   packages/muelu/test/interface/Output/MLcoarse2_tpetra.gold
...

So it looks like Andrey fixed the failing MueLu tests that were caused by the earlier commit 5eeea40. The push that contained commit 5eeea40 was:

Fri Aug 18 15:51:37 MDT 2017

commit ec85c46917ac532a676f68ce2a27b305fbfbb4f9
Merge: 5eeea40 8d29d08
Author:     Mehmet Deveci <mndevec@sandia.gov>
AuthorDate: Fri Aug 18 15:51:14 2017 -0600
Commit:     Mehmet Deveci <mndevec@sandia.gov>
CommitDate: Fri Aug 18 15:51:14 2017 -0600

    Merge branch 'develop' of github.com:trilinos/Trilinos into develop

Commits pushed:
ec85c46 Merge branch 'develop' of github.com:trilinos/Trilinos into develop
5eeea40 Ifpack2: added a parameter to avoid symmetrization and write the given matrix to output. #1622

As you can see, there is no sign that the checkin-test-sems.sh script was used to push these commits, which would explain the failures that occurred.

The push that fixed this was:

Sat Aug 19 10:59:57 MDT 2017

commit 0ffff8da7fc16b6aa231052efee836c065a23421
Author:     Andrey Prokopenko <prokopenkoav@ornl.gov>
AuthorDate: Thu Aug 17 13:14:54 2017 -0400
Commit:     Andrey Prokopenko <prokopenkoav@ornl.gov>
CommitDate: Sat Aug 19 12:59:46 2017 -0400

    MueLu: remove Tpetra version of Isorropia

    Build/Test Cases Summary
    Enabled Packages: MueLu
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=430,notpassed=0 (85.98 min)
    Other local commits for this build/test group: a93e979917

Commits pushed:
0ffff8d MueLu: remove Tpetra version of Isorropia
a93e979 MueLu: updating interface tests for Ifpack2 5eeea40 changes

Therefore, the CI build of Trilinos (with failing MueLu tests) was broken from "Fri Aug 18 15:51:37 MDT 2017" to "Sat Aug 19 10:59:57 MDT 2017", or about 19 hours.

bartlettroscoe commented 7 years ago

The fourth and final CI failure that occurred over the period 8/13/2017 - 8/28/2017 was caused by the push:

Mon Aug 21 12:19:11 MDT 2017

commit 0c7f6312ff2fe596f672ee9b771ca989ee61afe1
Author:     Matthias Mayr <mmayr@sandia.gov>
AuthorDate: Fri Jun 30 10:09:28 2017 -0700
Commit:     Matthias Mayr <mmayr@sandia.gov>
CommitDate: Mon Aug 21 11:18:20 2017 -0700

    Xpetra: reduce number of for-loops in concatenateMaps()

    Build/Test Cases Summary
    Enabled Packages: MueLu, Stokhos, Xpetra
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)
    Other local commits for this build/test group: ffe7619, bd7e914, 130bcaf, cc0b9a7, c406050, e995e6c

Commits pushed:
0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()
ffe7619 Xpetra: updated doxygen documentation
bd7e914 MueLu: fixed compiler warning
130bcaf Xpetra: update documenation for bgs_apply
cc0b9a7 MueLu: Added BlockedJacobiSmoother
c406050 updated list of developers
e995e6c Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators

which broke the test MueLu_BlockedTransfer_Tpetra_MPI_4 for the post-push CI build run on ceerws1113. The checkin-test-sems.sh script looks to have been used, but only 1144 tests are shown when there should have been 1146 tests run (see below). So it is not clear if this test was passing on this person's machine, or was failing and they explicitly disabled it, or if something else happened (the push was not logged to the trilinos-checkin-tests email list so we can't see). But this test for the same version of Trilinos passed on my machine crf450, so it seems this test passed on some platforms and failed on others (i.e. a very badly behaving test).

This resulted in the CI build of Trilinos being broken continuously for the better part of 4 days, from "Mon Aug 21 12:19:11 MDT 2017" to "Fri Aug 25 09:23:39 MDT 2017". Therefore, anyone who would have tried to push to MueLu or any of its upstream packages:

using the checkin-test-sems.sh script during that time period might have had their push stopped. There is evidence that this happened to several people (see below).

This failure represents a difficult case in that the test appears to have passed on some machines but failed on others. But it also represents a bit of a failure of the development community that it took the better part of 4 days to address. In the meantime, it definitely impacted people's work (as evidenced below).

To see how big an impact this had on people's productivity and to get to the bottom of what happened, we would need to:

If we dig into this more as a learning use case, we will create a new Trilinos GitHub issue to do so.

I will bring up how to better deal with failures like this at the next Trilinos Leaders Meeting to more quickly minimize impact on Trilinos developers and users.


DETAILS:

The fourth failure that occurred over the period 8/13/2017 - 8/28/2017 was captured in the CI iteration started at "Aug 21, 2017 - 18:20 UTC":

with the newly added failing test MueLu_BlockedTransfer_Tpetra_MPI_4. The commits pulled in this CI iteration, shown at:

were:

0c7f631:  Xpetra: reduce number of for-loops in concatenateMaps()
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Fri Jun 30 10:09:28 2017 -0700

M   packages/xpetra/src/Utils/Xpetra_MapUtils.hpp

ffe7619:  Xpetra: updated doxygen documentation
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Fri Jun 30 09:57:52 2017 -0700

M   packages/xpetra/doc/Xpetra_DoxygenDocumentation.hpp

bd7e914:  MueLu: fixed compiler warning
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Fri Jun 30 09:37:08 2017 -0700

A   packages/muelu/src/Utils/MueLu_UtilitiesBase_decl.hpp.orig

130bcaf:  Xpetra: update documenation for bgs_apply
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Tue Jun 27 09:35:17 2017 -0700

M   packages/xpetra/src/BlockedCrsMatrix/Xpetra_BlockedCrsMatrix.hpp

cc0b9a7:  MueLu: Added BlockedJacobiSmoother
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Tue Jun 27 09:32:50 2017 -0700

M   packages/muelu/src/CMakeLists.txt
...
A   packages/stokhos/src/muelu/explicit_instantiation/MueLu_BlockedJacobiSmoother.cpp
...

c406050:  updated list of developers
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Tue Jun 27 08:50:14 2017 -0700

M   packages/muelu/doc/MueLu_DoxygenDocumentation.hpp

e995e6c:  Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators
Author: Matthias Mayr <mmayr@sandia.gov>
Date:   Tue May 9 14:08:57 2017 -0700

M   packages/muelu/src/CMakeLists.txt
...
M   packages/stokhos/src/muelu/explicit_instantiation/MueLu_BlockedCoarseMapFactory.cpp
...

Given that 18 new tests showed up, one would assume that the new tests were turned on by the commit "e995e6c: Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators". All of these commits were pushed at the same time in the push:

Mon Aug 21 12:19:11 MDT 2017

commit 0c7f6312ff2fe596f672ee9b771ca989ee61afe1
Author:     Matthias Mayr <mmayr@sandia.gov>
AuthorDate: Fri Jun 30 10:09:28 2017 -0700
Commit:     Matthias Mayr <mmayr@sandia.gov>
CommitDate: Mon Aug 21 11:18:20 2017 -0700

    Xpetra: reduce number of for-loops in concatenateMaps()

    Build/Test Cases Summary
    Enabled Packages: MueLu, Stokhos, Xpetra
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)
    Other local commits for this build/test group: ffe7619, bd7e914, 130bcaf, cc0b9a7, c406050, e995e6c

Commits pushed:
0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()
ffe7619 Xpetra: updated doxygen documentation
bd7e914 MueLu: fixed compiler warning
130bcaf Xpetra: update documenation for bgs_apply
cc0b9a7 MueLu: Added BlockedJacobiSmoother
c406050 updated list of developers
e995e6c Removed HAVE_MUELU_EXPERIMENTAL guard for blocked operators

So in this case, it looks like the checkin-test-sems.sh script was used. Yet, the test is shown as failing on this post-push CI server. Therefore, we need to dig a little deeper to see how this could happen.

The first clue that something is not quite right is that the above commit log shows only 1144 tests (all passing). But the post-push CI server shows a total of 1146 tests (1145 passing, 1 failing). So it looks like two tests were missing in the pre-push testing. How can this be? Let's dig deeper. First, note that this push does not appear to have been logged to the trilinos-checkin-test mailman list for August shown at:

If it had been logged, it would have appeared in between the following two logged pushes:

It is possible that the argument --send-final-push-email-to was overridden to zero it out so we did not get that push logged. Otherwise, we would have seen what machine this push occurred on and found other details that might explain what happened.

Given that the push was not logged to the mail list, I will need to see if I can reproduce this CI build myself on my own machine crf450, which is not the same machine (ceerws1113) that the post-push CI server runs on.

First, I check out that exact version of Trilinos:

$ cd ~/Trilinos.base2/Trilinos/
$ git fetch
$ git checkout 0c7f631
Note: checking out '0c7f631'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 0c7f631... Xpetra: reduce number of for-loops in concatenateMaps()

Then we run the checkin-test-sems.sh script to match what is shown above:

$ cd Trilinos.base2/CHECKIN/

$ ./checkin-test-sems.sh --enable-packages=MueLu,Stokhos,Xpetra --local-do-all

After the configure completed and while the build was running, I ran ctest -N to see how many tests were reported:

$ ctest -N | grep "Total Tests"
Total Tests: 1146

So that shows 1146 tests, instead of the 1144 tests shown in the commit log for the above commit "0c7f631 Xpetra: reduce number of for-loops in concatenateMaps()". That makes me think that perhaps the person who ran the checkin-test-sems.sh script somehow (either by accident or on purpose) disabled two tests.

Having verified that the total number of CI tests run by my usage of the checkin-test-sems.sh script locally agrees with the number of post-push CI tests on CDash (1146 each), I will kill the checkin-test-sems.sh script and just run it on the MueLu test suite to see if I can reproduce this failing MueLu test:

$ checkin-test-sems.sh --no-enable-fwd-packages --enable-packages=MueLu --local-do-all

So that returned all passing tests on my machine showing:

100% tests passed, 0 tests failed out of 71

Label Time Summary:
MueLu    = 266.48 sec (74 tests)

Total Test time (real) = 192.94 sec

(NOTE: Clearly there is a defect in CTest in that it claims there were 74 MueLu tests but there are only 71 total tests :-( I will check with Kitware on that.)

And looking at the MPI_RELEASE_DEBUG_SHARED_PT/ctest.out file, I see:

35/71 Test #33: MueLu_BlockedTransfer_Tpetra_MPI_4 ...........................   Passed    1.71 sec

So the test ran and passed on my RHEL6 machine crf450, but it failed in the post-push CI build run on ceerws1113 shown at:

which showed the failure:

Computing Ac (block) (MueLu::BlockedRAPFactory)
  MxM: A x P

p=0: *** Caught standard std::exception of type 'std::logic_error' :

 /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:

 Throw number = 1

 Throw test that evaluated to true: (!haveGlobalConstants_)

 Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.

p=2: *** Caught standard std::exception of type 'std::logic_error' :

 /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:

 Throw number = 1

 Throw test that evaluated to true: (!haveGlobalConstants_)

 Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.

p=3: *** Caught standard std::exception of type 'std::logic_error' :

 /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:

 Throw number = 1

 Throw test that evaluated to true: (!haveGlobalConstants_)

 Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.

p=1: *** Caught standard std::exception of type 'std::logic_error' :

 /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/tpetra/core/src/Tpetra_CrsGraph_def.hpp:860:

 Throw number = 1

 Throw test that evaluated to true: (!haveGlobalConstants_)

 Tpetra::CrsGraph<int, int, Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace>, false>::getGlobalNumEntries(): The matrix does not have globalConstants computed, but the user has requested them.
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

This is very strange. One would need to investigate further but it appears that there was some type of undefined behavior in the code that caused it to behave differently on different RHEL machines. It would take more effort to dig into this but since it seems to be fixed now, it is likely not worth the effort to do so.

In any case, this failing test was left to linger for days and it did not get fixed until the CI iteration that started at "Aug 25, 2017 - 15:27 UTC":

where this test MueLu_BlockedTransfer_Tpetra_MPI_4 went from failing to passing. The commits pulled in this CI iteration are shown at:

were:

33b1cc6:  Xpetra: Because nested BloockCrsMatrices make things more 'fun'
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Fri Aug 25 08:15:02 2017 -0600

M   packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp

4c20362:  Xpetra: GlobalConstants call for blocked MMM
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 24 22:36:13 2017 -0600

M   packages/xpetra/sup/Utils/Xpetra_MatrixMatrix.hpp

79ff5a8:  MueLu: muemex.cpp warning clean-up
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Fri Aug 25 08:54:14 2017 -0600

M   packages/muelu/matlab/bin/muemex.cpp

d97a8dd:  MueLu: Last warning clean-up in TentativeP and Aggregation for OpenMP
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Fri Aug 25 08:48:57 2017 -0600

M   packages/muelu/src/Graph/UncoupledAggregation/MueLu_AggregationPhase3Algorithm_kokkos_def.hpp
M   packages/muelu/src/Transfers/Smoothed-Aggregation/MueLu_TentativePFactory_kokkos_def.hpp
M   packages/muelu/test/unit_tests_kokkos/Aggregates_kokkos.cpp

From the summary it is not clear which commit fixed this failing test (but it would not be hard to figure out with a simple manual bisection). But looking at the push log, it seems likely that this test was fixed by the push:

Fri Aug 25 09:23:39 MDT 2017

commit 33b1cc628e01c9f6e22f6a6fd3dd72f3402ebf9f
Author:     Chris Siefert <csiefer@sandia.gov>
AuthorDate: Fri Aug 25 08:15:02 2017 -0600
Commit:     Chris Siefert <csiefer@sandia.gov>
CommitDate: Fri Aug 25 09:22:35 2017 -0600

    Xpetra: Because nested BloockCrsMatrices make things more 'fun'

    Build/Test Cases Summary
    Enabled Packages: Xpetra
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1146,notpassed=0 (66.87 min)
    Other local commits for this build/test group: 4c20362

Commits pushed:
33b1cc6 Xpetra: Because nested BloockCrsMatrices make things more 'fun'
4c20362 Xpetra: GlobalConstants call for blocked MMM

So it looks like the CI build was broken continuously for the better part of 4 days from "Mon Aug 21 12:19:11 MDT 2017" to "Fri Aug 25 09:23:39 MDT 2017". This had to have impacted many attempted pushes.

In any case, it is interesting to see how the Trilinos development community responded to this failure. First, note that this failure was reported in GitHub issue #1633 created on Aug 22 (yea for @william76!). So the issue got reported the day after the failure (so it was already failing for as much as a day with no action) and the MueLu developers were notified. But the test was not fixed until three days later when the push noted above was performed (thanks @csiefer2!).

What is interesting is that several pushes occurred in that time period, and many of them used the checkin-test-sems.sh script and tested changes to MueLu. How did they do that? As I showed above, that test MueLu_BlockedTransfer_Tpetra_MPI_4 likely passed on some people's machines. But did it fail on other people's machines and impact their pushes?

One example we can see is the push:

Thu Aug 24 10:34:15 MDT 2017

commit bd419666ac10b8fc61304e17f534e93744a105e3
Author:     Luc Berger-Vergiat <lberge@sandia.gov>
AuthorDate: Wed Aug 23 14:45:20 2017 -0600
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
CommitDate: Thu Aug 24 10:33:39 2017 -0600

    MueLu: catching more kokkos header changes in tests

    Build/Test Cases Summary
    Enabled Packages: MueLu, Xpetra
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1145,notpassed=0 (150.74 min)
    Other local commits for this build/test group: 50f78e1

Commits pushed:
bd41966 MueLu: catching more kokkos header changes in tests
50f78e1 MueLu: change header include Kokkos_CrsMatrix.hpp to KokkosSparse_CrsMatrix.hpp

That push was logged to the trilinos-checkin-tests email list at:

and that log shows:

  passed: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=1145,notpassed=0

  Thu Aug 24 10:32:54 MDT 2017

  Enabled Packages: MueLu, Xpetra
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Forward Packages
  Hostname: geminga.sandia.gov
  Source Dir: /home/lberge/Research/checkin/Trilinos/cmake/tribits/ci_support/../../..
  Build Dir: /home/lberge/Research/checkin/Trilinos/checkin/MPI_RELEASE_DEBUG_SHARED_PT

  CMake Cache Varibles:  ...
  Extra CMake Options: -DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON
  Make Options: -j4 
  CTest Options: -j4 

  Pull: Passed (0.00 min)
  Configure: Passed (2.65 min)
  Build: Passed (110.31 min)
  Test: Passed (37.77 min)

  100% tests passed, 0 tests failed out of 1145

See the extra CMake option -DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON? So it looks like Luc followed the instructions at:

and disabled this known failing test. That is great!
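For reference, a hedged sketch of what that kind of pre-push local disable can look like (assuming the checkin-test.py --extra-cmake-options argument, which is what the "Extra CMake Options" line above appears to reflect; the package list is just an example):

$ ./checkin-test-sems.sh --enable-packages=MueLu,Xpetra \
    --extra-cmake-options="-DMueLu_BlockedTransfer_Tpetra_MPI_4_DISABLE=ON" \
    --do-all --push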

What about the other pushes during this time period? A grep of the push log file shows:

Fri Aug 25 09:23:39 MDT 2017
Commit:     Chris Siefert <csiefer@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1146,notpassed=0 (66.87 min)

Fri Aug 25 09:18:24 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (23.42 min)

Thu Aug 24 20:44:06 MDT 2017
Commit:     Chris Siefert <csiefer@sandia.gov>

Thu Aug 24 14:19:12 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (42.64 min)

Thu Aug 24 12:19:02 MDT 2017
Commit:     Mehmet Deveci <mndevec@sandia.gov>

Thu Aug 24 10:52:46 MDT 2017
Commit:     Mehmet Deveci <mndevec@sandia.gov>

Thu Aug 24 10:34:15 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1145,notpassed=0 (150.74 min)

Thu Aug 24 09:24:18 MDT 2017
Commit:     Michael Wolf <mmwolf@sandia.gov>

Wed Aug 23 19:51:10 MDT 2017
Commit:     Christian Robert Trott (-EXP) <crtrott@sandia.gov>

Wed Aug 23 16:29:30 MDT 2017
Commit:     Tobias Wiesner <tawiesn@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=446,notpassed=0 (27.94 min)

Wed Aug 23 15:00:24 MDT 2017
Commit:     Bill Spotz <wfspotz@sandia.gov>

Wed Aug 23 09:06:01 MDT 2017
Commit:     Paul Wolfenbarger <prwolfe@users.noreply.github.com>

Wed Aug 23 08:31:09 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (66.08 min)

Tue Aug 22 20:49:29 MDT 2017
Commit:     crtrott <crtrott@sandia.gov>

Tue Aug 22 18:15:37 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (14.65 min)

Tue Aug 22 17:42:45 MDT 2017
Commit:     Matthias Mayr <mmayr@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (24.37 min)

Tue Aug 22 16:30:55 MDT 2017
Commit:     Brent Perschbacher <bmpersc@sandia.gov>

Tue Aug 22 16:02:10 MDT 2017
Commit:     Jason M. Gates <jmgate@sandia.gov>

Tue Aug 22 15:29:16 MDT 2017
Commit:     Kara Peterson <kjpeter@sandia.gov>

Tue Aug 22 15:23:02 MDT 2017
Commit:     Luc Berger-Vergiat <lberge@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=447,notpassed=0 (6.84 min)

Tue Aug 22 13:40:25 MDT 2017
Commit:     Andrey Prokopenko <prokopenkoav@ornl.gov>

Mon Aug 21 12:19:11 MDT 2017
Commit:     Matthias Mayr <mmayr@sandia.gov>
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=1144,notpassed=0 (123.40 min)

The above shows the first breaking push on "Mon Aug 21 12:19:11 MDT 2017" and the last fixing push on "Fri Aug 25 09:23:39 MDT 2017". In between, there were 21 total pushes, but only 8 of them used the checkin-test-sems.sh script, and it was only used by:

Several of the people who pushed during that period who usually use the checkin-test-sems.sh script to push did not do so, which I think includes:

I wonder if they did not use the checkin-test-sems.sh script to push because of this failing test and just pushed manually instead? And perhaps they did not know how to selectively disable known failing tests, as described at:

?

It looks like that was the case for @crtrott as shown in commit 72441d507073a2929003c6e53118036c2eaa0d17.

bartlettroscoe commented 7 years ago

In a follow-up to the most recent CI build failure described above, I contacted Matthias and he said that he saw the test MueLu_BlockedTransfer_Tpetra_MPI_4 failing on his machine and was instructed to just disable it locally and push. (That explains the reduced number of tests reported in his push.) He did not realize the impact that pushing with this failing test would have on the CI testing process and on other developers.

To help address this, I will add a new GitHub wiki page describing the proper ways to disable a given test for different use cases. I will then add a section to the checkin-test-sems.sh wiki page on the proper way to address failing tests on one's local machine before pushing, so as to avoid breaking the CI (and other) builds.

bartlettroscoe commented 7 years ago

FYI: The CI build is broken due to failing test Stokhos_KokkosCrsMatrixUQPCEUnitTest_Serial_MPI_1 just pushed (see #1703). I am pushing a disable of that test for the CI build (and only the CI build).

dridzal commented 7 years ago

@bartlettroscoe : Is anyone keeping track of the tests that were disabled in the CI build? I assume that these tests will be fixed at some point. Are you the point of contact for re-enabling them? Specifically, ROL has some tests that had to be disabled. I recall one (checkAlmostSureConstraint), but I no longer recall the others (if any).

bartlettroscoe commented 7 years ago

@dridzal,

Is anyone keeping track of the tests that were disabled in the CI build? I assume that these tests will be fixed at some point. Are you the point of contact for re-enabling them? Specifically, ROL has some tests that had to be disabled. I recall one (checkAlmostSureConstraint), but I no longer recall the others (if any).

Yes, there are explicit instructions in every GitHub issue on how the person fixing the failing test should first revert the disable commit before fixing the test locally. See https://github.com/trilinos/Trilinos/issues/1703#issuecomment-327652385.
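A minimal sketch of that re-enable-and-fix flow (the SHA and package name below are placeholders):

# Revert the CI-only disable commit to re-enable the test locally:
$ git revert <sha-of-the-disable-commit>
# ... fix the test ...
# Then retest and push with the checkin-test script:
$ ./checkin-test-sems.sh --enable-packages=<PackageWithTheTest> --do-all --push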

And if you want to see the current set of disabled tests in the CI build, just look at the bottom of the file:

(That is the correct way to disable a test for only the CI build, as opposed to disabling it locally and pushing, as happened recently, which resulted in a failing test showing up in the post-push CI build and for everyone else.)

Note that one of those currently listed is for a ROL test that has not yet been addressed. See #1596.

The process is:

Note that once we can upgrade CMake/CTest/CDash, these disabled tests will be displayed on CDash as "Not Run" / "Disabled", so you will see them there too (but they will not trigger CDash error emails). See:

dridzal commented 7 years ago

Great!

tjfulle commented 7 years ago

@bartlettroscoe, this may be addressed elsewhere, but is there a best practice for the case that I want to test on multiple machines before pushing (or opening a PR)? For example, with Tpetra, I want to make sure that the standard checkin tests pass on my RHEL blade (obviously), but also different builds on different architectures. In this case, I may push to a branch on my fork and then update and run tests on the several different machines from that branch. The only way (that I know of) to then report test results is to amend the last commit manually with test results and force push. Perhaps there is a better way?

mhoemmen commented 7 years ago

@tjfulle I like your thinking :-) . Right now, one must add text manually to the commit message, explaining what tests passed where. We don't necessarily want to annotate every commit with all the information needed to replicate a test, but some brief mention that (e.g.,) it was tested with CUDA in a debug build could be nice.

bartlettroscoe commented 7 years ago

@tjfulle:

is there a best practice for the case that I want to test on multiple machines before pushing (or opening a PR)? For example, with Tpetra, I want to make sure that the standard checkin tests pass on my RHEL blade (obviously), but also different builds on different architectures. In this case, I may push to a branch on my fork and then update and run tests on the several different machines from that branch. The only way (that I know of) to then report test results is to amend the last commit manually with test results and force push. Perhaps there is a better way?

@mhoemmen:

I like your thinking :-) . Right now, one must add text manually to the commit message, explaining what tests passed where. We don't necessarily want to annotate every commit with all the information needed to replicate a test, but some brief mention that (e.g.,) it was tested with CUDA in a debug build could be nice.

That would not be too hard to do with crafty usage of the checkin-test.py script. You could run the checkin-test.py script on each machine separately (e.g. with --local-do-all) for a special --extra-st-builds=<specialBuildNamei> build, then copy a subset of the <specialBuildNamei>/*.out files and all of the <specialBuildNamei>/*.success files to your CEE LAN machine. The checkin-test-sems.sh script could then be run with the full list of --extra-st-builds=<specialBuildName1>,<specialBuildName2>,..., and it would correctly list those extra builds, amend the top commit message with all of those builds, and archive the details of those builds to the trilinos-checkin-tests email list on push. If you combine the branch-moving and remote-run approach demonstrated in remote-pull-test-push.sh with the aggregation of multiple runs of the checkin-test.py script in checkin-test-crf450-cmake-2.8.11.sh, and add some scp commands to copy files back, then you basically have it. One could even write some reusable utility scripts to help drive a process like this so that many developers could use it.
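A rough sketch of that flow (all machine names, directories, and build names below are placeholders, not a working setup):

# On each remote test machine, run the extra build locally:
$ ./checkin-test.py --extra-st-builds=<specialBuildName1> --local-do-all
# Copy the <specialBuildName1>/*.out and *.success result files back to the push machine:
$ scp -r <remoteMachine>:<remote-checkin-dir>/<specialBuildName1> <local-checkin-dir>/
# On the push machine, aggregate the extra builds and push:
$ ./checkin-test-sems.sh --extra-st-builds=<specialBuildName1>,<specialBuildName2> --push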

But to go any further, we should discuss this more in a separate GitHub issue (since this issue is focused on monitoring the standard CI build).

bartlettroscoe commented 7 years ago

@ibaned pointed out a case where a broken test stopped a push sometime around July 17 (see https://github.com/trilinos/Trilinos/issues/1511#issuecomment-328863552). Unfortunately, CDash only keeps 6 weeks of results so we can't even see that.

mhoemmen commented 7 years ago

@bartlettroscoe wrote:

But to go any further, we should discuss this more in a separate GitHub issue (since this issue is focused on monitoring the standard CI build).

I opened a new issue, #1725, to discuss this. Thanks!

bartlettroscoe commented 7 years ago

@ibaned pointed out a case where a broken test stopped a push sometime around July 17 (see #1511 (comment)). Unfortunately, CDash only keeps 6 weeks of results so we can't even see that.

Okay, I misinterpreted that comment in #1511. The issue is that the test Teko_testdriver_tpetra_MPI_1 seems to be passing on some machines and failing on others. That is currently blocking a push for Kokkos in #1721 on @crtrott's machine.

We need to get to the bottom of why this test may not be passing on @crtrott's machine but passes on all of the others involved in Trilinos automated testing.

Note, an issue like this impacts every testing process you can possibly imagine with a heterogeneous set of machines like we have with Trilinos developers. Even with automated PR testing (i.e. #1155), people still need to be able to reproduce failing builds and tests consistently across machines if possible.

bartlettroscoe commented 7 years ago

FYI: The machine that runs the standard CI build, ceerws1113, will be down from at least 4 pm MDT on 9/15/2017 to at least 6 pm MDT on 9/16/2017. Therefore, Murphy's law for software says that the CI build will be broken when it starts back up :-)

bartlettroscoe commented 7 years ago

FYI: There were a lot of test failures in the CI build this morning shown at:

This appears to have been due to an env problem of some type breaking MPI exec. For example, for the failing test ThyraCore_test_std_ops_serial_MPI_1, it showed the failure:

-------------------------------------------------------------------------
Open MPI was unable to obtain the username in order to create a path
for its required temporary directories.  This type of error is usually
caused by a transient failure of network-based authentication services
(e.g., LDAP or NIS failure due to network congestion), but can also be
an indication of system misconfiguration.

Please consult your system administrator about these issues and try
again.
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file util/session_dir.c at line 390
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Out of resource (-2) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Out of resource (-2) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[ceerws1113:62121] [[151,0],0] ORTE_ERROR_LOG: Out of resource in file orterun.c at line 694

But these failures went away after the Thyra tests ran.

Therefore, this was not a problem with the code but likely a problem of some type with the CEE LAN env used on the machine ceerws1113 that runs the standard CI build.

bartlettroscoe commented 7 years ago

This is a brief comment to log some stats about this CI process using the checkin-test.py script in Trilinos. How many people used the checkin-test-sems.sh script to push this year (FY17) as compared to last year (FY16)? I will use today's date, 9/16, in each year as the boundary.

$ git shortlog -ns --grep="Build/Test Cases Summary"  --after="9/16/2015"  --before="9/16/2016" | wc -l
31
$ git shortlog -ns --grep="Build/Test Cases Summary"  --after="9/16/2016"  --before="9/16/2017" | wc -l
34

Wow, that looks like no improvement at all. Not many more people used the checkin-test.py script to push in FY17 as compared to FY16, which is surprising. But how many people enabled all downstream packages? Let's look at that:

[rabartl@crf450 Trilinos (develop)]$  git shortlog -ns --grep="Enabled all Forward Packages"  --after="9/16/2015"  --before="9/16/2016" | wc -l
29
[rabartl@crf450 Trilinos (develop)]$  git shortlog -ns --grep="Enabled all Forward Packages"  --after="9/16/2016"  --before="9/16/2017" | wc -l
31

Boy, so that is not that different either.

The real metric that I would like to get is: what fraction of pushes that modified source files in a Trilinos package used the checkin-test.py script? Unfortunately, I don't know of any way to generate those statistics before I started logging pushes to Trilinos on May 26 as part of #1362. That is because GitHub does not even give you push stats, not to mention the actual push info (like I am collecting now).
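
The closest rough proxy I can think of from the git history alone is something like the following, which approximates "pushes" by first-parent commits on 'develop'. It over-counts pushes that contain several commits, so it is only a ballpark, not the real metric:

# Rough proxy only: compare checkin-test-amended commits to all first-parent
# commits on 'develop' over the same window (dates are just an example).
total=$(git rev-list --count --first-parent \
  --after="9/16/2016" --before="9/16/2017" develop)
ct=$(git rev-list --count --grep="Build/Test Cases Summary" \
  --after="9/16/2016" --before="9/16/2017" develop)
echo "checkin-test commits: ${ct} of ${total} first-parent commits"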

And we also can't look at CDash history for any data because it only keeps 6 weeks of history. So we can't say anything about stability or improved productivity (other than developers' and customers' anecdotal statements).

So we have no way to get relevant metrics about the stability of Trilinos or improved usage of the checkin-test-sems.sh script :-( It is a basic premise of empirical software engineering that you need good metrics to know whether changes are making things better or worse. It looks like we just can't get that data for Trilinos (at least not by looking directly at Trilinos itself; perhaps customers could do better).

Hopefully all of this will be unimportant once an effective pull-request-based testing and integration process is fully implemented and enforced (i.e. #1155). But even then, we need metrics to know how well that is working. What will those metrics be?

ibaned commented 7 years ago

What will those metrics be?

One simple metric I can think of is number of instances where downstream apps find issues. A concrete example is to create a few extra issues labels besides "bug", namely "compile error (Trilinos)", "compile error (application)", "compile warning (Trilinos)", "compile warning (application)", "test failure (Trilinos)", "test failure (application)". Then we can collect statistics on how many issues with each label were opened in a particular period of time. I would also expand "(Trilinos)" to be either "(Trilinos/develop)" or "(Trilinos/master)". Another useful statistic would be, for each such issue, time between opening and closing. If a problem with Trilinos master is found, the issue cannot be closed until the fix reaches the master branch.
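
Once such labels exist, the counts could be pulled automatically. For example, something like the following rough sketch against the GitHub search API would count issues opened in a given window (the label name and date range here are just the proposed/hypothetical examples from above, not existing labels):

# Count issues carrying one of the proposed labels that were opened in a window:
curl -s 'https://api.github.com/search/issues?q=repo:trilinos/Trilinos+label:%22test+failure+(Trilinos/develop)%22+created:2017-06-01..2017-09-16' \
  | grep '"total_count"'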

bartlettroscoe commented 7 years ago

One simple metric I can think of is number of instances where downstream apps find issues. A concrete example is to create a few extra issues labels besides "bug", namely "compile error (Trilinos)", "compile error (application)", "compile warning (Trilinos)", "compile warning (application)", "test failure (Trilinos)", "test failure (application)". Then we can collect statistics on how many issues with each label were opened in a particular period of time.

Those would be great. The main challenge with that is getting people to actually remember to add the right labels. In my experience, metrics that require a bunch of people to remember to do something will be very incomplete. If possible, it is much better if we can collect metrics that don't require any specific action by anyone (other than the work to set up automated metrics extraction and archiving). It would be best if we could directly monitor the customer applications' integration processes with Trilinos and record how frequently they are broken and for how long. But even that would be hard to interpret because "broken" means different things depending on the integration model a customer app has chosen (e.g. directly pulling from 'develop' like EMPIRE developers currently do, or keeping a separate repo clone and only updating Trilinos if everything passes for the app, like SPARC or SIERRA).

mhoemmen commented 7 years ago

@bartlettroscoe Your remote check-in test script has been fantastic for me! I love the "fire and forget" feature. I use it for nearly every commit -- the only exceptions have been "emergency" pushes to fix a known build issue.

I would consider adding labels part of triaging a new issue. I think it's good hygiene for Trilinos developers to go through issues now and then, and add labels as appropriate.

ibaned commented 7 years ago

it is much better if we can collect metrics that don't require any specific action by anyone (other than the work to set up automated metrics extraction and archiving)

While I agree this is true in the very long run, I think the work to set up automated extraction is quite daunting and unlikely to reach the level of completeness that we can achieve manually unless a lot of dedicated funding is poured into it.

hard to interpret because "broken" means different things depending on integration model a customer app has chosen

This is also part of the reason automated extraction is so hard: each application has such different infrastructure and practices for testing. What all applications have in common is that they have to get in touch with Trilinos developers to fix any problem, and I think most of that already flows through GitHub. An additional benefit of the manual approach is that we will catch reports from users outside our organization, which are still significant compared to internal reports.

The main challenge with that is that is getting people to actually remember to add the right labels.

While there will be a misstep or two inevitably, the Kokkos team has had good success adding labels that have very strict meaning and play a part in an automated workflow (our InDevelop label indicates a fix has been pushed to the develop branch, and all such issues are automatically closed when develop is merged to master).
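
The automation itself is simple. Roughly, it amounts to something like the sketch below (an illustration only, not necessarily our actual script; it assumes a GitHub API token in GITHUB_TOKEN and the jq utility):

# After 'develop' is merged to 'master': close every open issue labeled
# "InDevelop".
for n in $(curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
    'https://api.github.com/repos/kokkos/kokkos/issues?labels=InDevelop&state=open&per_page=100' \
    | jq -r '.[].number'); do
  curl -s -X PATCH -H "Authorization: token ${GITHUB_TOKEN}" \
    -d '{"state":"closed"}' \
    "https://api.github.com/repos/kokkos/kokkos/issues/${n}" > /dev/null
done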

bartlettroscoe commented 7 years ago

While there will be a misstep or two inevitably, the Kokkos team has had good success adding labels that have very strict meaning and play a part in an automated workflow (our InDevelop label indicates a fix has been pushed to the develop branch, and all such issues are automatically closed when develop is merged to master).

Okay, I am convinced. Let's create a separate GitHub issue called something like "Define labels and rules for Trilinos application issues and metrics" and then we can discuss it more there and bring in Mike H., Jim W., and other interested individuals. But before we define any more labels, I think we need to better organize them along the lines of #1619.

ibaned commented 7 years ago

The merge of PR #1563 caused a compile error in the standard SEMS build used for checkin, which blocked the checkin of #1767. The details are logged as issue #1772.

bartlettroscoe commented 7 years ago

@ibaned,

The merge of PR #1563 caused a compile error in the standard SEMS build used for checkin, which blocked the checkin of #1767. The details are logged as issue #1772.

Thanks for catching this so fast! You reported this more than 2.5 hours before the post-push CI build showed this failure (because it was already processing an earlier push that was pretty expensive to rebuild).

bartlettroscoe commented 7 years ago

FYI: I reverted the bad merge commit referenced in #1772 and it looks like the standard CI build is clean again this morning (and hopefully it got reverted in time not to blow up the various Nightly builds). I provided instructions in #1772 on how to go about fixing this and then merging, testing, and pushing again (this time using the checkin-test-sems.sh script to avoid breaking the standard CI build).
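
For reference, reverting a merge commit like this is just the following (the SHA below is a placeholder; the actual commit is referenced in #1772):

# Revert the bad merge on 'develop', keeping the first-parent side (i.e.
# undoing everything the merge brought in), then push:
git checkout develop
git pull
git revert -m 1 <bad-merge-sha>
git push origin develop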

bartlettroscoe commented 6 years ago

FYI: None of the CI builds are showing up today. See #1880.

bartlettroscoe commented 6 years ago

FYI: There have been random-looking build failures showing up in the CI build running on ceerws1113 starting Friday night, shown at:

These look like disk write failures. I have disabled the CI build on ceerws1113 until I can determine what is happening. (I am running df -h but it is hanging.)

bartlettroscoe commented 6 years ago

FYI:

This morning df -h is not hanging anymore and it shows that /scratch has 634G of free space. I have restarted the CI server on ceerws1113 and it is posting to CDash at:

We will see what happens from here. If I see any of the same types of system-type failures, I will kill the CI server and investigate further.

bartlettroscoe commented 6 years ago

FYI: The CI test suites for Tpetra and Xpetra are currently broken as described in #1929. Therefore, if you are pushing changes to Tpetra, Xpetra, or packages upstream from these, those failures will block your push using the checkin-test-sems.sh script. (Or you can locally disable these tests as described at https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing#disable_already_failing; see the sketch below.) I will try to back out these commits today if they are not fixed in the next hour or so.
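
As a minimal sketch of the local disable (assuming the TriBITS <fullTestName>_DISABLE cache-variable convention described on that wiki page; the test names below are placeholders, not the actual failing tests):

# Add disables for the already-failing tests to your local configure so they
# do not block your push:
cmake \
  -D Tpetra_SomeAlreadyFailingUnitTest_MPI_4_DISABLE=ON \
  -D Xpetra_SomeAlreadyFailingUnitTest_MPI_4_DISABLE=ON \
  <other configure options> \
  <path-to-Trilinos-source>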

mhoemmen commented 6 years ago

@bartlettroscoe Please feel free to revert the commits if they are hindering progress. Thanks!

bartlettroscoe commented 6 years ago

FYI: The CI server on ceerws1113 is showing catastrophic failures again this morning (see #1932). I have killed the CI server and will investigate more carefully this time (see #1932 for details).

bartlettroscoe commented 6 years ago

FYI: I reverted the not-ready-for-prime-time commits described in #1929 and I manually restarted the CI server on ceerws1113 (and I will watch carefully to see that it shuts down tonight; see #1932). We should hopefully see a 100% clean CI build again (and pushes should not be stopped right now either).

bartlettroscoe commented 6 years ago

FYI: The restarted CI build completed 100% clean:

bartlettroscoe commented 6 years ago

FYI: Both of the CI builds were broken last night (two failing SEACAS tests) due to a push last night (see details at #2039). I am in the process of disabling these two tests for the CI build. But this should not impact people's pushes with checkin-test-sems.sh or with the automated PR testing unless they are triggering the enable of SEACAS tests.