trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Set up robust portable pre-push and post-push CI tools and process based on the SEMS Dev Env #482

Closed bartlettroscoe closed 7 years ago

bartlettroscoe commented 8 years ago

Next Action Status:

New CI build is pushed to 'develop', the new post-push CI server is running, and the new checkin-test-sems.sh script is ready for more testing and review ... Not going to pursue other extensions (e.g. Mac OSX, tcsh, etc.). See https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179. Next: Leave in review until 1/1/2017, then close.

Blocked By: #158, #410, #362

Blocking: #380

Related To: #370, #475, #476

CC: @trilinos/framework

Description:

Trilinos has not had an effective pre-push CI development process for many years. When the checkin-test.py script was first created (back in 2008 or so), the primary stack of packages was based on Epetra and the main external dependencies were C/C++/Fortran compilers and BLAS and LAPACK. Those dependencies and the major Trilinos customers at the time were used to select the initial set of Primary Tested (initially called Primary Stable) packages that is still used to this day. However, since that time, many new Trilinos packages have been added and important Trilinos customers are relying on many of these newer packages (e.g. SEACAS, STK, Tpetra, Phalanx, Panzer, etc.). In addition, these newer packages require more dependencies than just BLAS and LAPACK: TPLs like Boost, HDF5, NetCDF, ParMETIS, SuperLU, and others used by Trilinos are now also very important to many Trilinos customers.

Another problem with the current pre-push CI testing process for Trilinos is that Trilinos developers use a variety of different machines, OSs, compiler versions, TPL implementations, etc. to develop and push changes for Trilinos. As a result, people who tried to use the checkin-test.py script suffered failed pushes due to tests failing on their machine for reasons unrelated to their changes. In contrast, projects that have a uniform pre-push CI testing env don't experience these types of problems. One example is CASL VERA, which uses TriBITS and the checkin-test.py script and has a set of uniform development machines where developers almost never see tests fail in their build of code that passed in another developer's build. Therefore, the only failed builds and tests are due to their own local changes. In that project, there is no trepidation about running the checkin-test.py script and everyone uses it uniformly for nearly every push.

Another problem with the current CI testing process for Trilinos is that the post-push CI server that posts to CDash enables a different set of packages and TPLs from what the pre-push CI build does (and of course uses different compilers, MPI, etc.). Therefore, a CI build/test failure seen on CDash may not be seen with the checkin-test.py script locally, and vice versa. This makes it difficult for developers to determine whether the failures they see on their own machine are due to their local changes, to differences between the env on their machine and the machine running the CI build posting to CDash, to a different set of enabled packages and TPLs, or to something else.

As a result, the stability of the main Trilinos development branch (now the 'develop' branch, see #370) has degraded from what it was 5+ years ago. This is a problem because Trilinos needs to have a more stable 'develop' branch in order to more frequently update from the 'develop' branch to the 'master' branch (see #370).

This story is to address all of these shortcomings of the current Trilinos CI testing process. The new SEMS Dev Env (#158) provides an opportunity to create a fairly portable (at least for SNL staff members) uniform pre-push and post-push CI testing environment for the first time.

Here is the plan for setting up a more effective CI process based on the SEMS Dev Env, the checkin-test.py script, and CTest/CDash:

  1. Select a standard pre-push CI build env based on the SEMS Dev Env: Currently, GCC 4.7.2 and OpenMPI 1.6.5 are being used for the post-push CI build that posts to CDash. These selections should be reexamined and potentially changed. This will be used to create a standard load_ci_sems_dev_env.sh script, which just calls the local_sems_dev_env.sh script with the selections.
  2. Select an expanded/revised set of Primary Tested (PT) packages and TPLs: This revised set should be based on the most important packages and TPLs to current Trilinos customers. Any important TPL not already supported by the SEMS Dev Env may need to be added (i.e. to the Trilinos space under the /projects/ NFS mount). Revising the set of PT packages and TPLs is being addressed in #410.
  3. Set up a standard checkin-test-sems.sh script that all Trilinos developers can use to push changes to the Trilinos 'develop' branch (see the sketch after this list): This should automatically load the correct standard SEMS Dev Env by sourcing load_ci_sems_dev_env.sh. This should likely run only a single build of Trilinos to speed up the testing/push process. (If there is a single build, it would likely include -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_FLOAT=OFF -DTrilinos_ENABLE_COMPLEX=OFF. See #362 about turning off float and complex by default.)
  4. Change the main post-push CI server that posts to CDash to use the exact same build as the default builds for the checkin-test-sems.sh script: This is needed to catch the violations of the additive test assumption of branches. This can also be used to alert Trilinos developers when there are failures in the standard CI build or to verify that failures they are seeing are not their doing. If other post-push CI builds are desired, like non-MPI serial and full release builds, then those can be added as extra CI builds (we just need extra machines for that).
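
As a rough illustration of items 1 and 3 above, the wrapper script might look something like the sketch below (paths, module selections, and option values are assumptions for illustration only, not the final committed script):

  #!/bin/bash
  # checkin-test-sems.sh -- illustrative sketch, not the committed script.
  # Assume this wrapper sits at the top of the Trilinos source tree.
  TRILINOS_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)

  # Item 1: load the standard SEMS-based pre-push CI env (e.g. GCC 4.7.2 + OpenMPI 1.6.5).
  source "${TRILINOS_DIR}/cmake/load_ci_sems_dev_env.sh"

  # Item 3: forward all arguments to the TriBITS checkin-test.py script, with the
  # single standard MPI release/debug shared-libs build as the default build.
  exec "${TRILINOS_DIR}/cmake/tribits/ci_support/checkin-test.py" \
    --default-builds=MPI_RELEASE_DEBUG_SHARED \
    "$@"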

After this Story is complete, then we can create new Stories to get Trilinos developers to use the checkin-test-sems.sh script and to commit to keeping the CI build(s) 100% passing all the time, with "Stop the Line" urgency to fix any failures.

Definition of Done:

Decisions that need to be made:

Tasks:

  1. Create drafts for load_ci_sems_dev_env.sh and checkin-test-sems.sh [Done]
  2. Discuss this Story at a Trilinos Leaders Meeting [Done]
  3. Work #410 to select the updated set of PT packages and TPLs [Done]
  4. Work #362 to disable float and complex by default [Done]
  5. Select the new set (or just one) of --default-builds for the checkin-test.py script and therefore the checkin-test-sems.sh script [Done]
    • Make updates to Trilinos and checkin-test.py script on branch better-ci-build-482 ... IN PROGRESS ...
    • Get proposed changes reviewed (quickly) [Done]
    • Create wiki documentation for usage checkin-test-sems.sh [Done]
    • Commit changes to 'develop' branch [Done]
  6. Create a new post-push CI build on crf450 that uses the identical CI build as checkin-test-sems.sh --local-do-all [Done]
    • Set up cron job or Jenkins job to run the build [Done]
    • Run the CI build for several days and have people review it [Done]
  7. Have updated CI process and documentation reviewed ... In Progress ...
  8. Update the existing Jenkins CI build to use the new CI build and then remove the CI build on crf450 ...

bmpersc commented 7 years ago

@tjfulle which SEMS parmetis module were you using? There are 3 versions right now: one that is explicitly 64 bit reals, one that is explicitly 32 bit reals, and one that is undecorated. The undecorated version has some consistency issues with the way it was built, mostly exposed when linked with scotch. The explicitly 64 and 32 bit versions were made specifically to address this and are the only versions that should be used; if you are also using scotch, you need the similarly decorated scotch version for consistency.

bartlettroscoe commented 7 years ago

FYI: I tried to run the checkin-test-sems.sh script on the branch better-ci-build on the CEE LAN machine ceesrv02 and it configured and was building just fine (albeit very slowly) until it ran out of disk space:

...
/tmp/cchke542.s:3174809: Fatal error: can't close CMakeFiles/PanzerAdaptersSTK_tScatterResidual.dir/scatter_residual.cpp.o: No space left on device
...

Anyway, I suspect that the checkin-test-sems.sh script will run right out of the box on any CEE Linux machine.

bartlettroscoe commented 7 years ago

There are 3 versions right now, one that is explicitly 64 bit reals, one that is explicitly 32 bit reals, and one that is undecorated.

So which of these ParMETIS and Scotch versions will work with Trilinos consistently on all these platforms?

tjfulle commented 7 years ago

@bmpersc I'm not using the SEMS modules. @bartlettroscoe showed that many MueLu tests fail using the default SEMS modules. I believe the default parmetis is 64 bit real/int. I got identical results on my machine. The tests pass after I built metis with 32 bit reals (keeping 64 bit integers).

I'm not sure what you mean by undecorated; metis.h requires that IDXTYPEWIDTH and REALTYPEWIDTH be 32 or 64. By default they are both 32.

tjfulle commented 7 years ago

@bmpersc and @bartlettroscoe, to my understanding, the distinction between the 32 and 64 bit builds of metis/parmetis is in how IDXTYPEWIDTH and REALTYPEWIDTH are defined in metis.h. They must be 32 or 64. The three parmetis modules in SEMS are labeled 32bit_parallel, 64bit_parallel, and parallel. A diff of metis.h in 32bit_parallel and 64bit_parallel gives:

33c33
< #define IDXTYPEWIDTH 32
---
> #define IDXTYPEWIDTH 64
43c43
< #define REALTYPEWIDTH 32
---
> #define REALTYPEWIDTH 64

as expected. There is no difference between the 64bit_parallel and parallel metis.h, suggesting that there are really only 2 versions of parmetis.
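
For reference, the check itself is just a grep against the installed header (with <parmetis-install-dir> standing in as a placeholder for wherever the loaded module points):

  $ grep -E '#define (IDXTYPEWIDTH|REALTYPEWIDTH)' <parmetis-install-dir>/include/metis.h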

This is all for Darwin. I don't have access right now to the NFS SEMS mount to check the Linux versions.

Defining IDXTYPEWIDTH=64 and REALTYPEWIDTH=32 allows the previously mentioned MueLu tests to pass for me on Darwin.

bmpersc commented 7 years ago

@bartlettroscoe both of the versions that explicitly state 32 or 64 bit work; that was the point. The sems version that doesn't have that decoration wasn't built correctly and had to be replaced. Technically parmetis wasn't the problem; it was that scotch and parmetis weren't built in a compatible way, so when linked together they caused problems for a small set of codes. However, to make it clear which parmetis and scotch to use together, we rebuilt both in a consistent manner.

@tjfulle you are correct that the differences are fairly minor and only indicate a difference in type sizes. As you can see, SEMS does have a version built with 32 bit reals. Can you try that version to see if you can repeat your success with the MueLu tests?

tjfulle commented 7 years ago

@bmpersc, I can try out the 32 bit version, but I'm at a conference so it'll be a couple days. Perhaps @bartlettroscoe could sooner?

tjfulle commented 7 years ago

@bartlettroscoe does the machine you are testing on have a version of libparmetis in /usr/local/lib?

bartlettroscoe commented 7 years ago

I was just reminded today why people should not be testing and pushing directly from Mac OSX. That is because the OSX file system is not case sensitive. For example, it has happened many times on many projects that someone used the wrong case for an include file name in a #include "file_name.hpp" directive or for a source file in a CMakeLists.txt file and it worked just fine on OSX. But after they pushed and someone on Linux pulled it, they got a broken build.

Therefore, for that reason alone, I would argue that no one should be testing and pushing directly from Mac OSX. What do other people think?

ibaned commented 7 years ago

The OSX file system can be configured to be case-sensitive. That could be an added requirement for a machine to push. It is also relatively easy for automated tools to check whether the filesystem is case-insensitive.
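
A minimal sketch of such a check (plain shell, nothing Trilinos-specific) that the push tooling could run looks like:

  # Create a file and probe it with different case; on a case-insensitive
  # filesystem the second test succeeds even though only "CaseTest" exists.
  tmpdir=$(mktemp -d)
  touch "$tmpdir/CaseTest"
  if [ -e "$tmpdir/casetest" ]; then
    echo "filesystem is case-insensitive"
  else
    echo "filesystem is case-sensitive"
  fi
  rm -rf "$tmpdir"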

bartlettroscoe commented 7 years ago

The OSX file system can be configured to be case-sensitive.

That breaks the SNL OSX machines. Ask @rppawlo

kddevin commented 7 years ago

As I sit in a presentation room today at SC16 and look around me, I see that roughly 85% of the laptops people are using are Mac laptops. Granted, that isn't a scientific sampling, but it does show that Macs are prevalent in our community. Thus, I think we should not restrict developers from using Macs for their testing and development. I would want to know precisely how many build errors case insensitivity has caused in the last year (e.g., how severe a problem this is in practice) before accepting a "solution" that excludes such a commonly used platform.

tjfulle commented 7 years ago

Wouldn't encouraging developers to use every combination of platform/environment/compiler they can reasonably use increase code robustness? Sure, there might be some pushes to the develop branch from one platform that might temporarily break a build on another, but I imagine those would be fixed pretty quickly. The end result would be more confidence in the master branch (which should not have any "broken" pushes) because it would have been tested in more environments.

tjfulle commented 7 years ago

@bartlettroscoe wrote:

I was just reminded today why people should not be testing and pushing directly from Mac OSX. That is because the OSX file system is not case sensitive. For example, it has happened many times on many projects that someone used the wrong case for an include file name in a #include "file_name.hpp" directive or for a source file in a CMakeLists.txt file and it worked just fine on OSX. But after they pushed and someone on Linux pulled it, they got a broken build.

Therefore, for that reason alone, I would argue that no one should be testing and pushing directly from Mac OSX. What do other people think?

On the other hand, testing on case insensitive file systems will weed out cases where developers write files with the same name but different case. I still see it as a long term win/win if more platforms/compilers/environments are used to develop/test/push.

bartlettroscoe commented 7 years ago

I think we should not restrict developers from using Macs for their testing and development.

That is not what we are doing here at all. We are just saying, that if we are going to pick one build on one platform to best protect the productivity of all developers and all customers on all platforms, then the best build for doing that is an MPI build (with certain options set) on Linux. For example, case sensitivity errors are caught on Linux but not Mac. And most Trilinos customers are running on Linux not Mac. What we would like to do is to set things up so that if our CI build passed on Linux then there is a high degree of probability that it will also pass on OSX (perhaps with a standard set of compilers, MPI, TPLs, etc.). The major internal customer (that I can't name here) integration effort almost dictates that that one build is a GCC 4.7.2 build on Linux (again, if we are going to pick just one build on one platform for pre-push CI).

Of course we want to support developers on Macs as well. But if making sure that Mac developers don't pull code that is broken on their machine is a super high priority, then we need to consider the git.git workflow mentioned in option-3 in the above comment. That is how you do it. But we can start by adding some automated testing for Mac OSX. Not one such test currently exists (see above).

Wouldn't encouraging developers to use every combination of platform/environment/compiler they can reasonably use increase code robustness?

Yes, if used for other development, but not for final testing and pushing. For the final test and push, you need to test everything impacted and ensure that you are not adding any new regressions to the official CI build. Because Trilinos tests and even builds of packages are often broken (because people test only on their own platform before they push, or don't run the checkin-test.py script even on Linux), developers are less likely to enable and test downstream packages, which increases the chances of breaking them even more, and it goes downhill from there. That makes the code less robust, not more robust. To make the code more portable, we have a set of Nightly builds that test that.

Sure, there might be some pushes to the develop branch from one platform that might temporarily break a build on another, but I imagine those would be fixed pretty quickly.

No, they can go on for months. For example, see #826. The idea that "people will fix it quickly" creates a nightmare that is described in:

(a must read for anyone interested in this topic).

I think the issues involved here are more involved than can be adequately discussed in a GitHub issue. For those who are interested, I would encourage you to read the document "Design Patterns for Git Workflows" that I have been working on, which discusses all of these issues in detail. See:

bartlettroscoe commented 7 years ago

On the other hand, testing on case insensitive file systems will weed out cases where developers write files with the same name but different case. I still see it as a long term win/win if more platforms/compilers/environments are used to develop/test/push.

That is an argument for testing on both Linux and OSX before merging to the 'develop' branch. That takes us naturally to the git.git workflow mentioned in option-3, or a sophisticated PR-based testing infrastructure like those used for SST or MOOSE mentioned in option-4 in the above comment. Again, option-3 requires almost no infrastructure but is very labor intensive, while option-4 requires more infrastructure but is more push-button for developers (until they need to reproduce failures that they can't reproduce on their own machine).

Again, let's discuss this in more detail at the next Trilinos Leaders Meeting.

srajama1 commented 7 years ago

@bartlettroscoe : I am not sure I agree with your assessment of Option 4. Can we "borrow" some of the tools from SST? From my understanding of the SST presentation, it took six months to set this up. I can't understand why one would need new machines, but machines are cheap compared to the productivity improvement for every Trilinos developer. It is really hard to see why we would favor a manual process over Option 4.

bartlettroscoe commented 7 years ago

I went back to GCC 4.7.2 for the Linux CI build on branch better-ci-build-482. I ran this CI build from scratch on 16 processors with:

$ ./checkin-test-sems.sh -j16 --enable-all-packages=on --local-do-all --wipe-clean

This produced the following email:

READY TO PUSH: Trilinos: crf450.srn.sandia.gov

Wed Nov 16 10:31:56 MST 2016

Enabled Packages: 
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Packages

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED => passed: passed=2286,notpassed=0 (81.12 min)

*** Commits for repo :

0) MPI_RELEASE_DEBUG_SHARED Results:
------------------------------------

  passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=2286,notpassed=0

  Wed Nov 16 10:31:56 MST 2016

  Enabled Packages: 
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Packages
  Hostname: crf450.srn.sandia.gov
  Source Dir: /home/rabartl/Trilinos.base/Trilinos
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED

  CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_DEBUG_SYMBOLS=ON -DTrilinos_ENABLE_CI_TEST_MODE=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF -DTrilinos_ENABLE_TESTS=ON -DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
  Make Options: -j16
  CTest Options: -j16 

  Pull: Not Performed
  Configure: Passed (2.45 min)
  Build: Passed (67.11 min)
  Test: Passed (11.57 min)

  100% tests passed, 0 tests failed out of 2286

  Label Time Summary:
  Amesos               =  18.82 sec (13 tests)
  Amesos2              =   8.88 sec (7 tests)
  Anasazi              = 103.95 sec (71 tests)
  AztecOO              =  17.36 sec (17 tests)
  Belos                =  92.98 sec (61 tests)
  Domi                 = 157.07 sec (125 tests)
  Epetra               =  50.73 sec (61 tests)
  EpetraExt            =  13.49 sec (10 tests)
  FEI                  =  41.77 sec (43 tests)
  Galeri               =   4.58 sec (9 tests)
  GlobiPack            =   1.70 sec (6 tests)
  Ifpack               =  62.90 sec (53 tests)
  Ifpack2              =  47.16 sec (32 tests)
  Intrepid             = 202.79 sec (152 tests)
  Intrepid2            = 106.84 sec (107 tests)
  Isorropia            =   8.45 sec (6 tests)
  Kokkos               = 255.67 sec (21 tests)
  ML                   =  47.49 sec (34 tests)
  MueLu                = 259.34 sec (54 tests)
  NOX                  = 136.16 sec (100 tests)
  OptiPack             =   6.55 sec (5 tests)
  Panzer               = 266.06 sec (125 tests)
  Phalanx              =   3.63 sec (15 tests)
  Pike                 =   3.35 sec (7 tests)
  Piro                 =  25.29 sec (11 tests)
  ROL                  = 679.04 sec (112 tests)
  RTOp                 =  14.04 sec (24 tests)
  Rythmos              = 160.33 sec (83 tests)
  SEACAS               =   6.25 sec (8 tests)
  STK                  =  12.77 sec (12 tests)
  Sacado               =  98.58 sec (290 tests)
  Shards               =   1.16 sec (4 tests)
  ShyLU                =   8.41 sec (5 tests)
  Stokhos              = 104.20 sec (74 tests)
  Stratimikos          =  30.69 sec (39 tests)
  Teko                 = 201.45 sec (19 tests)
  Teuchos              =  54.40 sec (123 tests)
  ThreadPool           =   8.16 sec (10 tests)
  Thyra                =  65.90 sec (80 tests)
  Tpetra               = 127.45 sec (122 tests)
  TrilinosCouplings    =  57.11 sec (19 tests)
  Triutils             =   2.32 sec (2 tests)
  Xpetra               =  39.54 sec (16 tests)
  Zoltan               = 197.31 sec (16 tests)
  Zoltan2              = 134.23 sec (91 tests)

  Total Test time (real) = 694.08 sec

  Total time for MPI_RELEASE_DEBUG_SHARED = 81.12 min

The detailed configure and ctest output is shown at:

I had to increase the timeout from 3 minutes to 5 minutes as a few Trilinos tests take a lot longer to run with GCC 4.7.2 than they did with GCC 5.3.0. The most expensive BASIC tests for GCC 4.7.2 are:

 968/2286 Test  #30: KokkosContainers_UnitTest_MPI_1 .......................................   Passed  166.69 sec
2267/2286 Test #2112: ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 ..............   Passed  122.16 sec
1432/2286 Test #1403: Teko_testdriver_tpetra_MPI_1 .........................................   Passed   97.44 sec
 835/2286 Test #537: Zoltan_hg_simple_zoltan_parallel ......................................   Passed   88.00 sec
2210/2286 Test #2107: ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 .....................   Passed   74.91 sec
 745/2286 Test #533: Zoltan_ch_simple_zoltan_parallel ......................................   Passed   66.56 sec
1968/2286 Test #1833: MueLu_ParameterListInterpreterTpetra_MPI_1 ...........................   Passed   57.31 sec
2162/2286 Test #2059: ROL_test_sol_solSROMGenerator_MPI_1 ..................................   Passed   56.74 sec
2238/2286 Test #2118: ROL_example_PDE-OPT_topo-opt_elasticity_example_01_MPI_4 .............   Passed   56.15 sec
1815/2286 Test #1402: Teko_testdriver_tpetra_MPI_4 .........................................   Passed   54.86 sec
 570/2286 Test  #31: KokkosAlgorithms_UnitTest_MPI_1 .......................................   Passed   49.49 sec
2056/2286 Test #1915: Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 ......................   Passed   45.43 sec
 526/2286 Test  #11: KokkosCore_UnitTest_Serial_MPI_1 ......................................   Passed   35.56 sec
2144/2286 Test #2084: ROL_example_parabolic-control_example_03_MPI_1 .......................   Passed   30.96 sec
2007/2286 Test #1911: Rythmos_BackwardEuler_ConvergenceTest_MPI_1 ..........................   Passed   29.24 sec
2284/2286 Test #534: Zoltan_ch_simple_parmetis_parallel ....................................   Passed   27.67 sec
2133/2286 Test #2076: ROL_example_burgers-control_example_06_MPI_1 .........................   Passed   27.12 sec
1850/2286 Test #1831: MueLu_ParameterListInterpreterEpetra_MPI_1 ...........................   Passed   26.98 sec
2233/2286 Test #2119: ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4 ................   Passed   25.34 sec
2279/2286 Test #2245: PanzerAdaptersSTK_MixedPoissonExample-ConvTest .......................   Passed   22.84 sec
1838/2286 Test #1824: MueLu_UnitTestsTpetra_MPI_1 ..........................................   Passed   21.49 sec
1199/2286 Test #1475: Intrepid_test_Discretization_Basis_HGRAD_TRI_Cn_FEM_Test_02_MPI_1 ....   Passed   21.42 sec
2107/2286 Test #2063: ROL_test_sol_checkSuperQuantileQuadrangle_MPI_1 ......................   Passed   21.06 sec
1292/2286 Test #1500: Intrepid_test_Discretization_Integration_Test_07_MPI_1 ...............   Passed   20.28 sec

We may have to work on this a little to bring down the pre-push test time but we can do that later.

bartlettroscoe commented 7 years ago

To show what a CI pre-push build might look like for your average developer, I simulated a change to a TpetraCore source file with:

$ touch packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp

and then the checkin-test.py script would trigger the enable of TpetraCore and everything downstream. I simulated this on the branch better-ci-build-482 with:

$ ./checkin-test-sems.sh --enable-packages=TpetraCore --local-do-all

This produced the following email:

READY TO PUSH: Trilinos: crf450.srn.sandia.gov

Wed Nov 16 16:27:18 MST 2016

Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota Enabled all Forward Packages

Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED => passed: passed=1408,notpassed=0 (16.10 min)

*** Commits for repo :

0) MPI_RELEASE_DEBUG_SHARED Results:
------------------------------------

  passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=1408,notpassed=0

  Wed Nov 16 16:27:18 MST 2016

  Enabled Packages: TpetraCore
  Disabled Packages: PyTrilinos,Claps,TriKota
  Enabled all Forward Packages
  Hostname: crf450.srn.sandia.gov
  Source Dir: /home/rabartl/Trilinos.base/Trilinos
  Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED

  CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_DEBUG_SYMBOLS=ON -DTrilinos_ENABLE_CI_TEST_MODE=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF -DTrilinos_ENABLE_TESTS=ON -DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_TpetraCore:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
  Make Options: -j16
  CTest Options: -j16 

  Pull: Not Performed
  Configure: Passed (2.01 min)
  Build: Passed (5.27 min)
  Test: Passed (8.82 min)

  100% tests passed, 0 tests failed out of 1408

  Label Time Summary:
  Amesos               =  17.90 sec (13 tests)
  Amesos2              =   9.46 sec (7 tests)
  Anasazi              = 104.62 sec (71 tests)
  Belos                =  92.17 sec (61 tests)
  Domi                 = 154.28 sec (125 tests)
  FEI                  =  41.54 sec (43 tests)
  Galeri               =   3.91 sec (9 tests)
  Ifpack               =  58.59 sec (53 tests)
  Ifpack2              =  45.53 sec (32 tests)
  Isorropia            =   7.78 sec (6 tests)
  ML                   =  43.85 sec (34 tests)
  MueLu                = 270.47 sec (54 tests)
  NOX                  = 136.52 sec (100 tests)
  OptiPack             =   6.06 sec (5 tests)
  Panzer               = 263.80 sec (125 tests)
  Pike                 =   2.85 sec (7 tests)
  Piro                 =  24.26 sec (11 tests)
  ROL                  = 663.35 sec (112 tests)
  Rythmos              = 164.30 sec (83 tests)
  ShyLU                =   8.79 sec (5 tests)
  Stokhos              = 100.96 sec (74 tests)
  Stratimikos          =  31.45 sec (39 tests)
  Teko                 = 201.27 sec (19 tests)
  Thyra                =  61.25 sec (80 tests)
  Tpetra               = 122.19 sec (122 tests)
  TrilinosCouplings    =  59.90 sec (19 tests)
  Xpetra               =  39.76 sec (16 tests)
  Zoltan2              = 123.21 sec (91 tests)

  Total Test time (real) = 529.17 sec

  Total time for MPI_RELEASE_DEBUG_SHARED = 16.10 min

So a 16 minute CI build iteration is not too bad. That is pretty close to the XP 10 minute rule:

But even the occasional full from-scratch 82-minute CI build (see above) is not so terrible every once in a while. Pushing once a day (which is all most people should do), this is not terrible overhead, IMHO.

More tweaking to do and then I will submit an official PR to let people review all of the changes for this CI build.

bartlettroscoe commented 7 years ago

I just realized that I have to do #362 to address the issue of float and complex testing. Otherwise, all of the other nightly builds for Trilinos may fail if some complex code fails. We need to decide if we want complex testing on or off by default for Nightly builds of Trilinos. Given the responses that I have gotten back from internal Trilinos customers so far (see #362), I think we should consider turning on testing for std::complex<double> as well. We need to see what impact this will have on the build time and runtime of the pre-push CI build. If that seems to be too high, at the very least, we need to run an additional post-push CI build that turns on std::complex<double> and enable std::complex<double> by default for all other CTest driver builds of Trilinos.

bartlettroscoe commented 7 years ago

Since we have been discussing Mac OSX in this Issue, I will note that we have located a Mac OSX machine that has the SEMS Env mounted that we can use for a couple of Trilinos nightly builds. See:

Back to work getting this Linux CI build finished up ...

bartlettroscoe commented 7 years ago

I tried adding the build and testing of std::complex<double> to the proposed PT CI build. The results using 16 cores on my machine crf450 gave:

PT CI Build without std::complex<double> (from scratch):

  Configure: Passed (2.45 min)
  Build: Passed (67.11 min)
  Test: Passed (11.57 min)

  100% tests passed, 0 tests failed out of 2286

PT CI Build with std::complex<double> (from scratch):

  Configure: Passed (2.52 min)
  Build: Passed (76.81 min)
  Test: Passed (12.15 min)

  100% tests passed, 0 tests failed out of 2333

As one can see, the overall increase in the build and test times was not very large. The build from scratch went up 13% from 67.11 min to 76.81 min and the tests only went up 5% from 11.57 min to 12.15 min. This seems like a small enough increase that we should consider enabling std::complex<double> in pre-push CI testing. But given that most customers don't use any complex types, we should disable complex types by default (see #362).
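
For anyone wanting to reproduce the comparison locally, the second build can be approximated by layering the complex-enable option on top of the standard CI build. This is just a sketch, assuming checkin-test.py's --extra-cmake-options passthrough and the cache variable named in the plan above and in #362:

  $ ./checkin-test-sems.sh -j16 --enable-all-packages=on --local-do-all \
      --extra-cmake-options="-DTrilinos_ENABLE_COMPLEX=ON"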

bartlettroscoe commented 7 years ago

As of the commit:

0acaa65 "Merge branch 'cleanup-better-ci-build-482' into develop (#482, #362, #158, #831, #811, #826, #410)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Nov 28 14:32:22 2016 -0700 (5 hours ago)

This has now been merged to the Trilinos 'develop' branch.

Now I have several other things to do to finish this up starting with getting a poor-man's post-push CI server running to protect this build.

bartlettroscoe commented 7 years ago

I set up the cron job:

# ----------------- minute (0 - 59)
# |  -------------- hour (0 - 23)
# |  |  ----------- day of month (1 - 31)
# |  |  |  -------- month (1 - 12)
# |  |  |  |  ----- day of week (0 - 7) (Sunday=0 or 7)
# |  |  |  |  |
# *  *  *  *  *  command to be executed
  10 8  *  *  *  cd /ascldap/users/rabartl/Trilinos.base/SEMSCIBuild && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out

on my machine crf450. I manually started it just now and it is showing results at:

for the build Linux-GCC-4.7.2-MPI_RELEASE_DEBUG_SHARED_PT_CI.

I will watch this tomorrow to make sure that it is running okay.

Once this runs for a few days on crf450 in this simple cron job, then we can look at setting this up with a proper Jenkins job on another machine.

Now here is what is left to do:

bartlettroscoe commented 7 years ago

The full CI build ran and completed as shown here:

But it had the problem that it also enabled ST code and therefore failed the configure of PyTrilinos and tried to enable TriKota (which of course failed because Dakota is not cloned under it).

I updated the driver script to accommodate this in 1b14dad.

bartlettroscoe commented 7 years ago

The new crontab line is:

  0  7  *  *  *  cd /ascldap/users/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out

This will keep a copy of the trilinos_ci_server.out file from the last day and starts an hour earlier.

I will create a new story to improve this CI build to use Jenkins and/or CTest/CDash to drive and report results.

bartlettroscoe commented 7 years ago

An incremental CI iteration just fired of after an updated Belos file was pulled. See the output at:

which shows it starting at Belos, thereby skipping about half or more of the Trilinos packages.

We can also see the cost for the package-by-package build from scratch at:

for the Linux-GCC-4.7.2-MPI_RELEASE_DEBUG_SHARED_PT_CI build that fired off at 14:00 UTC (7:00 AM MST). The total accumulated times for the different steps are:

Also, I updated the page:

and it is ready to be reviewed. @jwillenbring and/or @bmpersc, can you please review that page?

bartlettroscoe commented 7 years ago

To address the STK and SEACAS sync issues created by the direct pushes to the stk/cmake/Dependencies.cmake and seacas/cmake/Dependencies.cmake files, I created the SEACAS PR:

and the native STK Trac ticket #15930 for getting SEACAS and STK synced up with the various repos (one of which I can't mention here).

@bmpersc, if you have any questions about this, then lets converse in the comments of the STK Trac ticket #15930 (which you are CCed on).

bartlettroscoe commented 7 years ago

An update on the new CI build that I have set up on my machine crf450 ...

Already there have been several incremental CI iterations today shown at:

where the ROL build was broken and then finally fixed. In the incremental iteration where ROL was fixed, it only took the times:

Compare that feedback time to a from-scratch build of all the packages, which took over 3h to build and 26m to run the tests. That is why the incremental CI builds enabled by TribitsCTestDriverCore.cmake are so important; they reduce feedback time.

Anyway, it looks like that automated post-push CI build that is supporting the pre-push CI build is working quite well. I will now update the page:

for the current status of the pre-push and post-push CI builds.

bartlettroscoe commented 7 years ago

I finished updating the page:

@jwillenbring and @bmpersc, can you please have a look? Hopefully this explains the current testing strategy and the current status of Trilinos testing along with:

tjfulle commented 7 years ago

@bartlettroscoe, this is probably something you could fix in about 1 second before completing this task - the getConfigurationSearchPaths function in checkin-test.py has the following:

  # Always look for the configuration file assuming the checkin-test.py script
  # is run out of the standard snapshotted tribits directory
  # <project-root>/cmake/tribits/.
  result.append(os.path.join(thisFileRealAbsBasePath, '..', '..'))

But, checkin-test.py actually lives in <project-root>/cmake/tribits/ci_support, so it should be

  result.append(os.path.join(thisFileRealAbsBasePath, '..', '..', '..'))

This doesn't usually cause a problem, since the function also appends the absolute directory of checkin-test.py, which is usually a symlink in the <project-root> directory. But, if checkin-test.py is symlinked elsewhere (say, in a build directory), it fails to find the configuration file.
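
A minimal reproduction of that failure mode (hypothetical paths, assuming a Trilinos clone at ~/Trilinos):

  $ mkdir -p ~/SomeBuildDir && cd ~/SomeBuildDir
  $ ln -s ~/Trilinos/cmake/tribits/ci_support/checkin-test.py .
  $ ./checkin-test.py --enable-packages=TpetraCore --local-do-all
  # With the current '..', '..' entry, thisFileRealAbsBasePath resolves to
  # ~/Trilinos/cmake/tribits/ci_support, so the appended search path is
  # ~/Trilinos/cmake rather than ~/Trilinos, and neither it nor ~/SomeBuildDir
  # contains the project configuration file.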

bartlettroscoe commented 7 years ago

... checkin-test.py actually lives in <project-root>/cmake/tribits/ci_support, so it should be ...

@tjfulle, I see the issue. Surprised there is not a test to catch this. Here is how we need to fix this:

1) Create a TriBITS GitHub issue for this (since checkin-test.py is developed in TriBITS repo).

2) Create an automated test that exposes the problem and the desired behavior

3) Fix the code to make the test pass

No "fixing" code without adding tests first. Note the policy "All functionality will be covered by testing:". Any chance you have time to give this a try on a TriBITS repo branch?

tjfulle commented 7 years ago

I'll open up an issue at the TriBITS site. I can get to fixing it on Friday, if that's not too late

bartlettroscoe commented 7 years ago

NOTE: The process demonstrated in #896 is what we need to do to keep the CI build clean. To help with this, we should create a github account for the trilinos-framework mail list (or a new more restricted mail list) that will be alerted to all CI failures so that we react to them quickly. This is what we do for the CASL VERA project. I know that is not a very nice job but it has to be done.

The good news is that as more people use the updated checkin-test-sems.sh script, the fewer CDash CI failure emails we will get. I would guess that if everyone used the checkin-test-sems.sh script from a RHEL 6 machine with SEMS, then we would only see a failure once every few weeks or less. (Note that failures can still occur due to violations of the additive test assumption of branches).

bartlettroscoe commented 7 years ago

I wrote an initial draft for a wiki page describing how to do development on any machine you want and then use a remote RHEL 6 machine to do the final test and push:

@jwillenbring, can you please review this wiki page and make suggestions for improving it? (also, just fix obvious typos if you find them and reword things that are unclear.)

I also updated the following wiki pages to link to this new page:

dridzal commented 7 years ago

I suppose that this should work with a CEE LAN remote workstation, with minor changes? (the CEE LAN workstation being the remote)

bartlettroscoe commented 7 years ago

I suppose that this should work with a CEE LAN remote workstation, with minor changes? (the CEE LAN workstation being the remote)

@dridzal, the remote pull, test, and push process should work exactly as described on a CEE LAN workstation. I will be setting up for remote pull, test, push once they get my new center-supported CEE LAN machine set up. I will write a simple remote SSH invocation script to automatically fire things off from my local machine. This will help free up my local machine for development. I will post my remote invocation script on that wiki page once it is complete to use as a template for others who want to copy it for themselves.
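
A first cut of that invocation script will probably be little more than the following sketch (the host name and remote paths are placeholders, and the exact checkin-test options may end up different):

  #!/bin/bash
  # Fire off the remote pull, test, and push from the local development machine.
  # <cee-lan-machine> and the remote clone location are placeholders.
  ssh -q <cee-lan-machine> \
    "cd ~/Trilinos.base/Trilinos && \
     ./checkin-test-sems.sh -j16 --do-all --push" \
    &> remote-pull-test-push.log &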

dridzal commented 7 years ago

And I will start using it as soon as you post it --just let me know.

bartlettroscoe commented 7 years ago

NOTE: The commit f9553e4b8fed5a17c173d139420112bfd71e3a51 that @lxmota just pushed:

commit f9553e4b8fed5a17c173d139420112bfd71e3a51
Merge: d61be65 288daad
Author: Alejandro Mota <amota@crf450.srn.sandia.gov>
Date:   Thu Dec 8 18:14:57 2016 -0700

    Merge branch 'develop' of algol.ca.sandia.gov:/home/amota/LCM/Trilinos into develop

    Build/Test Cases Summary
    Enabled Packages: MiniTensor
    Disabled Packages: PyTrilinos,Claps,TriKota
    Enabled all Forward Packages
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=2,notpassed=0 (1.83 min)
    1) MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX => Test case MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX was not run! => Does not affect push readiness! (-1.00 min)
    Other local commits for this build/test group: 288daad, 3587ec2, 5ff54f5, abe132a, ba9f4dd, ab78bc8, 03f1365, b120fc0, 4f8490d, 01d89d6, 4ed047e, 306c68b

shows that he was able to use the remote pull, test, and push process to push his changes from my RHEL 6 machine crf450. Seeing that his email address is wrong, I realize I need to add instructions for setting the git email and user name.
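
For reference, those instructions amount to running the two standard git settings once on the remote machine (the values below are just examples):

  $ git config --global user.name "First Last"
  $ git config --global user.email first.last@sandia.gov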

I think this is some validation that this might be a workable approach.

@lxmota, can you comment on the difficulty involved in getting the remote pull, test, and push process set up and invoking it? What might we do to make this easier or more straightforward?

bartlettroscoe commented 7 years ago

This story will be closed very shortly.


From: Bartlett, Roscoe A
Sent: Friday, December 09, 2016 4:09 PM
To: 'Trilinos Developers List'
Cc: 'Trilinos Framework List'
Subject: New Trilinos checkin-test policy: checkin-test-sems.sh

Hello Trilinos Developers,

For those Trilinos developers who have chosen to use the checkin-test.py script to safely push changes to Trilinos to try to improve the stability of Trilinos, there is a new preferred process. It involves the usage of a wrapper script checkin-test-sems.sh and its usage is outlined in:

https://github.com/trilinos/Trilinos/wiki/Policies-|-Safe-Checkin-Testing

(this is the link “Pre-push (Checkin) Testing” on the right on the Trilinos GitHub wiki https://github.com/trilinos/Trilinos/wiki.)

For those still using the raw Python checkin-test.py script for other purposes, you can still use the raw script as described in the note at the bottom of that wiki page.

For details on the motivation and status of this effort, see:

https://github.com/trilinos/Trilinos/wiki/Policies-|-Safe-Checkin-Testing

As always, the usage of the checkin-test.py script is optional.

However, I have been informed that, rather than trying to get the checkin-test-sems.sh script working on other platforms (like Mac OSX) and allowing people to continue directly testing and pushing changes to Trilinos ‘develop’ themselves, a pure pull-request model like the one implemented for the INL MOOSE and SNL SST projects will be pursued instead (but the timeframe for this and who will do it is not yet established). Therefore, no work on expanding the usage of the checkin-test-sems.sh approach will continue (and the GitHub issues that focus on that will be closed).

However, for those who still want to continue trying to improve the stability of Trilinos today and choose to use the checkin-test-sems.sh script, while not ideal, enough is in place now to do that effectively as described in the above wiki page (and the pages it links to, such as the local development with remote pull, test, and push process). For those who want more details, you can contact me about the workflow that I am going to personally use to make my pushes safer and less difficult. For me, everything I need to be productive is in place now. (That is, in my pre-push process I will build every package and run every test affected by my local changes that is currently passing the CI build. I will just check the CI build on CDash before I run, so that I know which broken CI packages and tests I need to disable when running the checkin-test-sems.sh script. Given that the CI build is almost constantly broken, one must always do this or their pushes will be stopped.)

Cheers,

-Ross

lxmota commented 7 years ago

@lxmota, can you comment on the difficulty involved in getting the remote pull, test, and push process set up and invoking it? What might we do to make this easier or more straightforward?

It was pretty straightforward. Initially I forgot to set up git properly, so my email address was wrong. But that is stated in your instructions.

The only thing that I would add to the instructions is that for the SEMS environment to work, you need to load the sems-gcc/4.7.2 module.

bartlettroscoe commented 7 years ago

@lxmota,

Thanks for the feedback.

It was pretty straightforward.

Good to hear :-)

Initially I forgot to set up git properly, so my email address was wrong. But that is stated in your instructions.

I added that step after I saw your commit with the wrong email address. That was my omission. About 90% of the setup is what it takes to set up to clone Trilinos and do development on any machine. The only new thing is adding the SSH key and setting the remote (one command).

The only thing that I would add to the instructions is that for the SEMS environment to work, you need to load the sems-gcc/4.7.2 module.

That module and the other modules are loaded automatically when running checkin-test-sems.sh. If you want to manually run do-configure (cmake), make, and ctest, you load the env using load_ci_sems_dev_env.sh as described in:
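
In short, a manual session would look roughly like the following sketch (the build directory layout and do-configure wrapper are whatever you normally use, and the path to load_ci_sems_dev_env.sh is an assumption):

  $ cd ~/Trilinos.base/BUILDS/MY_BUILD              # any build dir
  $ source ~/Trilinos.base/Trilinos/cmake/load_ci_sems_dev_env.sh
  $ ./do-configure                                  # your own wrapper around cmake
  $ make -j16
  $ ctest -j16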

Thanks,

-Ross

bartlettroscoe commented 7 years ago

Given that we are no longer going to pursue extending the checkin-test-sems.sh (or checkin-test.py) script for Trilinos, and that what is there now is sufficient for people like me to be productive pushing to Trilinos when the CI build is not badly broken (see https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179), I am putting this in review.

I will leave in until Jan. 1 2017 for comments and then close.

bartlettroscoe commented 7 years ago

FYI: The CI server for Trilinos has been transferred from my machine crf450 to my CEE LAN blade ceerws1113 (which is about 30% slower). For details, see:

lxmota commented 7 years ago

@bartlettroscoe Thanks Ross.

In the end, I caved in and got a CEE account to test both Trilinos and Sierra, so if you wish you can close my account on your machine.

bartlettroscoe commented 7 years ago

I have set up a new script Trilinos/cmake/std/sems/remote-pull-test-push.sh that can be used to invoke a remote pull/test/push process from a local development machine. I have used this several times now using my CEE LAN machine ceerws1113 as a remote pull/test/push server and it is very nice. Now my machine crf450 is no longer loaded down with processes for testing. I can use all of the cores to continue development.

I updated the documentation at:

No need to document a manual process for this anymore.

bartlettroscoe commented 7 years ago

And I will start using it as soon as you post it --just let me know.

@dridzal, this is now posted at:

I have used this a few times and it seems to work pretty well.

If any ROL developer wants to try this, let me know. If they run into any stumbling blocks I can help them through it.

lxmota commented 7 years ago

@bartlettroscoe Cool, I'll definitely give it a try when I need to change something in MiniTensor or ROL, which should be soonish.

bartlettroscoe commented 7 years ago

Cool, I'll definitely give it a try when I need to change something in MiniTensor or ROL, which should be soonish.

@lxmota, great, let me know if you find any typos or other problems with this.

lxmota commented 7 years ago

@bartlettroscoe I tried the local development/remote test & push script, but I have trouble with the module command.

When the Trilinos/cmake/std/sems/remote-pull-test-push.sh is invoked and does ssh -q ... it fails with the error module: command not found

On the other hand, if I load the definition of the module command in my .bashrc, for example by doing source /etc/profile.d/00-module-load.sh followed by module load sems-env, it fails with ModuleCmd_Load.c(208):ERROR:105: Unable to locate a modulefile for 'sems-env'

So it seems that something is wrong with my module configuration. Is there something SEMS specific that I'm missing?