@tjfulle which SEMS parmetis module were you using? There are 3 versions right now, one that is explicitly 64 bit reals, one that is explicitly 32 bit reals, and one that is undecorated. The undecorated version has some issues with the way it was built not being consistent. This inconsistency is mostly exposed when linked with scotch. The explicitly 64 or 32 bit versions were made specifically to address this and are the only versions that should be used. If you are also using scotch, you need the similarly decorated version for consistency.
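For reference, loading a consistently decorated pair of SEMS modules might look something like the following (the module names and version numbers here are illustrative, not the exact SEMS names; check module avail on your machine):
$ module load sems-env
$ module avail 2>&1 | grep -i -e parmetis -e scotch    # list the decorated variants
$ module load sems-parmetis/4.0.3/64bit_parallel       # hypothetical decorated module name
$ module load sems-scotch/6.0.3/64bit_parallel         # matching decoration for consistency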
FYI: I tried to run the checkin-test-sems.sh script on the branch better-ci-build on the CEE LAN machine ceesrv02, and it configured and was building just fine (albeit very slowly) until it ran out of disk space:
...
/tmp/cchke542.s:3174809: Fatal error: can't close CMakeFiles/PanzerAdaptersSTK_tScatterResidual.dir/scatter_residual.cpp.o: No space left on device
...
Anyway, I suspect that the checkin-test-sems.sh script will run right out of the box on any CEE Linux machine.
There are 3 versions right now, one that is explicitly 64 bit reals, one that is explicitly 32 bit reals, and one that is undecorated.
So which one of these versions of ParMETIS and Scotch will work with Trilinos consistently on all these platforms?
@bmpersc I'm not using the SEMS modules. @bartlettroscoe showed that many MueLu tests fail using the default SEMS modules. I believe the default parmetis is 64 bit real/int. I got identical results on my machine. The tests pass after I built metis with 32 bit reals (keeping 64 bit integers).
I'm not sure what you mean by non-decorated; metis.h requires that IDXTYPEWIDTH and REALTYPEWIDTH be 32 or 64. By default they are both 32.
@bmpersc and @bartlettroscoe, to my understanding, the distinction between the 32 and 64 bit builds of metis/parmetis is in how IDXTYPEWIDTH and REALTYPEWIDTH are defined in metis.h. They must be 32 or 64. The three parmetis modules in SEMS are labeled 32bit_parallel, 64bit_parallel, and parallel. A diff of metis.h in 32bit_parallel and 64bit_parallel gives:
33c33
< #define IDXTYPEWIDTH 32
---
> #define IDXTYPEWIDTH 64
43c43
< #define REALTYPEWIDTH 32
---
> #define REALTYPEWIDTH 64
as expected. There is no difference between the 64bit_parallel and parallel metis.h, suggesting that there are really only 2 versions of parmetis.
This is all for Darwin. I don't have access right now to the NFS SEMS mount to check the Linux versions.
Defining IDXTYPEWIDTH=64 and REALTYPEWIDTH=32 allows the previously mentioned MueLu tests to pass for me on Darwin.
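For anyone who wants to reproduce that locally, here is a minimal sketch of rebuilding METIS with 64-bit indices and 32-bit reals (the version number and install prefix are illustrative; on Darwin use sed -i '' or just edit include/metis.h by hand):
$ cd metis-5.1.0
$ sed -i -e 's/#define IDXTYPEWIDTH .*/#define IDXTYPEWIDTH 64/' \
         -e 's/#define REALTYPEWIDTH .*/#define REALTYPEWIDTH 32/' include/metis.h
$ make config prefix=$HOME/install/metis-idx64-real32
$ make install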
@bartlettroscoe both of the versions that explicitly state 32 or 64 bit work; that was the point. The SEMS version that doesn't have that decoration wasn't built correctly and had to be replaced. Technically the parmetis wasn't the problem; it was that the scotch and parmetis weren't built in a compatible way, so when linked together they caused problems for a small set of codes. However, to make it clear which parmetis and scotch to use together, we rebuilt both in a consistent manner.
@tjfulle you are correct that the differences are fairly minor and only indicate a difference in type sizes. As you can see, SEMS does have a version built with 32 bit reals. Can you try using that version to see if you can repeat your success with the MueLu tests?
@bmpersc, I can try out the 32 bit version, but I'm at a conference so it'll be a couple of days. Perhaps @bartlettroscoe could try it sooner?
@bartlettroscoe does the machine you are testing on have a version of libparmetis in /usr/local/lib?
I was just reminded today why people should not be testing and pushing directly from Mac OSX: the OSX file system is not case sensitive. For example, it has happened many times on many projects that someone used the wrong case for an include file name in a #include "file_name.hpp" directive or for a source file in a CMakeLists.txt file and it worked just fine on OSX. But after they pushed and someone on Linux pulled it, they got a broken build.
Therefore, for that reason alone, I would argue that no one should be testing and pushing directly from Mac OSX. What do other people think?
The OSX file system can be configured to be case-sensitive. That could be an added requirement for a machine to push. It is also relatively easy for automated tools to check whether the filesystem is case-insensitive.
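For instance, a crude check along these lines (run from the top of a clone; the scratch file name is arbitrary) would do it:
$ cd Trilinos
$ touch case_check_file && ls CASE_CHECK_FILE > /dev/null 2>&1 \
    && echo "filesystem is case-INsensitive" || echo "filesystem is case-sensitive"
$ rm -f case_check_file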
The OSX file system can be configured to be case-sensitive.
That breaks the SNL OSX machines. Ask @rppawlo
As I sit in a presentation room today at SC16 and look around me, I see that roughly 85% of the laptops people are using are Mac laptops. Granted, that isn't a scientific sampling, but it does show that Macs are prevalent in our community. Thus, I think we should not restrict developers from using Macs for their testing and development. I would want to know precisely how many build errors case insensitivity has caused in the last year (e.g., how severe a problem this is in practice) before accepting a "solution" that excludes such a commonly used platform.
Wouldn't encouraging developers to use every combination of platform/environment/compiler they can reasonably use increase code robustness? Sure, there might be some pushes to the develop branch from one platform that might temporarily break a build on another, but I imagine those would be fixed pretty quickly. The end result would be more confidence in the master branch (which should not have any "broken" pushes) because it would have been tested in more environments.
@bartlettroscoe wrote:
I was just reminded today why people should not be testing and pushing directly from Mac OSX. That is because the OSX file system is not case sensitive. For example, it has happened many times on many projects that someone used the wrong case for an include file name in a #include "file_name.hpp" directive or for a source file in a CMakeLists.txt file and it worked just fine on OSX. But after they pushed and someone on Linux pulled it, they got a broken build. Therefore, for that reason alone, I would argue that no one should be testing and pushing directly from Mac OSX. What do other people think?
On the other hand, testing on case insensitive file systems will weed out cases where developers write files with the same name but different case. I still see it as a long term win/win if more platforms/compilers/environments are used to develop/test/push.
I think we should not restrict developers from using Macs for their testing and development.
That is not what we are doing here at all. We are just saying, that if we are going to pick one build on one platform to best protect the productivity of all developers and all customers on all platforms, then the best build for doing that is an MPI build (with certain options set) on Linux. For example, case sensitivity errors are caught on Linux but not Mac. And most Trilinos customers are running on Linux not Mac. What we would like to do is to set things up so that if our CI build passed on Linux then there is a high degree of probability that it will also pass on OSX (perhaps with a standard set of compilers, MPI, TPLs, etc.). The major internal customer (that I can't name here) integration effort almost dictates that that one build is a GCC 4.7.2 build on Linux (again, if we are going to pick just one build on one platform for pre-push CI).
Of course we want to support developers on Macs as well. But if making sure that Mac developers don't pull code that is broken on their machine is a super high priority, then we need to consider the git.git workflow mentioned in option-3 in the above comment. That is how you do it. But we can start by adding some automated testing for Mac OSX. Not one such test currently exists (see above).
Wouldn't encouraging developers to use every combination of platform/environment/compiler they can reasonably use increase code robustness?
Yes, if used for other development, but not for the final testing and push. For the final test and push, you need to test everything impacted and ensure that you are not adding any new regressions to the official CI build. Because Trilinos tests and even builds of packages are often broken (because people are only testing on their platform before they push, or don't run the checkin-test.py script even on Linux), developers are less likely to enable and test downstream packages, which increases the chances of breaking them even more, and it goes downhill from there. That makes the code less robust, not more robust. To make the code more portable, we have a set of Nightly builds that tests that.
Sure, there might be some pushes to the develop branch from one platform that might temporarily break a build on another, but I imagine those would be fixed pretty quickly.
No, they can go on for months. For example, see #826. The idea that "people will fix it quickly" creates a nightmare that is described in:
(a must read for anyone interested in this topic).
I think the issues involved here are more involved than should be discussed in a GitHub issue. For those who are interested, I would encourage reading the document "Design Patterns for Git Workflows" that I have been working on, which discusses all of these issues in detail. See:
On the other hand, testing on case insensitive file systems will weed out cases where developers write files with the same name but different case. I still see it as a long term win/win if more platforms/compilers/environments are used to develop/test/push.
That is an argument for testing on both Linux and OSX before merging to the 'develop' branch. That takes us naturally to the git.git workflow mentioned in option-3 or a sophisticated PR-based testing infrastructure like that used for SST or MOOSE mentioned in option-4 in the above comment. Again, option-3 requires almost no infrastructure but is very labor intensive, while option-4 requires more infrastructure but is more push-button for developers (until they need to reproduce failures that they can't reproduce on their own machine).
Again, let's discuss this in more detail at the next Trilinos Leaders Meeting.
@bartlettroscoe : I am not sure I agree with your assessment of Option 4. Can we "borrow" some of the tools from SST? From my understanding of the SST presentation, it took six months to set this up. I can't understand why one would need new machines, but machines are cheap compared to the productivity improvement for every Trilinos developer. It is really hard to see why we would favor a manual process over Option 4.
I went back to GCC 4.7.2 for the Linux CI build on branch better-ci-build-482. I ran this CI build from scratch on 16 processors with:
$ ./checkin-test-sems.sh -j16 --enable-all-packages=on --local-do-all --wipe-clean
This produced the following email:
READY TO PUSH: Trilinos: crf450.srn.sandia.gov
Wed Nov 16 10:31:56 MST 2016
Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED => passed: passed=2286,notpassed=0 (81.12 min)
*** Commits for repo :
0) MPI_RELEASE_DEBUG_SHARED Results:
------------------------------------
passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=2286,notpassed=0
Wed Nov 16 10:31:56 MST 2016
Enabled Packages:
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos
Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED
CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_DEBUG_SYMBOLS=ON -DTrilinos_ENABLE_CI_TEST_MODE=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF -DTrilinos_ENABLE_TESTS=ON -DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16
Pull: Not Performed
Configure: Passed (2.45 min)
Build: Passed (67.11 min)
Test: Passed (11.57 min)
100% tests passed, 0 tests failed out of 2286
Label Time Summary:
Amesos = 18.82 sec (13 tests)
Amesos2 = 8.88 sec (7 tests)
Anasazi = 103.95 sec (71 tests)
AztecOO = 17.36 sec (17 tests)
Belos = 92.98 sec (61 tests)
Domi = 157.07 sec (125 tests)
Epetra = 50.73 sec (61 tests)
EpetraExt = 13.49 sec (10 tests)
FEI = 41.77 sec (43 tests)
Galeri = 4.58 sec (9 tests)
GlobiPack = 1.70 sec (6 tests)
Ifpack = 62.90 sec (53 tests)
Ifpack2 = 47.16 sec (32 tests)
Intrepid = 202.79 sec (152 tests)
Intrepid2 = 106.84 sec (107 tests)
Isorropia = 8.45 sec (6 tests)
Kokkos = 255.67 sec (21 tests)
ML = 47.49 sec (34 tests)
MueLu = 259.34 sec (54 tests)
NOX = 136.16 sec (100 tests)
OptiPack = 6.55 sec (5 tests)
Panzer = 266.06 sec (125 tests)
Phalanx = 3.63 sec (15 tests)
Pike = 3.35 sec (7 tests)
Piro = 25.29 sec (11 tests)
ROL = 679.04 sec (112 tests)
RTOp = 14.04 sec (24 tests)
Rythmos = 160.33 sec (83 tests)
SEACAS = 6.25 sec (8 tests)
STK = 12.77 sec (12 tests)
Sacado = 98.58 sec (290 tests)
Shards = 1.16 sec (4 tests)
ShyLU = 8.41 sec (5 tests)
Stokhos = 104.20 sec (74 tests)
Stratimikos = 30.69 sec (39 tests)
Teko = 201.45 sec (19 tests)
Teuchos = 54.40 sec (123 tests)
ThreadPool = 8.16 sec (10 tests)
Thyra = 65.90 sec (80 tests)
Tpetra = 127.45 sec (122 tests)
TrilinosCouplings = 57.11 sec (19 tests)
Triutils = 2.32 sec (2 tests)
Xpetra = 39.54 sec (16 tests)
Zoltan = 197.31 sec (16 tests)
Zoltan2 = 134.23 sec (91 tests)
Total Test time (real) = 694.08 sec
Total time for MPI_RELEASE_DEBUG_SHARED = 81.12 min
The detailed configure and ctest output is shown at:
I had to increase the timeout from 3 minutes to 5 minutes as a few Trilinos tests take a lot longer to run with GCC 4.7.2 than they did with GCC 5.3.0. The most expensive BASIC tests for GCC 4.7.2 are:
968/2286 Test #30: KokkosContainers_UnitTest_MPI_1 ....................................... Passed 166.69 sec
2267/2286 Test #2112: ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 .............. Passed 122.16 sec
1432/2286 Test #1403: Teko_testdriver_tpetra_MPI_1 ......................................... Passed 97.44 sec
835/2286 Test #537: Zoltan_hg_simple_zoltan_parallel ...................................... Passed 88.00 sec
2210/2286 Test #2107: ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 ..................... Passed 74.91 sec
745/2286 Test #533: Zoltan_ch_simple_zoltan_parallel ...................................... Passed 66.56 sec
1968/2286 Test #1833: MueLu_ParameterListInterpreterTpetra_MPI_1 ........................... Passed 57.31 sec
2162/2286 Test #2059: ROL_test_sol_solSROMGenerator_MPI_1 .................................. Passed 56.74 sec
2238/2286 Test #2118: ROL_example_PDE-OPT_topo-opt_elasticity_example_01_MPI_4 ............. Passed 56.15 sec
1815/2286 Test #1402: Teko_testdriver_tpetra_MPI_4 ......................................... Passed 54.86 sec
570/2286 Test #31: KokkosAlgorithms_UnitTest_MPI_1 ....................................... Passed 49.49 sec
2056/2286 Test #1915: Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 ...................... Passed 45.43 sec
526/2286 Test #11: KokkosCore_UnitTest_Serial_MPI_1 ...................................... Passed 35.56 sec
2144/2286 Test #2084: ROL_example_parabolic-control_example_03_MPI_1 ....................... Passed 30.96 sec
2007/2286 Test #1911: Rythmos_BackwardEuler_ConvergenceTest_MPI_1 .......................... Passed 29.24 sec
2284/2286 Test #534: Zoltan_ch_simple_parmetis_parallel .................................... Passed 27.67 sec
2133/2286 Test #2076: ROL_example_burgers-control_example_06_MPI_1 ......................... Passed 27.12 sec
1850/2286 Test #1831: MueLu_ParameterListInterpreterEpetra_MPI_1 ........................... Passed 26.98 sec
2233/2286 Test #2119: ROL_example_PDE-OPT_topo-opt_poisson_example_01_MPI_4 ................ Passed 25.34 sec
2279/2286 Test #2245: PanzerAdaptersSTK_MixedPoissonExample-ConvTest ....................... Passed 22.84 sec
1838/2286 Test #1824: MueLu_UnitTestsTpetra_MPI_1 .......................................... Passed 21.49 sec
1199/2286 Test #1475: Intrepid_test_Discretization_Basis_HGRAD_TRI_Cn_FEM_Test_02_MPI_1 .... Passed 21.42 sec
2107/2286 Test #2063: ROL_test_sol_checkSuperQuantileQuadrangle_MPI_1 ...................... Passed 21.06 sec
1292/2286 Test #1500: Intrepid_test_Discretization_Integration_Test_07_MPI_1 ............... Passed 20.28 sec
We may have to work on this a little to bring down the pre-push test time but we can do that later.
To show what a CI pre-push build might look like for your average developer, I simulated a change to a TpetraCore source file with:
$ touch packages/tpetra/core/src/Tpetra_CrsMatrix_def.hpp
and then the checkin-test.py script would trigger the enable of TpetraCore and everything downstream. I simulated this on the branch better-ci-build-482 with:
$ ./checkin-test-sems.sh --enable-packages=TpetraCore --local-do-all
This produced the following email:
READY TO PUSH: Trilinos: crf450.srn.sandia.gov
Wed Nov 16 16:27:18 MST 2016
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED => passed: passed=1408,notpassed=0 (16.10 min)
*** Commits for repo :
0) MPI_RELEASE_DEBUG_SHARED Results:
------------------------------------
passed: Trilinos/MPI_RELEASE_DEBUG_SHARED: passed=1408,notpassed=0
Wed Nov 16 16:27:18 MST 2016
Enabled Packages: TpetraCore
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos
Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED
CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_DEBUG_SYMBOLS=ON -DTrilinos_ENABLE_CI_TEST_MODE=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE=OFF -DTrilinos_ENABLE_TESTS=ON -DTeuchos_ENABLE_DEFAULT_STACKTRACE=OFF -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_TpetraCore:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16
Pull: Not Performed
Configure: Passed (2.01 min)
Build: Passed (5.27 min)
Test: Passed (8.82 min)
100% tests passed, 0 tests failed out of 1408
Label Time Summary:
Amesos = 17.90 sec (13 tests)
Amesos2 = 9.46 sec (7 tests)
Anasazi = 104.62 sec (71 tests)
Belos = 92.17 sec (61 tests)
Domi = 154.28 sec (125 tests)
FEI = 41.54 sec (43 tests)
Galeri = 3.91 sec (9 tests)
Ifpack = 58.59 sec (53 tests)
Ifpack2 = 45.53 sec (32 tests)
Isorropia = 7.78 sec (6 tests)
ML = 43.85 sec (34 tests)
MueLu = 270.47 sec (54 tests)
NOX = 136.52 sec (100 tests)
OptiPack = 6.06 sec (5 tests)
Panzer = 263.80 sec (125 tests)
Pike = 2.85 sec (7 tests)
Piro = 24.26 sec (11 tests)
ROL = 663.35 sec (112 tests)
Rythmos = 164.30 sec (83 tests)
ShyLU = 8.79 sec (5 tests)
Stokhos = 100.96 sec (74 tests)
Stratimikos = 31.45 sec (39 tests)
Teko = 201.27 sec (19 tests)
Thyra = 61.25 sec (80 tests)
Tpetra = 122.19 sec (122 tests)
TrilinosCouplings = 59.90 sec (19 tests)
Xpetra = 39.76 sec (16 tests)
Zoltan2 = 123.21 sec (91 tests)
Total Test time (real) = 529.17 sec
Total time for MPI_RELEASE_DEBUG_SHARED = 16.10 min
So a 16 minute CI build iteration is not too bad. That is pretty close to the XP 10 minute rule:
But even the occasional full from-scratch 82 minute CI build (see above) is not so bad. Pushing once a day (which is all most people should do), this is not terrible overhead, IMHO.
More tweaking to do and then I will submit an official PR to let people review all of the changes for this CI build.
I just realized that I have to do #362 to address the issue of float and complex testing. Otherwise, all of the other nightly builds for Trilinos may fail if some complex code fails. We need to decide if we want complex testing on or off by default for Nightly builds of Trilinos. Given the responses that I have gotten back from internal Trilinos customers so far (see #362), I think we should consider turning on testing for std::complex<double> as well. We need to see what impact this will have on the build time and runtime of the pre-push CI build. If that seems to be too high, at the very least, we need to run an additional post-push CI build that turns on std::complex<double> and enable std::complex<double> by default for all other CTest driver builds of Trilinos.
Since we have been discussing Mac OSX in this Issue, I will note that we have located a Mac OSX machine that has the SEMS Env mounted that we can use for a couple of Trilinos nightly builds. See:
Back to work getting this Linux CI build finished up ...
I tried adding the build and testing of std::complex<double> to the proposed PT CI build. The results using 16 cores on my machine crf450 gave:
PT CI Build without std::complex<double> (from scratch):
Configure: Passed (2.45 min)
Build: Passed (67.11 min)
Test: Passed (11.57 min)
100% tests passed, 0 tests failed out of 2286
PT CI Build with std::complex<double> (from scratch):
Configure: Passed (2.52 min)
Build: Passed (76.81 min)
Test: Passed (12.15 min)
100% tests passed, 0 tests failed out of 2333
As one can see, the overall increase in the build and test times was not very large. The build from scratch went up 13% from 67.11 min to 76.81 min and the tests only went up 5% from 11.57 min to 12.15 min. This seems like a small enough increase that we should consider enabling std::complex<double> in pre-push CI testing. But given that most customers don't use any complex types, we should disable complex types by default (see #362).
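For reference, the only difference between the two configurations above was the complex enable added on top of the standard invocation, something like the following (a sketch using the checkin-test --extra-cmake-options argument; the rest of the options match the cache variables listed earlier):
$ ./checkin-test-sems.sh -j16 --enable-all-packages=on --local-do-all \
    --extra-cmake-options="-DTrilinos_ENABLE_COMPLEX=ON"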
As of the commit:
0acaa65 "Merge branch 'cleanup-better-ci-build-482' into develop (#482, #362, #158, #831, #811, #826, #410)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Mon Nov 28 14:32:22 2016 -0700 (5 hours ago)
This has now been merged to the Trilinos 'develop' branch.
Now I have several other things to do to finish this up starting with getting a poor-man's post-push CI server running to protect this build.
I set up the cron job:
# ----------------- minute (0 - 59)
# | -------------- hour (0 - 23)
# | | ----------- day of month (1 - 31)
# | | | -------- month (1 - 12)
# | | | | ----- day of week (0 - 7) (Sunday=0 or 7)
# | | | | |
# * * * * * command to be executed
10 8 * * * cd /ascldap/users/rabartl/Trilinos.base/SEMSCIBuild && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
on my machine crf450. I manually started it just now and it is showing results at:
for the build Linux-GCC-4.7.2-MPI_RELEASE_DEBUG_SHARED_PT_CI.
I will watch this tomorrow to make sure that it is running okay.
Once this runs for a few days on crf450 in this simple cron job, then we can look at setting this up with a proper Jenkins job on another machine.
Now here is what is left to do:
Create PR for SEACAS repo for changes to seacas/cmake/Dependencies.cmake file ...
Create Sierra STK Trac ticket for changes to stk/cmake/Dependencies.cmake file and push branch to Sierra repo for this change ...
Send email to Brent P. (CC Kendal P., Greg S.) on how to handle changes to STK and SEACAS when syncing changes from Trilinos to Sierra STK and bringing changes to STK and SEACAS from Sierra back to Trilinos (i.e. just revert the two Trilinos commits on local branch) ...
Set up Trilinos GitHub wiki page for new checkin-test-sems.sh script with full instructions (and sadness about OSX) ... IN PROGRESS ...
Update Trilinos GitHub wiki page about the Trilinos testing strategy ...
Create new Trilinos GitHub issues for cleaning up a few Tpetra configure issues (see discussion on #362) ...
Create new GitHub issues for removing more expensive tests from pre-push CI testing ...
Create new GitHub issues for Trilinos failures on OSX with GCC 5.3.0 with exact reproducability instructions (for STK, for ROL, etc.). See https://github.com/trilinos/Trilinos/issues/811#issuecomment-261046793 ...
Create new GitHub issues for failed builds with -DTrilinos_ENABLE_FLOAT=ON ...
Create new GitHub issue for getting Teuchos, Sacado, and Anasazi to split ENABLE_COMPLEX into ENABLE_COMPLEX_FLOAT and ENABLE_COMPLEX_DOUBLE ...
Create new GitHub issue for creating a test matrix for Trilinos to make sure that different combinations of enables/disables are tested ...
Discuss the challenges of setting up a CI system using the SEMS Env on Linux and OSX (if you want to keep all builds and tests clean on both platforms, you can't just test and push on one machine) ...
The full CI build ran and completed as shown here:
But it had the problem that it also enabled ST code and therefore failed the configure of PyTrilinos and tried to enable TriKota (which of course failed because Dakota is not cloned under it).
I updated the driver script to accommodate this in 1b14dad.
The new crontab line is:
0 7 * * * cd /ascldap/users/rabartl/Trilinos.base/SEMSCIBuild && cp trilinos_ci_server.out trilinos_ci_server.last.out && ./Trilinos/cmake/ctest/drivers/sems_ci/trilinos_ci_sever.sh &> trilinos_ci_server.out
This will keep a copy of the trilinos_ci_server.out file from the last day and starts an hour earlier.
I will create a new story to improve this CI build to use Jenkins and/or CTest/CDash to drive and report results.
An incremental CI iteration just fired off after an updated Belos file was pulled. See the output at:
which shows the rebuild starting at Belos, which cuts out about half or more of the Trilinos packages.
We can also see the cost for the package-by-package build from scratch at:
for the Linux-GCC-4.7.2-MPI_RELEASE_DEBUG_SHARED_PT_CI build that fired off at 14:00 UTC (7:00 AM MST). The total accumulated times for the different steps are:
Also, I updated the page:
and it is ready to be reviewed. @jwillenbring and/or @bmpersc, can you please review that page?
To address the STK and SEACAS sync issues created by the direct pushes to the stk/cmake/Dependencies.cmake and seacas/cmake/Dependencies.cmake files, I created the SEACAS PR:
and the native STK Trac ticket #15930 for getting SEACAS and STK synced up with the various repos (one of which I can't mention here).
@bmpersc, if you have any questions about this, then let's converse in the comments of the STK Trac ticket #15930 (which you are CCed on).
An update on the new CI build that I have set up on my machine crf450 ...
Already there have been several incremental CI iterations today shown at:
where the ROL build was broken and then finally fixed. In the incremental iteration where ROL was fixed, it only took the times:
Compare that feedback time to a from-scratch build of all the packages, which took over 3h to build and 26m to run the tests. That is why the incremental CI builds enabled by TribitsCTestDriverCore.cmake are so important; they reduce feedback time.
Anyway, it looks like that automated post-push CI build that is supporting the pre-push CI build is working quite well. I will now update the page:
for the current status of the pre-push and post-push CI builds.
I finished updating the page:
@jwillenbring and @bmpersc, can you please have a look? Hopefully this explains the current testing strategy and the current status of Trilinos testing along with:
@bartlettroscoe, this is probably something you could fix in about 1 second before completing this task - the getConfigurationSearchPaths function in checkin-test.py has the following:
# Always look for the configuration file assuming the checkin-test.py script
# is run out of the standard snapshotted tribits directory
# <project-root>/cmake/tribits/.
result.append(os.path.join(thisFileRealAbsBasePath, '..', '..'))
But, checkin-test.py actually lives in <project-root>/cmake/tribits/ci_support, so it should be
result.append(os.path.join(thisFileRealAbsBasePath, '..', '..', '..'))
This doesn't usually cause a problem, since the function also appends the absolute directory of checkin-test.py, which is usually a symlink in the <project-root> directory. But, if checkin-test.py is symlinked elsewhere (say, in a build directory), it fails to find the configuration file.
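A quick way to see the failure mode (the directory names here are illustrative):
$ cd ~/some-build-dir                                  # any directory outside <project-root>
$ ln -s ~/Trilinos/cmake/tribits/ci_support/checkin-test.py .
$ ./checkin-test.py --enable-packages=Teuchos          # fails: the project configuration file is not found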
... checkin-test.py actually lives in <project-root>/cmake/tribits/ci_support, so it should be ...
@tjfulle, I see the issue. Surprised there is not a test to catch this. Here is how we need to fix this:
1) Create a TriBITS GitHub issue for this (since checkin-test.py is developed in TriBITS repo).
2) Create an automated test that exposes the problem and the desired behavior
3) Fix the code to make the test pass
No "fixing" code without adding tests first. Note the policy "All functionality will be covered by testing:". Any chance you have time to give this a try on a TriBITS repo branch?
I'll open up an issue at the TriBITS site. I can get to fixing it on Friday, if that's not too late
NOTE: The process demonstrated in #896 is what we need to do to keep the CI build clean. To help with this, we should create a github account for the trilinos-framework mail list (or a new more restricted mail list) that will be alerted to all CI failures so that we react to them quickly. This is what we do for the CASL VERA project. I know that is not a very nice job but it has to be done.
The good news is that as more people use the updated checkin-test-sems.sh script, the fewer CDash CI failure emails we will get. I would guess that if everyone used the checkin-test-sems.sh script from a RHEL 6 machine with SEMS, then we would only see a failure once every few weeks or less. (Note that failures can still occur due to violations of the additive test assumption of branches).
I wrote an initial draft for a wiki page describing how to do development on any machine you want and then use a remote RHEL 6 machine to do the final test and push:
@jwillenbring, can you please review this wiki page and make suggestions for improving it? (Also, just fix obvious typos if you find them and reword things that are unclear.)
I also updated the following wiki pages to link to this new page:
I suppose that this should work with a CEE LAN remote workstation, with minor changes? (the CEE LAN workstation being the remote)
I suppose that this should work with a CEE LAN remote workstation, with minor changes? (the CEE LAN workstation being the remote)
@dridzal, the remote pull, test, and push process should work exactly as described on a CEE LAN workstation. I will be setting up for remote pull, test, push once they get my new center-supported CEE LAN machine set up. I will write a simple remote SSH invocation script to automatically fire things off from my local machine. This will help free up my local machine for development. I will post my remote invocation script on that wiki page once it is complete to use as a template for others who want to copy it for themselves.
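As a rough idea, that remote invocation will probably amount to something like the following (the machine name, paths, and remote/branch names are illustrative, and this is not the final script):
$ git push my-staging-remote develop                   # hypothetical remote that the CEE machine pulls from
$ ssh -q ceerws1113 \
    "cd ~/Trilinos.base/CHECKIN && ./checkin-test-sems.sh --do-all --push" \
    2>&1 | tee remote-test-push.log                    # script assumed symlinked into the CHECKIN dir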
And I will start using it as soon as you post it --just let me know.
NOTE: The commit f9553e4b8fed5a17c173d139420112bfd71e3a51 that @lxmota just pushed:
commit f9553e4b8fed5a17c173d139420112bfd71e3a51
Merge: d61be65 288daad
Author: Alejandro Mota <amota@crf450.srn.sandia.gov>
Date: Thu Dec 8 18:14:57 2016 -0700
Merge branch 'develop' of algol.ca.sandia.gov:/home/amota/LCM/Trilinos into develop
Build/Test Cases Summary
Enabled Packages: MiniTensor
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=2,notpassed=0 (1.83 min)
1) MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX => Test case MPI_RELEASE_DEBUG_SHARED_PT_COMPLEX was not run! => Does not affect push readiness! (-1.00 min)
Other local commits for this build/test group: 288daad, 3587ec2, 5ff54f5, abe132a, ba9f4dd, ab78bc8, 03f1365, b120fc0, 4f8490d, 01d89d6, 4ed047e, 306c68b
shows that he was able to use the remote pull, test, and push process to push his changes from my RHEL 6 machine crf450. Seeing that his email address is wrong, I realize I need to add instructions for setting the git email address and user name.
I think this is some validation that this might be a workable approach.
@lxmota, can you comment on the difficulty involved in getting the remote pull, test, and push process set up and invoking it? What might we do to make this easier or more straightforward?
This story will be closed very shortly.
From: Bartlett, Roscoe A
Sent: Friday, December 09, 2016 4:09 PM
To: 'Trilinos Developers List'
Cc: 'Trilinos Framework List'
Subject: New Trilinos checkin-test policy: checkin-test-sems.sh
Hello Trilinos Developers,
For those Trilinos developers who have chosen to use the checkin-test.py script to safely push changes to Trilinos to try to improve the stability of Trilinos, there is a new preferred process. It involves the usage of a wrapper script checkin-test-sems.sh and its usage is outlined in:
https://github.com/trilinos/Trilinos/wiki/Policies-|-Safe-Checkin-Testing
(this is the link “Pre-push (Checkin) Testing” on the right on the Trilinos GitHub wiki https://github.com/trilinos/Trilinos/wiki.)
For those still using the raw Python checkin-test.py script for other purposes, you can still use the raw script as described in the note at the bottom of that wiki page.
For details on the motivation and status of this effort, see:
https://github.com/trilinos/Trilinos/wiki/Policies-|-Safe-Checkin-Testing
As always, the usage of the checkin-test.py script is optional.
However, I have been informed that rather than trying to get the checkin-test-sems.sh script working on other platforms (like Mac OSX) and allowing people to continue directly testing and pushing changes to Trilinos 'develop' themselves, a pure pull-request model like that implemented for the INL MOOSE and SNL SST projects will be pursued instead (but the timeframe on this and who will do this is not yet established). Therefore, no work on expanding the usage of the checkin-test-sems.sh approach will continue (and the GitHub issues that focus on that will be closed).
However, for those who still want to continue trying to improve the stability of Trilinos today and choose to use the checkin-test-sems.sh script, while not ideal, enough is in place now to effectively do that as described in the above wiki page (and the pages it links to such as the local development with remote pull, test, and push process). For those who want more details, you can contact me on the workflow that I am going to personally use to make my pushes safer and less difficult. For me, everything is in place now that I need to be productive. (That is, I will build every package and run every test affected by my local changes that is currently passing the CI build in my pre-push process. I will just check the CI build on CDash before I run so that I know what broken CI packages and tests that I need to disable when running the checkin-test-sems.sh script. Given that the CI build is almost constantly broken, one must always do this or their pushes will be stopped.)
Cheers,
-Ross
@lxmota, can you comment on the difficulty involved in getting the remote pull, test, and push process set up and invoking it? What might we do to make this easier or more straightforward?
It was pretty straightforward. Initially I forgot to set up git properly, so my email address was wrong. But that is stated in your instructions.
The only thing that I would add to the instructions is that for the SEMS environment to work, you need to load the sems-gcc/4.7.2 module.
@lxmota,
Thanks for the feedback.
It was pretty straightforward.
Good to hear :-)
Initially I forgot to set up git properly, so my email address was wrong. But that is stated in your instructions.
I added that step after I saw your commit with the wrong email address. That was my omission. About 90% of the setup is just what it takes to clone Trilinos and do development on any machine. The only new thing is adding the SSH key and setting the remote (one command).
The only thing that I would add to the instructions is that for the SEMS environment to work, you need to load the sems-gcc/4.7.2 module.
That module and the other modules are loaded automatically when running checkin-test-sems.sh. If you want to manually run do-configure (cmake), make, and ctest, you load the env using load_ci_sems_dev_env.sh as described in:
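Roughly, that manual flow looks like the following sketch (the location of load_ci_sems_dev_env.sh, the build directory layout, and the exact cache variables are placeholders; see the full list of CI cache variables earlier in this issue):
$ source $TRILINOS_SRC/cmake/load_ci_sems_dev_env.sh   # assumed location of the env load script
$ mkdir BUILD && cd BUILD
$ cmake -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON \
    -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_TpetraCore=ON -DTrilinos_ENABLE_TESTS=ON \
    $TRILINOS_SRC
$ make -j16 && ctest -j16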
Thanks,
-Ross
Given that we are no longer going to pursue extending the checkin-test-sems.sh (or checkin-test.py) script for Trilinos, and what is there now is sufficient for people like me to be productive pushing to Trilinos when the CI build is not badly broken (see https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179), I am putting this in review.
I will leave this in review until Jan. 1, 2017 for comments and then close.
FYI: The CI server for Trilinos has been transferred from my machine crf450 to my CEE LAN blade ceerws1113 (which is about 30% slower). For details, see:
@bartlettroscoe Thanks Ross.
In the end, I caved in and got a CEE account to test both Trilinos and Sierra, so if you wish you can close my account on your machine.
I have set up a new script Trilinos/cmake/std/sems/remote-pull-test-push.sh that can be used to invoke a remote pull/test/push process from a local development machine. I have used this several times now using my CEE LAN machine ceerws1113 as a remote pull/test/push server, and it is very nice. Now my machine crf450 is no longer loaded down with processes for testing, and I can use all of its cores to continue development.
I updated the documentation at:
No need to document a manual process for this anymore.
And I will start using it as soon as you post it --just let me know.
@dridzal, this is now posted at:
I have used this a few times and it seems to work pretty well.
If any ROL developer wants to try this, let me know. If they run into any stumbling blocks I can help them through it.
@bartlettroscoe Cool, I'll definitely give it a try when I need to change something in MiniTensor or ROL, which should be soonish.
Cool, I'll definitely give it a try when I need to change something in MiniTensor or ROL, which should be soonish.
@lxmota, great, let me know if you find any typos or other problems with this.
@bartlettroscoe I tried the local development/remote test & push script, but I have trouble with the module command.
When the Trilinos/cmake/std/sems/remote-pull-test-push.sh script is invoked and does ssh -q ..., it fails with the error module: command not found
On the other hand, if I load the definition of the module command in my .bashrc, for example by doing source /etc/profile.d/00-module-load.sh followed by module load sems-env, it fails with ModuleCmd_Load.c(208):ERROR:105: Unable to locate a modulefile for 'sems-env'
So it seems that something is wrong with my module configuration. Is there something SEMS specific that I'm missing?
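(For reference, what I suspect is missing is something like the following in my shell startup files; the modulefiles path is a guess for wherever the SEMS NFS mount actually puts its modulefiles.)
# In ~/.bashrc (which also runs for non-interactive ssh commands, unlike ~/.bash_profile):
source /etc/profile.d/00-module-load.sh                   # defines the 'module' shell function
export MODULEPATH=$MODULEPATH:/projects/sems/modulefiles  # assumed SEMS modulefiles location
module load sems-env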
Next Action Status:
New CI build is pushed to 'develop', the new post-push CI server is running, and the new checkin-test-sems.sh script is ready for more testing and review ... Not going to pursue other extensions (e.g. Mac OSX, tcsh, etc.). See https://github.com/trilinos/Trilinos/issues/482#issuecomment-266124179. Next: Leave in review until 1/1/2017, then close.
Blocked By: #158, #410, #362
Blocking: #380
Related To: #370, #475, #476
CC: @trilinos/framework
Description:
Trilinos has not had an effective pre-push CI development process for many years. When the checkin-test.py script was first created (back in 2008 or so), the primary stack of packages was based on Epetra and the main external dependencies were C/C++/Fortran compilers and BLAS and LAPACK. Those dependencies and the major Trilinos customers at the time were used to select the initial set of Primary Tested (initially called Primary Stable) packages that is being used to this day. However, since that time, many new Trilinos packages have been added and important Trilinos customers are relying on many of these newer packages (e.g. SEACAS, STK, Tpetra, Phalanx, Panzer, etc.). In addition, these new Trilinos packages require more dependencies than just BLAS and LAPACK and now TPLs like Boost, HDF5, NetCDF, ParMETIS, SuperLU and others used by Trilinos are also very important to many Trilinos customers.
Another problem with the current pre-push CI testing process for Trilinos is that Trilinos developers have a variety of different types of machines, OSs, versions of compilers, TPL implementations, etc. that they use to develop on and push changes to Trilinos. This has resulted in people who tried to use the checkin-test.py script suffering failed pushes due to tests failing on their machine that were not triggered by their changes. In contrast, projects that have a uniform pre-push CI testing env don't experience these types of problems. One example of such a project is CASL VERA, which uses TriBITS and the checkin-test.py script and has a set of uniform development machines where developers almost never see tests that fail in their build of the code that passed in another developer's build. Therefore, the only failed builds and tests are due to their own local changes. In that project, there is no trepidation about running the checkin-test.py script and everyone uses it uniformly for nearly every push.
Another problem with the current CI testing process for Trilinos is that the post-push CI server that posts to CDash enables a different set of packages and TPLs from what the pre-push CI build does (and of course uses different compilers, MPI, etc.). Therefore, a CI build/test failure seen on CDash may not be seen with the checkin-test.py script locally and vice versa. This makes it difficult for developers to determine whether the failures they are seeing on their own machine are due to their local changes or due to differences between the env on their machine and the machine running the CI build posting to CDash, whether that is a different set of enabled packages and TPLs or something else.
As a result, the stability of the main Trilinos development branch (now the 'develop' branch, see #370) has degraded from what it was 5+ years ago. This is a problem because Trilinos needs to have a more stable 'develop' branch in order to more frequently update from the 'develop' branch to the 'master' branch (see #370).
This story is to address all of these shortcomings of the current Trilinos CI testing process. The new SEMS Dev Env (#158) provides an opportunity to create a fairly portable (at least for SNL staff members) uniform pre-push and post-push CI testing environment for the first time.
Here is the plan for setting up a more effective CI process based on the SEMS Dev Env, the checkin-test.py script, and CTest/CDash:
Create a load_ci_sems_dev_env.sh script, which just calls the local_sems_dev_env.sh script with the selections.
Create a checkin-test-sems.sh script that uses load_ci_sems_dev_env.sh. This should likely only run a single build of Trilinos to speed up the testing/push process. (If there is a single build, it would likely include -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_FLOAT=OFF -DTrilinos_ENABLE_COMPLEX=OFF. See #362 about turning off float and complex by default.)
After this Story is complete, we can create new Stories to get Trilinos developers to use the checkin-test-sems.sh script and to commit to keeping the CI build(s) 100% passing all the time with "Stop the Line" urgency to fix.
Definition of Done:
The scripts load_ci_sems_dev_env.sh and checkin-test-sems.sh exist and provide a viable CI build based on the SEMS Dev Env.
Documentation for load_ci_sems_dev_env.sh and checkin-test-sems.sh has been written and has been reviewed by a few Trilinos developers.
A post-push CI build is running that uses the load_ci_sems_dev_env.sh env and the same default build(s) as defined in the checkin-test-sems.sh script.
Feedback has been collected on the checkin-test.py script itself to determine what improvements might help with usability and adoption.
Decisions that need to be made:
Tasks:
Create load_ci_sems_dev_env.sh and checkin-test-sems.sh [Done]
Add "--default-builds for the checkin-test.py and therefore the checkin-test-sems.sh script" [Done]
Work the new CI build on the branch better-ci-build-482 ... IN PROGRESS ...
Test the new checkin-test-sems.sh [Done]
Run checkin-test-sems.sh --local-do-all [Done]