Closed mhoemmen closed 5 years ago
@mhoemmen - for POWER we may need to specifically use IBM's Advance Toolchain which provides an optimized and patched GCC 7.2. I agree that early 7.2 testing would be useful.
@micahahoward specifically mentioned IBM's Toolchain. He has been testing with stock GCC 7.2, no CUDA, just to make sure that stuff builds.
@mhoemmen, I think that SEMS would need to add a GCC 7.2 build for the SEMS dev env before this could happen. I will note this specific request in:
The SEMS issue to add a GCC 7.2 build of the env is:
Thanks @bartlettroscoe ! :-D I'll let @micahahoward know.
@rrdrake and @prwolfe may also be interested.
@mhoemmen, specifically what machine and env is this customer trying to build Trilinos with GCC 7.2?
@mhoemmen, how urgent? Would the GCC 7.2 modules on the CEE LAN work for setting up this testing with the ASC IC Jenkins build farm? If not, then SEMS would need to install a GCC 7.2 stack.
@bartlettroscoe SPARC is the customer; ATS-2 the target machine. I'll let @micahahoward answer questions about urgency. Thanks!
So we've come across a bug in gcc 5 and 6 that breaks our use of hierarchic parallelism in kokkos. Having a sems 7.2 module would be valuable to EMPIRE/Drekar/Charon2/Panzer teams as we migrate towards hierarchic parallelism. The workaround is a very ugly hack in the source code. No pressure on a timeline as we will use the hack for now.
@mhoemmen , @nmhamster , @bartlettroscoe : Re urgency: This is forward looking and not that urgent. It's driven by wanting to be ready for the IBM Advance Toolchain when it's ready to go on ATS-2-like testbeds. Having Trilinos tested with GCC 7.2 by early Jan would be good, which means starting to do that now is about right.
FWIW, I was able to build our configuration of Trilinos with 7.2 and get all of our application code tests to pass without any significant issues. Having some assurance that this remains true is what I'd like to have.
@micahahoward, @mhoemmen,
Having Trilinos tested with GCC 7.2 by early Jan would be good, which means starting to do that now is about right.
Okay then. I will put in an official SEMS request to set up a GCC 7.2 env and put a place-holder issue for this in
I will request that this env be available by early Jan.
@bartlettroscoe @mhoemmen @micahahoward - The Blake (Sky Lake) test bed has a very preliminary GCC 7.2.0 environment for Trilinos builds. Note that #2041 is filed for a Teuchos bug I found this weekend.
And @bartlettroscoe - I think there is a TriBITS bug (OpenMP Detection for Fortran when using GCC 7.2.0 Fails #244, https://github.com/TriBITSPub/TriBITS/issues/244) I have raised as well but this may not be connected to GCC 7.2.0.
@nmhamster git blame claims that @hkthorn fixed the missing #include <vector> on 29 Nov.
I ran a set of tests from the GCC 7.2.0 build on Blake overnight. The following failed:
Label Time Summary:
Amesos = 4.87 sec (9 tests)
Amesos2 = 6.97 sec (9 tests)
AztecOO = 6.29 sec (11 tests)
Belos = 92.86 sec (64 tests)
Epetra = 24.80 sec (52 tests)
EpetraExt = 6.41 sec (10 tests)
Ifpack = 23.98 sec (41 tests)
Ifpack2 = 40.68 sec (35 tests)
Kokkos = 47.64 sec (23 tests)
ML = 9.81 sec (16 tests)
MueLu = 598.66 sec (80 tests)
SEACAS = 9.34 sec (7 tests)
STK = 1.41 sec (2 tests)
Shards = 0.18 sec (1 test)
Teuchos = 71.16 sec (126 tests)
Tpetra = 200.37 sec (135 tests)
Zoltan = 473.94 sec (26 tests)
Zoltan2 = 67.39 sec (94 tests)
Total Test time (real) = 1686.72 sec
The following tests FAILED:
137 - TeuchosNumerics_LAPACK_test_MPI_1 (Not Run)
646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)
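For triaging summaries like the one above, a short script can pull the failing test numbers, names, and statuses out of saved ctest output. This is only a minimal sketch and not part of the Trilinos tooling: the function name parse_failed_tests and the regular expression are illustrative assumptions based on the summary layout shown above.

```python
import re

# Matches summary lines such as "646 - Some_Test_Name (Failed)".
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\(([^)]+)\)\s*$")

def parse_failed_tests(ctest_output):
    """Return (number, name, status) tuples from a ctest FAILED summary."""
    failed = []
    in_summary = False
    for line in ctest_output.splitlines():
        if "The following tests FAILED:" in line:
            in_summary = True
            continue
        if in_summary:
            match = FAILED_LINE.match(line)
            if match is None:
                # The first line that is not a test entry ends the summary.
                in_summary = False
                continue
            failed.append((int(match.group(1)), match.group(2), match.group(3)))
    return failed

# Sample text copied from the summary above.
sample = """Total Test time (real) = 1686.72 sec
The following tests FAILED:
137 - TeuchosNumerics_LAPACK_test_MPI_1 (Not Run)
646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)
Errors while running CTest
"""

for number, name, status in parse_failed_tests(sample):
    print(number, name, status)
```

Separating "Not Run" from "Failed" this way makes it easier to tell build or link problems (such as the Teuchos one) apart from genuine runtime failures.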
Verbose output from the failing tests is below. We know the Teuchos numerics break because of issue #2041.
ctest -R Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 -V
UpdateCTestConfiguration from :/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/home/projects/x86-64/cmake/3.9.0/bin/cmake
UpdateCTestConfiguration from :/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 646
Start 646: Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4
646: Test command: /ascldap/users/projects/x86-64-skylake/openmpi/2.1.2/gcc/7.2.0/bin/mpiexec "-np" "4" "--map-by" "numa:PE=4" "/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/packages/ifpack2/test/unit_tests/Ifpack2_BlockTriDiContainerUnitAndPerfTests.exe"
646: Test timeout computed to be: 1500
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646: In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646: For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646: For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646: In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646: For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646: For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646: In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646: For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646: For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646: In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646: For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646: For unit testing set OMP_PROC_BIND=false
646: <I> nranks 4 ni 10 nj 10 nk 10 bs 5 nrhs 1 isplit 4 jsplit 1 nthreads 8
646:
646: ***
646: *** Unit test suite ...
646: ***
646:
646:
646: Sorting tests by group name then by the order they were added ... (time = 5.95e-06)
646:
646: Running unit tests ...
646:
646: [node01:36640] *** Process received signal ***
646: [node01:36638] *** Process received signal ***
646: [node01:36639] *** Process received signal ***
646: --------------------------------------------------------------------------
646: mpiexec noticed that process rank 1 with PID 36638 on node node01 exited on signal 11 (Segmentation fault).
646: --------------------------------------------------------------------------
1/1 Test #646: Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 ...***Failed Required regular expression not found.Regex=[End Result: TEST PASSED
] 3.52 sec
0% tests passed, 1 tests failed out of 1
Label Time Summary:
Ifpack2 = 3.52 sec (1 test)
Total Test time (real) = 4.20 sec
The following tests FAILED:
646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
Errors while running CTest
And:
ctest -R Ifpack2_BlockRelaxationPerformance_MPI_1 -V
UpdateCTestConfiguration from :/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/home/projects/x86-64/cmake/3.9.0/bin/cmake
UpdateCTestConfiguration from :/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 647
Start 647: Ifpack2_BlockRelaxationPerformance_MPI_1
647: Test command: /ascldap/users/projects/x86-64-skylake/openmpi/2.1.2/gcc/7.2.0/bin/mpiexec "-np" "1" "--map-by" "numa:PE=4" "/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/packages/ifpack2/test/unit_tests/Ifpack2_BlockRelaxationPerformance.exe"
647: Test timeout computed to be: 1500
647: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name node01.blake.sandia.gov and rank 0!
647:
647: ***
647: *** Unit test suite ...
647: ***
647:
647:
647: Sorting tests by group name then by the order they were added ... (time = 2.97e-06)
647:
647: Running unit tests ...
647:
647: 0. Ifpack2BlockRelaxation_Performance_UnitTest ... Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
647: In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
647: For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
647: For unit testing set OMP_PROC_BIND=false
647:
647: Testing Ifpack2 block G-S
647: Testing block size: 2
647: (24 trials)
647: Testing block size: 4
647: (44 trials)
647: Testing block size: 5
647: (56 trials)
647: Testing block size: 8
647: (82 trials)
647: Testing block size: 10
647: (101 trials)
647: Testing block size: 20
647:
647: p=0: *** Caught standard std::exception of type 'std::runtime_error' :
647:
647: /home/sdhammo/git/trilinos-github-repo/packages/ifpack2/src/Ifpack2_TriDiContainer_def.hpp:259:
647:
647: Throw number = 1
647:
647: Throw test that evaluated to true: INFO > 0
647:
647: Ifpack2::TriDiContainer::factor: LAPACK's _GTTRF (LU factorization with partial pivoting) reports that the computed U factor is exactly singular. U(1,1) (one-based index i) is exactly zero. This probably means that the input matrix has a singular diagonal block.
647: [FAILED] (3.97 sec) Ifpack2BlockRelaxation_Performance_UnitTest
647: Location: /home/sdhammo/git/trilinos-github-repo/packages/ifpack2/test/unit_tests/Ifpack2_UnitTestBlockRelaxationPerf.cpp:107
647:
647:
647: The following tests FAILED:
647: 0. Ifpack2BlockRelaxation_Performance_UnitTest ...
647:
647: Total Time: 3.97 sec
647:
647: Summary: total = 1, run = 1, passed = 0, failed = 1
647:
647: End Result: TEST FAILED
647: -------------------------------------------------------
647: Primary job terminated normally, but 1 process returned
647: a non-zero exit code.. Per user-direction, the job has been aborted.
647: -------------------------------------------------------
647: --------------------------------------------------------------------------
647: mpiexec detected that one or more processes exited with non-zero status, thus causing
647: the job to be terminated. The first process to do so was:
647:
647: Process name: [[37086,1],0]
647: Exit code: 1
647: --------------------------------------------------------------------------
1/1 Test #647: Ifpack2_BlockRelaxationPerformance_MPI_1 ...***Failed 6.71 sec
0% tests passed, 1 tests failed out of 1
Label Time Summary:
Ifpack2 = 6.71 sec (1 test)
Total Test time (real) = 7.19 sec
The following tests FAILED:
647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)
Errors while running CTest
@ambrad Huh, is that your Ifpack2 code? I can help.
For Sky Lake we will also need to address issue #2050.
FYI: I created the targeted issue:
to make sure we target this in the CDOFA process.
A GCC 7.2.0 build env was installed by @fryeguy52 on the SEMS NFS mount under the atdm project area. I set up a simple shell script Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh in the branch atdm-gcc-7.2.0-2028 pushed to the GitHub repo git@github.com:bartlettroscoe/Trilinos.git. I tried the full build of all of the PT packages. The configure initially failed due to what seems to be a defect in FIND_LIBRARY() in CMake 3.10.0 (see details below). But with CMake 3.5.2, the configure passed. With this initial env, the build failed with a build failure in the EpetraExt HDF5 adapters. This build was posted to CDash at:
This one build failure resulted in a lot of test link failures so a lot of tests never even ran. Since the ATDM build of Trilinos turns off HDF5 support in EpetraExt, I will disable it and try again.
@bartlettroscoe Awesome!!! Thanks so much for setting this up!!! :-D
FYI: I created the issue https://gitlab.kitware.com/snl/project-1/issues/47 to see if Kitware can look into the apparent defect with FIND_LIBRARY() with CMake 3.10 and get this fixed. Otherwise, we will need to explicitly list the full set of TPL libraries for SCOTCH and perhaps other TPLs as well to get around this. Without one of these fixes, we will not be able to use the all-at-once configure, build, and test and submit nicely to the new CDash site.
I reran the full Trilinos CI PT build again with the GCC 7.2.0 env with -DEpetraExt_ENABLE_HDF5=OFF and the build passed and all of the tests passed as shown at:
However, it produces a ton of link warnings that say:
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
I know that the SEMS policy is not to build and install BLAS and LAPACK but it looks like that might be necessary with newer versions of GCC.
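To see which libraries are pulling in the older libgfortran, the warnings in a saved build log can be grouped by the library that needs it. The following is a minimal sketch, assuming the log contains GNU ld warnings in exactly the form quoted above; the function name summarize_conflicts and the sample log are hypothetical.

```python
import re
from collections import Counter

# Layout of the GNU ld warning quoted above:
#   ld: warning: <old lib>, needed by <consumer>, may conflict with <new lib>
CONFLICT = re.compile(
    r"ld: warning: (?P<old>\S+), needed by (?P<consumer>\S+), "
    r"may conflict with (?P<new>\S+)"
)

def summarize_conflicts(log_text):
    """Count conflict warnings per (consumer, old lib, new lib) triple."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONFLICT.search(line)
        if match:
            key = (match.group("consumer"), match.group("old"), match.group("new"))
            counts[key] += 1
    return counts

# Hypothetical build log excerpt that repeats the warning shown above.
sample_log = """[ 10%] Linking CXX executable Example_A.exe
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
[ 11%] Linking CXX executable Example_B.exe
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
"""

for (consumer, old, new), count in summarize_conflicts(sample_log).items():
    print(f"{consumer}: needs {old}, conflicts with {new} ({count} warnings)")
```

If more than one consumer shows up in the summary, that points at additional TPLs built against the system BLAS and LAPACK.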
@fryeguy52,
Can you please look into adding the install of BLAS/LAPACK to this ATDM SEMS GCC 7.2.0 env? I think that would eliminate these link warnings. In any case, you should have the exact reproducibility instructions.
I went ahead and moved the script and pushed in the commit 259322e9a92b9ad8019f9e3f1c945ed51a487684:
commit 259322e9a92b9ad8019f9e3f1c945ed51a487684
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Thu Jan 4 14:02:43 2018 -0700
Add env script for SEMS ATDM GCC 7.2 env (#2028)
This is a drop-in replacement for load_sems_dev_env.sh to be able to configure
and build Trilinos, but with a custom set of modules. Just soruce this and
then you can use the SEMSEnv.cmake module.
NOTE: Had to switch from CMake 3.10.0 to CMake 3.5.1 to avoid FIND_LIBRARY()
defect (#2028).
We will need to get Kitware to fix that defect otherwise will not be able to
use the all-at-once configure, build, test, and submit to CDash :-(
Or, we can just hack the SEMSEnv.cmake file to avoid the find like in commit
0960c822a186216ef394fee5b8e9efce50d7585c.
A cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
@fryeguy52,
It just occurred to me that an easy way to install BLAS and LAPACK is to use SPACK:
I just did this with:
$ cd $HOME/SPACK.base/
$ git clone git@github.com:spack/spack.git
$ cd spack/bin
$ ./spack install lapack -j16
==> Installing openblas
==> Using cached archive: /home/rabartl/SPACK.base/spack/var/spack/cache/openblas/openblas-0.2.20.tar.gz
==> Staging archive: /home/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb/v0.2.20.tar.gz
==> Created stage in /home/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb
==> Applied patch make.patch
==> Building openblas [MakefilePackage]
==> Executing phase: 'edit'
==> Executing phase: 'build'
==> Executing phase: 'install'
==> Successfully installed openblas
Fetch: 0.02s. Build: 8m 34.39s. Total: 8m 34.41s.
[+] /home/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-4.8.3/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb
I need to do this with the ATDM SEMS GCC 7.2.0 module loaded first and then test that it works with our build of Trilinos but my guess is that will work.
At the very least you can look at what SPACK does to install BLAS and LAPACK and then duplicate this in the SEMS TPL installer infrastructure. Hopefully that is not too hard.
I created an official SEMS request to install BLAS and LAPACK for GCC 7.2.0 in:
But we might just look into calling SPACK? But SPACK has its own ideas of a directory structure that may not be compatible with the SEMS way of installing these so it might not be as easy as just calling SPACK.
@fryeguy52,
I installed OpenBLAS using SPACK for GCC 7.2.0 on my local machine as described below and it seems to work without generating any link warnings about libgfortran.so version incompatibilities.
Can you just run SPACK inside of the NFS mounted drive under the atdm project area as I demonstrated below (see details) and then repeat the test build I show below? Let me know how that goes. If it works, then I will update the script Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh and SEMSDevEnv.cmake to use that version of BLAS and LAPACK. Then we can get a nightly build going for Trilinos with this version.
I will go ahead and do a full nightly build of Trilinos and submit to CDash.
Interestingly, when I updated Trilinos and did the full build again, I am already getting a build error in Kokkos with GCC 7.2.0 as shown in:
where the build error shows:
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp: In function ‘int main(int, char**)’:
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: error: no match for ‘operator<<’ (operand types are ‘std::basic_ostream<char>::__ostream_type {aka std::basic_ostream<char>}’ and ‘std::ostringstream {aka std::__cxx11::basic_ostringstream<char>}’)
cout << "FAILED:" << endl
~~~~~~~~~~~~~~~~~~~~~~~~~
<< " Expected output:" << endl
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<< expectedOutput << endl
~~~~~~~~~~~~~~~~~~~~~~~~~
<< " Actual output:" << endl
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<< hookOutput << endl;
^~~~~~~~~~~~~
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: note: candidate: operator<<(int, int) <built-in>
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: note: no known conversion for argument 2 from ‘std::ostringstream {aka std::__cxx11::basic_ostringstream<char>}’ to ‘int’
In file included from /projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/base/include/c++/7.2.0/iostream:39:0,
from /scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:46:
/projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/base/include/c++/7.2.0/ostream:108:7: note: candidate: std::basic_ostream<_CharT, _Traits>::__ostream_type& std::basic_ostream<_CharT, _Traits>::operator<<(std::basic_ostream<_CharT, _Traits>::__ostream_type& (*)(std::basic_ostream<_CharT, _Traits>::__ostream_type&)) [with _CharT = char; _Traits = std::char_traits<char>; std::basic_ostream<_CharT, _Traits>::__ostream_type = std::basic_ostream<char>]
operator<<(__ostream_type& (*__pf)(__ostream_type&))
[...]
It looks like the breaking commit is:
0607bcd "Kokkos: Add Kokkos::push_finalize_hook function & tests (#2129)"
Author: Mark Hoemmen <mhoemmen@users.noreply.github.com>
Date: Fri Jan 5 10:39:30 2018 -0700 (5 days ago)
M packages/kokkos/core/src/Kokkos_Core.hpp
M packages/kokkos/core/src/impl/Kokkos_Core.cpp
M packages/kokkos/core/unit_test/CMakeLists.txt
M packages/kokkos/core/unit_test/Makefile
A packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp
A packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook_terminate.cpp
I am backing up to the version of Trilinos before this that worked (shown above), which was rooted at commit:
38a2158 "Shylu/Tacho - hand made team blas for gpu"
Author: Kyungjoo Kim <kyukim@sandia.gov>
Date: Wed Jan 3 14:46:51 2018 -0700 (6 days ago)
A packages/shylu/shylu_node/tacho/src/TachoExp_Blas_Team.hpp
M packages/shylu/shylu_node/tacho/src/TachoExp_Util.hpp
M packages/shylu/shylu_node/tacho/unit-test/Tacho_Test.hpp
A packages/shylu/shylu_node/tacho/unit-test/Tacho_TestDenseLinearAlgebra.hpp
M packages/shylu/shylu_node/tacho/unit-test/Tacho_TestOpenMP_double.cpp
M packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_dcomplex.cpp
M packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp
and posting to CDash at:
I tested this and it passed the Kokkos build so this should build and link just fine.
This is a good example of why we need this automated build and simple instructions that any SNL Trilinos developer can use to reproduce build problems.
@bartlettroscoe That test passed perfectly fine on other platforms, but yes, I second the need for automated testing. Did you actually revert the commit or just disable the failing test?
I ran the full build and test of Trilinos with GCC 7.2.0 with the SPACK-built OpenBLAS BLAS and LAPACK and -DEpetraExt_ENABLE_HDF5=OFF which submitted to:
This built and passed all of the tests (see details below) but we are still seeing link warnings:
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
I believe the reason for this is that some of the other existing ATDM SEMS TPLs, such as SuperLU and perhaps others, are likely built against the default system BLAS and LAPACK instead of the new SPACK built and installed OpenBLAS BLAS and LAPACK implementations built with GCC 7.2.0. This means that to get rid of all of these link warnings we would need to rebuild all of the downstream TPLs that depend on BLAS and/or LAPACK.
So the question is, should we work to fix these link warnings right now or just get this GCC 7.2.0 build up and going to start protecting Trilinos with a basic GCC 7.2.0 build? I am thinking that, given that the Trilinos build for GCC 7.2.0 was just broken by an update to Trilinos as shown above, it makes sense to just get this build running for now with the link warnings. Then later we can rebuild the TPLs with BLAS and LAPACK and eliminate these link warnings.
P.S. The other thing is that we really need to be using CMake 3.10.0 so that we can use the all-at-once configure, build, test, and submit but partition the output on the new CDash site. That would make it much more readable.
@bartlettroscoe wrote:
... [S]hould we work to fix these link warnings right now or just get this GCC 7.2.0 build up and going to start protecting Trilinos with a basic GCC 7.2.0 build?
The latter, please :-). Thanks Ross!
Current feedback from SEMS is that they will need to take the issue of officially supporting builds of BLAS and LAPACK to the SEMS Stewards. Therefore, I think we should move ahead and just get this GCC 7.2.0 build going up to the Specialized track on the CDash site so we can get it cleaned up again (and then move it to Nightly?).
Longer term, we need to rebuild the TPLs from source against BLAS and LAPACK built with GCC 7.2.0. One option is to just use SPACK to build everything from GCC 7.2.0 on up for the TPLs (including BLAS and LAPACK) that we need and bypass the SEMS TPL installation process. We could put this under the ATDM project area on the mounted SEMS NFS drive. I think SPACK supports modules so that might be an easy solution and would have the added benefit that people could build these envs on non-SNL machines.
I rebased the branch atdm-gcc-7.2.0-2028 on top of develop, added a new *.cmake file for the special configuration options for this build and disabled the build and run of the failing KokkosCore_UnitTest_PushFinalizeHook test and pushed to the remote:
To github.com:bartlettroscoe/Trilinos.git
+ c677baf...46cf207 atdm-gcc-7.2.0-2028 -> atdm-gcc-7.2.0-2028 (forced update)
I then tested the configure, build, and test of Kokkos, Teuchos, and EpetraExt and posted to CDash with make dashboard to:
This is now ready to use to build a CTest -S driver script then run it with Jenkins. That should be easy.
I created the basic driver scripts:
and created a new *.cmake file for extra configuration options:
I ran the script drive_linux_mpi_sems_atdm_7.2.0.sh locally (as shown in details below) and it submitted to:
Interestingly, there are three new failing Tempus tests that were not there before. I will create a new GitHub issue for that in a bit.
@fryeguy52 set up the Jenkins job to drive this build on the SEMS SRN Build Farm. However, it resulted in all failed configures shown here:
The configure failures said that it was missing the file:
-- Reading in configuration options from cmake/std/sems/atdm/SEMSATDMSettings.cmake ...
CMake Error at /jenkins/slave/workspace/Trilinos_gcc-7-2-0_atdm/Trilinos/cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:167 (INCLUDE):
INCLUDE could not find load file:
cmake/std/sems/atdm/SEMSATDMSettings.cmake
Turns out the problem was that for some reason the inner driver cloned the "nightly" git repo software.sandia.gov:/space/git/nightly/Trilinos instead of the GitHub repo. The develop branch in that repo was not up-to-date yet, so the new commits did not exist there.
I switched the Jenkins job Trilinos_gcc-7-2-0_atdm to also clone the "nightly" git repo for Trilinos so the outer and inner Trilinos repos always match up. Therefore, the build should run tonight and submit to CDash.
Looks like, as predicted above, the Jenkins job Trilinos_gcc-7-2-0_atdm ran just fine and posted correct output to CDash at:
This showed just two failing Tempus tests this time. (I will create a GitHub issue for those.)
The only problem is that I misspelled "ATDM" as "ADTM". I just fixed that with the commit 2ac807d so it should be spelled right tomorrow.
I will also create follow-on issues to get the TPLs rebuilt with BLAS and LAPACK (to eliminate link warnings with libgfortran) and then build newer versions of the TPLs that were requested by the ATDM APPs in https://software-srn.sandia.gov/jira/browse/CDOFA-24. Then I will descope this Issue and put it in review.
Several Tempus tests are timing out as shown at:
I am wondering why this is occurring. How does Jenkins know how many cores are used in each build? Is that the Jenkins "Job Weight" parameter? If so, it is currently set to "6" for the Trilinos_gcc-7-2-0_atdm job. But I know that this build takes 10 cores in the build and running the tests (see the file Trilinos/cmake/ctest/drivers/atdm/ctest_linux_mpi_sems_atdm_7.2.0.cmake). Therefore, I will change "Job Weight" to "10". Looking at the chart:
it looks like the Jenkins slave machine gretel was getting fully loaded (20 cores) from about 21:49 to about 04:03. This must be the cause of the timeouts of the Tempus tests.
@fryeguy52 and @trilinos/framework,
How can we make sure that all of the Jenkins jobs that are running on these machines have the correct "Job Weight" property so that Jenkins does not overload its slave machines? The "Job Weight" value was set incorrectly for our job so how can we determine if it is being set correctly for other jobs as well?
Looks like the machine hansel is getting overloaded for periods of time as well:
I sent the following email to see if we can investigate this issue more and see what can be done.
Otherwise, I will just increase the default timeout from 300s to 600s and see if it makes the timeouts go away.
From: trilinos-framework-bounces@software.sandia.gov [mailto:trilinos-framework-bounces@software.sandia.gov] On Behalf Of Bartlett, Roscoe A
Sent: Monday, January 15, 2018 10:12 AM
To: Frye, Joe; trilinos-framework@software.sandia.gov
Subject: [Trilinos-Framework] Jenkins jobs overloading slave machines?
Hello Joe and Trilinos Framework team members,
It seems that Jenkins is overloading Jenkins slave machines and causing timeouts (see https://github.com/trilinos/Trilinos/issues/2028#issuecomment-357706804). It seems there is a “Job Weight” setting that is supposed to tell Jenkins how many cores a job will use (kind of like the PROCESSORS CTest property). I think this caused a bunch of timeouts on the new GCC 7.2.0 build run on the machine Gretel. There are a bunch of other jobs that are being run there as well as shown at https://jenkins-srn.sandia.gov/computer/gretel/builds . Jenkins has to be set up to not fully load (or worse overload) a test machine or multi-process MPI jobs will take much longer to run and will cause timeouts like this.
How can we get to the bottom of this?
Thanks,
-Ross
I discussed this with @jwillenbring and @fryeguy52 and one suggestion was to increase the "Job Weight" value to make sure that this job takes up an entire machine. My concern with doing that is that I am afraid that the job may not be scheduled at all. I think we need a better strategy to manage these Jenkins build machines. I will bring this up at the next CDOFA meeting.
It looks like increasing the timeout limit from 300s to 600s (i.e. 10 minutes) fixed all of the timeouts we were seeing. The build Linux-GCC-7.2.0-MPI_RELEASE_ATDM today shown at:
has all passing tests. Digging deeper and looking at the test times shown here, you can see that the most expensive test was Tempus_DIRK_Combined_FSA_MPI_1 at 7m 43s 950ms. When I ran the full test suite on an older version of Trilinos on my machine ceerws1113 and posted results to:
the test times were much smaller as shown here and the same test Tempus_DIRK_Combined_FSA_MPI_1 has the time 5m 20s 410ms. That is roughly a 45% increase in the runtime for the test (463.95 s versus 320.41 s). My guess is that if the machine is overloaded, we will see a lot of fluctuation in these test times over the coming days.
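The two timings for Tempus_DIRK_Combined_FSA_MPI_1 can be compared directly. A minimal sketch (the helper to_seconds and the variable names are illustrative) that converts both to seconds and expresses the difference relative to each baseline:

```python
def to_seconds(minutes, seconds, milliseconds):
    """Convert a CDash-style 'Xm Ys Zms' time to seconds."""
    return minutes * 60 + seconds + milliseconds / 1000.0

jenkins_time = to_seconds(7, 43, 950)  # run on the loaded Jenkins machine
local_time = to_seconds(5, 20, 410)    # run on the local workstation

difference = jenkins_time - local_time

# Slowdown relative to the faster local run.
increase_pct = 100.0 * difference / local_time
# The same difference relative to the slower Jenkins run.
reduction_pct = 100.0 * difference / jenkins_time

print(f"Jenkins run is {increase_pct:.1f}% longer than the local run")
print(f"Local run is {reduction_pct:.1f}% shorter than the Jenkins run")
```

The choice of baseline matters: the same gap reads as about 45% when measured against the local time and about 31% when measured against the Jenkins time.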
In any case, this build is now ready to elevate from Specialized to some CDash Track/Group that will send emails. As discussed in #1293, that can't be the Nightly group because that group does not send out any CDash emails. We could move this into the Clean group but I am not sure that is the right thing to do. I will suggest adding an ATDM group that will send out emails and then send it there.
Our current GCC 7.2.0 build is all passing but it is not enabling or running with OpenMP. Should it be? See https://github.com/trilinos/Trilinos/issues/2130#issuecomment-358081448.
Feedback from the EMPIRE ATDM APP lead is that we should be enabling OpenMP and testing with OMP_NUM_THREADS=4. I am trying that now.
I tried the simple enable of Trilinos_ENABLE_OpenMP=ON in the trial commit 3b609f0ad2f2ff9d253ceec058bb982b5812b93c pushed to the branch 2028-enable-openmp. I did an all-at-once submit to CDash which is shown (on the trial CDash site that supports the new all-at-once method) at:
and details are shown below.
This trial build showed a single Panzer build failure but the major problem was that almost all of the tests in downstream packages died on startup due to missing instantiations of Tpetra functions instantiated for a Serial type (but different tests showed different missing function definitions). It is not clear why the linker did not even warn about these missing symbols. In any case, the straightforward enable of OpenMP is not working at all. Getting an OpenMP build working should be a separate issue. Also, while other OpenMP builds are failing on CDash we should likely wait until those builds are worked out before pushing on this with a GCC 7.2.0 build.
FYI: I renamed the Jenkins build to Trilinos-atdm-sems-gcc-7-2-0. This is more consistent with the other ATDM builds we are setting up.
@bartlettroscoe Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.
@bartlettroscoe Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.
Is this fixable? Are there any automated builds showing this problem on the Trilinos CDash site:
Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.
@mhoemmen,
Is this the reason for the OpenMP build failures that I reported above?
@bartlettroscoe I think it could be, yes.
The GCC 7.2.0 build Linux-GCC-7.2.0-MPI_RELEASE_ATDM has been running fine until this morning when it ran on the machine "winstone" and it failed to find BLAS:
Before that it ran on the machines 'hansel' and 'gretel' (cute). It looks like those machines both have the label "RHEL6" so I will add that label to the Jenkins job:
Hopefully this will keep this from happening again. It looks like that should fix this.
I fired off the build manually again so hopefully it will resubmit and show up clean now.
I made the commit cb26a95ab884d4a3c7324a48d427cdc90f5ad1b6:
commit cb26a95ab884d4a3c7324a48d427cdc90f5ad1b6
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Mon Jan 29 19:25:03 2018 -0700
Moved GCC 7.2.0 files into subdir and other changes (#2028)
* Make the CDash build name the same as the Jenkins name
* Moved files to subdir to be consistent with other ATDM driver files and not
clutter the base dir.
* Remove extra repos
* Change CDash build name to same as Jenkins name and to be more consistent
with other ATDM build names
* Send to ATDM track
R090 cmake/ctest/drivers/atdm/ctest_linux_mpi_sems_atdm_7.2.0.cmake cmake/ctest/drivers/atdm/sems_gcc-7.2.0/ctest_linux_mpi_sems_atdm_7.2.0.cmake
R091 cmake/ctest/drivers/atdm/drive_linux_mpi_sems_atdm_7.2.0.sh cmake/ctest/drivers/atdm/sems_gcc-7.2.0/drive_linux_mpi_sems_atdm_7.2.0.sh
Now I need to watch CDash tomorrow morning to make sure that the build shows up correctly. I did some testing locally so I have high hopes that this will work.
The updated GCC 7.2.0 build showed up correctly now under the same name as the :
Next Action Status
Using SPACK-installed BLAS and LAPACK does not eliminate link warnings due to SuperLU still being linked against system BLAS and LAPACK but otherwise the Linux-GCC-7.2.0-MPI_RELEASE_ATDM. Next: Get CDash to send out failure emails for this Linux-GCC-7.2.0-MPI_RELEASE_ATDM and then create new follow-up issues and reduce scope of this issue so it can be closed ...
Description
Trilinos' Dashboard does not appear to have GCC 7.2 coverage. The latest version of GCC being tested regularly is 5.2. However, it looks like CUDA-enabled application builds on Sierra and kin will require either XL or GCC 7.2. Furthermore, our early experience with GCC 7.2 is that it is stricter about accepting code (e.g., requiring the template keyword when calling templated methods). This means that it would help to have GCC 7.2 builds on the Dashboard.
@trilinos/framework @micahahoward
Expectations
Trilinos has a GCC 7.2 Dashboard build, at least without CUDA for now, that builds all the solver packages.
Current Behavior
The latest GCC version currently exercised on the Dashboard is 5.2.
Motivation and Context
Sierra and kin need either XL or GCC 7.2 builds. Applications are already testing with GCC 7.2.
Tasks
Trilinos/cmake/std/sems/atdm/load_gcc_7.2_env.sh script and then use the SEMSEnv.cmake module to test Trilinos [DONE]
Trilinos/cmake/std/sems/atdm/load_gcc_7.2_env.sh and SEMSEnv.cmake to pull in new BLAS and LAPACK ... Does not resolve the link warnings due to superlu so we will not try to resolve right now and just keep the link warnings for now [SKIPPED]
Nightly build with this ATDM GCC 7.2.0 env to Trilinos CDash site ... (Ross)
Nightly Track/Group ...