trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Add GCC 7.2 build to Dashboard #2028

Closed mhoemmen closed 5 years ago

mhoemmen commented 6 years ago

Next Action Status

Using SPACK-installed BLAS and LAPACK does not eliminate the link warnings because SuperLU is still linked against the system BLAS and LAPACK, but otherwise the Linux-GCC-7.2.0-MPI_RELEASE_ATDM. Next: Get CDash to send out failure emails for this Linux-GCC-7.2.0-MPI_RELEASE_ATDM build, then create new follow-up issues and reduce the scope of this issue so it can be closed ...

Description

Trilinos' Dashboard does not appear to have GCC 7.2 coverage. The latest version of GCC being tested regularly is 5.2. However, it looks like CUDA-enabled application builds on Sierra and kin will require either XL or GCC 7.2. Furthermore, our early experience with GCC 7.2 is that it is stricter about accepting code (e.g., requiring the template keyword when calling templated methods). This means that it would help to have GCC 7.2 builds on the Dashboard.

@trilinos/framework @micahahoward

Expectations

Trilinos has a GCC 7.2 Dashboard build, at least without CUDA for now, that builds all the solver packages.

Current Behavior

The latest GCC version currently exercised on the Dashboard is 5.2.

Motivation and Context

Sierra and kin need either XL or GCC 7.2 builds. Applications are already testing with GCC 7.2.

Tasks

  1. Install modules in SEMS `atdm-env` project space [DONE]
  2. Create a simple custom Trilinos/cmake/std/sems/atdm/load_gcc_7.2_env.sh script and then use the SEMSEnv.cmake module to test Trilinos [DONE]
  3. Install BLAS and LAPACK from source and update Trilinos/cmake/std/sems/atdm/load_gcc_7.2_env.sh and SEMSEnv.cmake to pull in new BLAS and LAPACK ... Does not resolve the link warnings due to superlu so we will not try to resolve right now and just keep the link warnings for now [SKIPPED]
  4. Set up a CTest -S driver script to submit a Nightly build with this ATDM GCC 7.2.0 env to the Trilinos CDash site ... (Ross)
  5. Run the new CTest -S driver script from the SEMS SON or SRN Jenkins build farm or the ASC IC Jenkins build farm that posts to the Nightly Track/Group ...
  6. Determine what upgrades of MPI and TPLs should be done for updated build better targeting SPARC and EMPIRE ...
  7. Add an additional build of Trilinos with GCC 7.2.0 and updated TPLs posting to CDash ...
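
A minimal sketch of what the env-loading script from task 2 might look like. The function names `atdm_gcc72_modules` and `load_atdm_gcc72_env` are hypothetical; the module names are the ones reported by `module list` later in this thread, and the real script may differ.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, module names are copied
# from the `module list` output shown later in this thread.

# Print the modules of the ATDM GCC 7.2.0 env, one per line.
atdm_gcc72_modules() {
  cat <<'EOF'
sems-env
atdm-env
sems-python/2.7.9
sems-git/2.10.1
atdm-gcc/7.2.0
atdm-openmpi/1.6.5/atdm
atdm-boost/1.63.0/atdm
atdm-zlib/1.2.8/atdm
atdm-hdf5/1.8.12/atdm
atdm-netcdf/4.4.1/atdm
atdm-parmetis/4.0.3/atdm
atdm-scotch/6.0.3/atdm
atdm-superlu/4.3/atdm
sems-cmake/3.5.2
EOF
}

# Purge the current env and load the modules above in order.
# Does nothing useful on a machine without the `module` command.
load_atdm_gcc72_env() {
  if ! type module >/dev/null 2>&1; then
    echo "module command not found; cannot load the env" >&2
    return 1
  fi
  module purge
  local m
  for m in $(atdm_gcc72_modules); do
    module load "$m" || return 1
  done
}
```

Because `module` changes the calling shell's environment, a script like this would be sourced rather than executed, and `load_atdm_gcc72_env` called afterwards.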
nmhamster commented 6 years ago

@mhoemmen - for POWER we may need to specifically use IBM's Advance Toolchain which provides an optimized and patched GCC 7.2. I agree that early 7.2 testing would be useful.

mhoemmen commented 6 years ago

@micahahoward specifically mentioned IBM's Toolchain. He has been testing with stock GCC 7.2, no CUDA, just to make sure that stuff builds.

bartlettroscoe commented 6 years ago

@mhoemmen, I think that SEMS would need to add a GCC 7.2 build for the SEMS dev env before this could happen. I will note this specific request in:

bartlettroscoe commented 6 years ago

The SEMS issue to add a GCC 7.2 build of the env is:

mhoemmen commented 6 years ago

Thanks @bartlettroscoe ! :-D I'll let @micahahoward know.

mhoemmen commented 6 years ago

@rrdrake and @prwolfe may also be interested.

bartlettroscoe commented 6 years ago

@mhoemmen, specifically, on what machine and in what env is this customer trying to build Trilinos with GCC 7.2?

bartlettroscoe commented 6 years ago

@mhoemmen, how urgent? Would the GCC 7.2 modules on the CEE LAN work for setting up this testing with the ASC IC Jenkins build farm? If not, then SEMS would need to install a GCC 7.2 stack.

mhoemmen commented 6 years ago

@bartlettroscoe SPARC is the customer; ATS-2 the target machine. I'll let @micahahoward answer questions about urgency. Thanks!

rppawlo commented 6 years ago

So we've come across a bug in gcc 5 and 6 that breaks our use of hierarchic parallelism in kokkos. Having a sems 7.2 module would be valuable to EMPIRE/Drekar/Charon2/Panzer teams as we migrate towards hierarchic parallelism. The workaround is a very ugly hack in the source code. No pressure on a timeline as we will use the hack for now.

micahahoward commented 6 years ago

@mhoemmen , @nmhamster , @bartlettroscoe : Re urgency: This is forward looking and not that urgent. It's driven by wanting to be ready for the IBM Advance Toolchain when it's ready to go on ATS-2-like testbeds. Having Trilinos tested with GCC 7.2 by early Jan would be good, which means starting to do that now is about right.

FWIW, I was able to build our configuration of Trilinos with 7.2 and get all of our application code tests to pass without any significant issues. Having some assurance that this remains true is what I'd like to have.

bartlettroscoe commented 6 years ago

@micahahoward, @mhoemmen,

> Having Trilinos tested with GCC 7.2 by early Jan would be good, which means starting to do that now is about right.

Okay then. I will put in an official SEMS request to set up a GCC 7.2 env and put a place-holder issue for this in

I will request that this env be available by early Jan.

nmhamster commented 6 years ago

@bartlettroscoe @mhoemmen @micahahoward - The Blake (Sky Lake) test bed has a very initial GCC 7.2.0 environment for Trilinos builds. Note that #2041 is filed for a Teuchos bug I found this weekend.

nmhamster commented 6 years ago

And @bartlettroscoe - I think there is a TriBITS bug (OpenMP Detection for Fortran when using GCC 7.2.0 Fails #244, https://github.com/TriBITSPub/TriBITS/issues/244) I have raised as well but this may not be connected to GCC 7.2.0.

mhoemmen commented 6 years ago

@nmhamster git blame claims that @hkthorn fixed the missing #include <vector> on 29 Nov.

nmhamster commented 6 years ago

I ran a set of tests from the GCC 7.2.0 build on Blake overnight. The following failed:

Label Time Summary:
Amesos       =   4.87 sec (9 tests)
Amesos2      =   6.97 sec (9 tests)
AztecOO      =   6.29 sec (11 tests)
Belos        =  92.86 sec (64 tests)
Epetra       =  24.80 sec (52 tests)
EpetraExt    =   6.41 sec (10 tests)
Ifpack       =  23.98 sec (41 tests)
Ifpack2      =  40.68 sec (35 tests)
Kokkos       =  47.64 sec (23 tests)
ML           =   9.81 sec (16 tests)
MueLu        = 598.66 sec (80 tests)
SEACAS       =   9.34 sec (7 tests)
STK          =   1.41 sec (2 tests)
Shards       =   0.18 sec (1 test)
Teuchos      =  71.16 sec (126 tests)
Tpetra       = 200.37 sec (135 tests)
Zoltan       = 473.94 sec (26 tests)
Zoltan2      =  67.39 sec (94 tests)

Total Test time (real) = 1686.72 sec

The following tests FAILED:
    137 - TeuchosNumerics_LAPACK_test_MPI_1 (Not Run)
    646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
    647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)

Verbose output from the failing tests is below. We know the Teuchos numerics break because of issue #2041.

ctest -R Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 -V
UpdateCTestConfiguration  from :/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/home/projects/x86-64/cmake/3.9.0/bin/cmake
UpdateCTestConfiguration  from :/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 646
    Start 646: Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4

646: Test command: /ascldap/users/projects/x86-64-skylake/openmpi/2.1.2/gcc/7.2.0/bin/mpiexec "-np" "4" "--map-by" "numa:PE=4" "/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/packages/ifpack2/test/unit_tests/Ifpack2_BlockTriDiContainerUnitAndPerfTests.exe"
646: Test timeout computed to be: 1500
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646:   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646:   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646:   For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646:   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646:   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646:   For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646:   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646:   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646:   For unit testing set OMP_PROC_BIND=false
646: Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
646:   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
646:   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
646:   For unit testing set OMP_PROC_BIND=false
646: <I> nranks 4 ni 10 nj 10 nk 10 bs 5 nrhs 1 isplit 4 jsplit 1 nthreads 8
646:
646: ***
646: *** Unit test suite ...
646: ***
646:
646:
646: Sorting tests by group name then by the order they were added ... (time = 5.95e-06)
646:
646: Running unit tests ...
646:
646: [node01:36640] *** Process received signal ***
646: [node01:36638] *** Process received signal ***
646: [node01:36639] *** Process received signal ***
646: --------------------------------------------------------------------------
646: mpiexec noticed that process rank 1 with PID 36638 on node node01 exited on signal 11 (Segmentation fault).
646: --------------------------------------------------------------------------
1/1 Test #646: Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 ...***Failed  Required regular expression not found.Regex=[End Result: TEST PASSED
]  3.52 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
Ifpack2    =   3.52 sec (1 test)

Total Test time (real) =   4.20 sec

The following tests FAILED:
    646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
Errors while running CTest

And:

ctest -R Ifpack2_BlockRelaxationPerformance_MPI_1 -V
UpdateCTestConfiguration  from :/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
 Add coverage exclude regular expressions.
SetCTestConfiguration:CMakeCommand:/home/projects/x86-64/cmake/3.9.0/bin/cmake
UpdateCTestConfiguration  from :/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Parse Config file:/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/DartConfiguration.tcl
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 647
    Start 647: Ifpack2_BlockRelaxationPerformance_MPI_1

647: Test command: /ascldap/users/projects/x86-64-skylake/openmpi/2.1.2/gcc/7.2.0/bin/mpiexec "-np" "1" "--map-by" "numa:PE=4" "/home/sdhammo/git/trilinos-github-repo/build-devpack-gcc-720/packages/ifpack2/test/unit_tests/Ifpack2_BlockRelaxationPerformance.exe"
647: Test timeout computed to be: 1500
647: Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name node01.blake.sandia.gov and rank 0!
647:
647: ***
647: *** Unit test suite ...
647: ***
647:
647:
647: Sorting tests by group name then by the order they were added ... (time = 2.97e-06)
647:
647: Running unit tests ...
647:
647: 0. Ifpack2BlockRelaxation_Performance_UnitTest ... Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
647:   In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
647:   For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
647:   For unit testing set OMP_PROC_BIND=false
647:
647:  Testing Ifpack2 block G-S
647:      Testing block size: 2
647:  (24 trials)
647:      Testing block size: 4
647:  (44 trials)
647:      Testing block size: 5
647:  (56 trials)
647:      Testing block size: 8
647:  (82 trials)
647:      Testing block size: 10
647:  (101 trials)
647:      Testing block size: 20
647:
647:  p=0: *** Caught standard std::exception of type 'std::runtime_error' :
647:
647:   /home/sdhammo/git/trilinos-github-repo/packages/ifpack2/src/Ifpack2_TriDiContainer_def.hpp:259:
647:
647:   Throw number = 1
647:
647:   Throw test that evaluated to true: INFO > 0
647:
647:   Ifpack2::TriDiContainer::factor: LAPACK's _GTTRF (LU factorization with partial pivoting) reports that the computed U factor is exactly singular.  U(1,1) (one-based index i) is exactly zero.  This probably means that the input matrix has a singular diagonal block.
647:  [FAILED]  (3.97 sec) Ifpack2BlockRelaxation_Performance_UnitTest
647:  Location: /home/sdhammo/git/trilinos-github-repo/packages/ifpack2/test/unit_tests/Ifpack2_UnitTestBlockRelaxationPerf.cpp:107
647:
647:
647: The following tests FAILED:
647:     0. Ifpack2BlockRelaxation_Performance_UnitTest ...
647:
647: Total Time: 3.97 sec
647:
647: Summary: total = 1, run = 1, passed = 0, failed = 1
647:
647: End Result: TEST FAILED
647: -------------------------------------------------------
647: Primary job  terminated normally, but 1 process returned
647: a non-zero exit code.. Per user-direction, the job has been aborted.
647: -------------------------------------------------------
647: --------------------------------------------------------------------------
647: mpiexec detected that one or more processes exited with non-zero status, thus causing
647: the job to be terminated. The first process to do so was:
647:
647:   Process name: [[37086,1],0]
647:   Exit code:    1
647: --------------------------------------------------------------------------
1/1 Test #647: Ifpack2_BlockRelaxationPerformance_MPI_1 ...***Failed    6.71 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
Ifpack2    =   6.71 sec (1 test)

Total Test time (real) =   7.19 sec

The following tests FAILED:
    647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)
Errors while running CTest
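
The rerun pattern used above (take each entry under "The following tests FAILED:" and run it again with `ctest -R <name> -V`) can be sketched as a pair of small filters. The function names `failed_tests` and `rerun_commands` are hypothetical, and the sketch assumes the summary lines have the `<number> - <name> (<status>)` shape shown in the logs above.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical.

# Read a saved ctest log on stdin and print the failed test names,
# i.e. the lines of the form "    646 - SomeTest_MPI_4 (Failed)".
failed_tests() {
  sed -n 's/^[[:space:]]*[0-9][0-9]* - \([^ ]*\) (.*)$/\1/p'
}

# Turn test names on stdin into verbose rerun commands.
# The name is anchored with ^...$ so that only that exact test is selected.
rerun_commands() {
  local name
  while read -r name; do
    printf "ctest -R '^%s\$' -V\n" "$name"
  done
}

# Demo on the summary reported above.
cat <<'EOF' | failed_tests | rerun_commands
The following tests FAILED:
    137 - TeuchosNumerics_LAPACK_test_MPI_1 (Not Run)
    646 - Ifpack2_BlockTriDiContainerUnitAndPerfTests_MPI_4 (Failed)
    647 - Ifpack2_BlockRelaxationPerformance_MPI_1 (Failed)
EOF
```

The printed commands are meant to be run from the build directory. As the Kokkos warning in the logs suggests, setting `OMP_PROC_BIND=false` before rerunning unit tests may also be worthwhile.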
mhoemmen commented 6 years ago

@ambrad Huh, is that your Ifpack2 code? I can help.

nmhamster commented 6 years ago

For Sky Lake we will also need to address issue #2050.

bartlettroscoe commented 6 years ago

FYI: I created the targeted issue:

to make sure we target this in the CDOFA process.

bartlettroscoe commented 6 years ago

A GCC 7.2.0 build env was installed by @fryeguy52 on the SEMS NFS mount under the atdm project area. I set up a simple shell script Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh in the branch atdm-gcc-7.2.0-2028 pushed to the GitHub repo git@github.com:bartlettroscoe/Trilinos.git. I tried the full build of all of the PT packages. The configure initially failed due to what seems to be a defect in FIND_LIBRARY() in CMake 3.10.0 (see details below). But with CMake 3.5.2, the configure passed. With this initial env, the build then failed in the EpetraExt HDF5 adapters. This build was posted to CDash at:

This one build failure resulted in a lot of test link failures so a lot of tests never even ran. Since the ATDM build of Trilinos turns off HDF5 support in EpetraExt, I will disable it and try again.

DETAILED NOTES

I created the Trilinos branch `rab-github atdm-gcc-7.2.0-2028` with the shell script `Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh` and pushed it to my fork of Trilinos:

```
To github.com:bartlettroscoe/Trilinos.git
 * [new branch]      atdm-gcc-7.2.0-2028 -> atdm-gcc-7.2.0-2028
```

I then used the simple `do-configure` script:

```
#!/bin/bash
cmake \
-DBUILD_SHARED_LIBS=ON \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DDART_TESTING_TIMEOUT:STRING=300.0 \
-DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON \
-DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \
-DCTEST_BUILD_FLAGS=-j10 \
-DCTEST_PARALLEL_LEVEL=10 \
"$@" \
../../../Trilinos
```

and did the configure of Trilinos on ceerws1113 with:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/
$ module purge
$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh
$ module list
Currently Loaded Modulefiles:
  1) sems-env                  8) atdm-boost/1.63.0/atdm
  2) atdm-env                  9) atdm-zlib/1.2.8/atdm
  3) sems-python/2.7.9        10) atdm-hdf5/1.8.12/atdm
  4) atdm-cmake/3.10.1        11) atdm-netcdf/4.4.1/atdm
  5) sems-git/2.10.1          12) atdm-parmetis/4.0.3/atdm
  6) atdm-gcc/7.2.0           13) atdm-scotch/6.0.3/atdm
  7) atdm-openmpi/1.6.5/atdm  14) atdm-superlu/4.3/atdm
$ ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out
```

This resulted in the configure failure:

```
Processing enabled TPL: Scotch (enabled explicitly, disable with -DTPL_ENABLE_Scotch=OFF)
-- Scotch_LIBRARY_NAMES='ptscotch;ptscotcherr;scotch;scotcherr'
-- Searching for libs in Scotch_LIBRARY_DIRS='/projects/sems/install/rhel6-x86_64/atdm/tpl/scotch/6.0.3/gcc/7.2.0/openmpi/1.6.5/atdm/lib}'
-- Searching for a lib in the set "ptscotch":
-- Searching for lib 'ptscotch' ...
-- NOTE: Did not find a lib in the lib set "ptscotch" for the TPL 'Scotch'!
-- ERROR: Could not find the libraries for the TPL 'Scotch'!
-- TIP: If the TPL 'Scotch' is on your system then you can set:
     -DScotch_LIBRARY_DIRS=';;...'
   to point to the directories where these libraries may be found.
   Or, just set:
     -DTPL_Scotch_LIBRARIES=';;...'
   to point to the full paths for the libraries which will
   bypass any search for libraries and these libraries will be used without
   question in the build. (But this will result in a build-time error
   if not all of the necessary symbols are found.)
-- ERROR: Failed finding all of the parts of TPL 'Scotch' (see above), Aborting!
-- Performing Test HAVE_SCOTCH_VERSION_6_0_3
-- Performing Test HAVE_SCOTCH_VERSION_6_0_3 - Success
-- NOTE: The find module file for this failed TPL 'Scotch' is:
     /scratch/rabartl/Trilinos.base/Trilinos/cmake/TPLs/FindTPLScotch.cmake
   which is pointed to in the file:
     /scratch/rabartl/Trilinos.base/Trilinos/TPLsList.cmake
TIP: Even though the TPL 'Scotch' was explicitly enabled in input,
it can be disabled with:
  -DTPL_ENABLE_Scotch=OFF
which will disable it and will recursively disable all of the
downstream packages that have required dependencies on it.
When you reconfigure, just grep the cmake stdout for 'Scotch'
and then follow the disables that occur as a result to see what impact
this TPL disable has on the configuration of Trilinos.
CMake Error at cmake/tribits/core/package_arch/TribitsProcessEnabledTpl.cmake:144 (MESSAGE):
  ERROR: TPL_Scotch_NOT_FOUND=TRUE, aborting!
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:1689 (TRIBITS_PROCESS_ENABLED_TPL)
  cmake/tribits/core/package_arch/TribitsProjectImpl.cmake:202 (TRIBITS_PROCESS_ENABLED_TPLS)
  cmake/tribits/core/package_arch/TribitsProject.cmake:93 (TRIBITS_PROJECT_IMPL)
  CMakeLists.txt:93 (TRIBITS_PROJECT)
-- Configuring incomplete, errors occurred!
See also "/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/CMakeFiles/CMakeOutput.log".
```

Looking at:

```
$ echo /projects/sems/install/rhel6-x86_64/atdm/tpl/scotch/6.0.3/gcc/7.2.0/openmpi/1.6.5/atdm/lib
$ ls -w 50 $SEMS_SCOTCH_LIBRARY_PATH
libptscotch.a          libscotch.a
libptscotcherr.a       libscotcherr.a
libptscotcherrexit.a   libscotcherrexit.a
libptscotchparmetis.a  libscotchmetis.a
```

the libraries are there, so the search should have found them. It turns out that this configure failure with CMake 3.10.0 does not occur when loading CMake 3.5.2. This looks to be the same CMake defect that I had to hack a fix for in commit 0960c822a186216ef394fee5b8e9efce50d7585c. We need to get Kitware to fix this. Otherwise, we will not be able to do that all-at-once build of Trilinos!

I addressed this, for now, by just using CMake 3.5.2 and updated the script:

```
To github.com:bartlettroscoe/Trilinos.git
   a7a4273..c677baf  atdm-gcc-7.2.0-2028 -> atdm-gcc-7.2.0-2028
```

The configure of Trilinos then passed using CMake 3.5.2. However, the build failed with:

```
/scratch/rabartl/Trilinos.base/Trilinos/packages/epetraext/src/inout/EpetraExt_HDF5.cpp: In member function ‘void EpetraExt::HDF5::Create(std::__cxx11::string)’:
/scratch/rabartl/Trilinos.base/Trilinos/packages/epetraext/src/inout/EpetraExt_HDF5.cpp:347:5: error: ‘H5Pset_fapl_mpio’ was not declared in this scope
     H5Pset_fapl_mpio(plist_id_, mpiComm, MPI_INFO_NULL);
     ^~~~~~~~~~~~~~~~
/scratch/rabartl/Trilinos.base/Trilinos/packages/epetraext/src/inout/EpetraExt_HDF5.cpp:347:5: note: suggested alternative: ‘H5Pset_fapl_stdio’
     H5Pset_fapl_mpio(plist_id_, mpiComm, MPI_INFO_NULL);
     ^~~~~~~~~~~~~~~~
     H5Pset_fapl_stdio
/scratch/rabartl/Trilinos.base/Trilinos/packages/epetraext/src/inout/EpetraExt_HDF5.cpp: In member function ‘void EpetraExt::HDF5::Open(std::__cxx11::string, int)’:
/scratch/rabartl/Trilinos.base/Trilinos/packages/epetraext/src/inout/EpetraExt_HDF5.cpp:382:5: error: ‘H5Pset_fapl_mpio’ was not declared in this scope
     H5Pset_fapl_mpio(plist_id_, MPI_COMM_WORLD, MPI_INFO_NULL);
     ^~~~~~~~~~~~~~~~
```

and other errors like this in this file. That may not be all of the build failures but that is at least one.

I am now submitting to CDash to show the full set of build errors with:

```
$ time env Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE make dashboard &> make.dashboard.out

real    35m50.519s
user    55m41.719s
sys     12m35.309s
```

This posted results to:

* https://testing.sandia.gov/cdash/index.php?project=Trilinos&display=project&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180105-0146-Experimental

This shows just a single *.cpp file build failure at:

* https://testing.sandia.gov/cdash/viewBuildError.php?buildid=3312863

which is the same build failure shown above.

But note that `EpetraExt_ENABLE_HDF5=OFF` is set in the ATDM build of Trilinos (i.e. used by EMPIRE), so we could likely disable HDF5 support in Trilinos. Therefore, I will try building with `-DEpetraExt_ENABLE_HDF5=OFF` and see what happens.

mhoemmen commented 6 years ago

@bartlettroscoe Awesome!!! Thanks so much for setting this up!!! :-D

bartlettroscoe commented 6 years ago

FYI: I created the issue https://gitlab.kitware.com/snl/project-1/issues/47 to see if Kitware can look into the apparent defect with FIND_LIBRARY() with CMake 3.10 and get this fixed. Until it is fixed, we will need to explicitly list the full set of TPL libraries for SCOTCH, and perhaps other TPLs as well, to get around this. Without a fix or that workaround, we will not be able to use the all-at-once configure, build, and test and submit nicely to the new CDash site.
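
A minimal sketch of that workaround: build the semicolon-separated value for `-DTPL_Scotch_LIBRARIES` (the option named in the TriBITS TIP message) from a directory and the library names, so that no library search is needed. The helper name `tpl_library_list` is hypothetical, and the sketch assumes static `lib<name>.a` files as in the SEMS Scotch install.

```shell
#!/bin/bash
# Sketch only: tpl_library_list is a hypothetical helper.

# Usage: tpl_library_list <lib-dir> <name>...
# Prints "<lib-dir>/lib<name1>.a;<lib-dir>/lib<name2>.a;..."
tpl_library_list() {
  local dir="$1"
  shift
  local list="" name
  for name in "$@"; do
    # Prefix with ";" only when the list already has an entry.
    list="${list:+$list;}${dir}/lib${name}.a"
  done
  printf '%s\n' "$list"
}

# Demo with the directory and names from the configure output in this thread.
scotch_dir=/projects/sems/install/rhel6-x86_64/atdm/tpl/scotch/6.0.3/gcc/7.2.0/openmpi/1.6.5/atdm/lib
tpl_library_list "$scotch_dir" ptscotch ptscotcherr scotch scotcherr
```

The result could then be passed to the configure as `-DTPL_Scotch_LIBRARIES="$(tpl_library_list "$SEMS_SCOTCH_LIBRARY_PATH" ptscotch ptscotcherr scotch scotcherr)"`. The quotes matter because the value contains semicolons.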

bartlettroscoe commented 6 years ago

I reran the full Trilinos CI PT build again with the GCC 7.2.0 env with -DEpetraExt_ENABLE_HDF5=OFF and the build passed and all of the tests passed as shown at:

However, it produces a ton of link warnings that say:

/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4

I know that the SEMS policy is not to build and install BLAS and LAPACK but it looks like that might be necessary with newer versions of GCC.
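
To see which libraries are behind these warnings, a small filter over the saved build output can help. This is a sketch only: the function name `gfortran_conflicts` is hypothetical and it assumes the warning text has exactly the form shown above.

```shell
#!/bin/bash
# Sketch only: gfortran_conflicts is a hypothetical helper.

# Read build output on stdin and print one line per distinct library
# that needs a libgfortran version other than the one being linked.
gfortran_conflicts() {
  sed -n 's/^.*warning: \(libgfortran\.so\.[0-9]*\), needed by \([^,]*\), may conflict with \(libgfortran\.so\.[0-9]*\).*$/\2: needs \1, conflicts with \3/p' \
    | sort -u
}

# Demo on the warning reported above (repeated, as it is in a real log).
cat <<'EOF' | gfortran_conflicts
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
EOF
```

On the build machine, `ldd /usr/lib64/liblapack.so | grep gfortran` should show the same dependency directly.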

@fryeguy52,

Can you please look into adding the install of BLAS/LAPACK to this ATDM SEMS GCC 7.2.0 env? I think that would eliminate these link warnings. In any case, you should have the exact reproducibility instructions.

I went ahead and moved the script and pushed in the commit 259322e9a92b9ad8019f9e3f1c945ed51a487684:

commit 259322e9a92b9ad8019f9e3f1c945ed51a487684
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Thu Jan 4 14:02:43 2018 -0700

    Add env script for SEMS ATDM GCC 7.2 env (#2028)

    This is a drop-in replacement for load_sems_dev_env.sh to be able to configure
    and build Trilinos, but with a custom set of modules.  Just soruce this and
    then you can use the SEMSEnv.cmake module.

    NOTE: Had to switch from CMake 3.10.0 to CMake 3.5.1 to avoid FIND_LIBRARY()
    defect (#2028).

    We will need to get Kitware to fix that defect otherwise will not be able to
    use the all-at-once configure, build, test, and submit to CDash :-(

    Or, we can just hack the SEMSEnv.cmake file to avoid the find like in commit
    0960c822a186216ef394fee5b8e9efce50d7585c.

A       cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
DETAILED NOTES

Doing this over again on `ceerws1113`, this time with HDF5 support disabled in EpetraExt, I did:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/
$ module purge
$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh
$ module list
Currently Loaded Modulefiles:
  1) sems-env
  2) atdm-env
  3) sems-python/2.7.9
  4) sems-git/2.10.1
  5) atdm-gcc/7.2.0
  6) atdm-openmpi/1.6.5/atdm
  7) atdm-boost/1.63.0/atdm
  8) atdm-zlib/1.2.8/atdm
  9) atdm-hdf5/1.8.12/atdm
 10) atdm-netcdf/4.4.1/atdm
 11) atdm-parmetis/4.0.3/atdm
 12) atdm-scotch/6.0.3/atdm
 13) atdm-superlu/4.3/atdm
 14) sems-cmake/3.5.2
$ rm -r CMake*
$ ./do-configure \
  -DEpetraExt_ENABLE_HDF5=OFF -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  &> configure.out
```

I then submitted to CDash with:

```
$ time env Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE \
  make dashboard &> make.dashboard.out

real    71m29.015s
user    227m51.109s
sys     22m10.899s
```

and the results showed up on CDash at:

* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-01-05&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180105-1357-Experimental

This shows 100% passing build and 2605 passing tests! However, it shows a bunch of warnings of the form:

```
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
```

I think this means that the GCC 7.2.0 env should include BLAS and LAPACK built against libgfortran.so.4 (which comes with GCC 7.2.0?) instead of relying on the RHEL6 installed BLAS and LAPACK under /usr/lib64.

bartlettroscoe commented 6 years ago

@fryeguy52,

It just occurred to me that an easy way to install BLAS and LAPACK is to use SPACK:

I just did this with:

$ cd $HOME/SPACK.base/
$ git clone git@github.com:spack/spack.git
$ cd spack/bin
$ ./spack install lapack -j16
==> Installing openblas
==> Using cached archive: /home/rabartl/SPACK.base/spack/var/spack/cache/openblas/openblas-0.2.20.tar.gz
==> Staging archive: /home/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb/v0.2.20.tar.gz
==> Created stage in /home/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb
==> Applied patch make.patch
==> Building openblas [MakefilePackage]
==> Executing phase: 'edit'
==> Executing phase: 'build'
==> Executing phase: 'install'
==> Successfully installed openblas
  Fetch: 0.02s.  Build: 8m 34.39s.  Total: 8m 34.41s.
[+] /home/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-4.8.3/openblas-0.2.20-foilt3stjn4aqxpsk3asyanbjphwcuwb

I need to do this with the ATDM SEMS GCC 7.2.0 module loaded first and then test that it works with our build of Trilinos but my guess is that will work.

At the very least you can look at what SPACK does to install BLAS and LAPACK and then duplicate this in the SEMS TPL installer infrastructure. Hopefully that is not too hard.

bartlettroscoe commented 6 years ago

I created an official SEMS request to install BLAS and LAPACK for GCC 7.2.0 in:

But we might just look into calling SPACK? But SPACK has its own ideas of a directory structure that may not be compatible with the SEMS way of installing these so it might not be as easy as just calling SPACK.

bartlettroscoe commented 6 years ago

@fryeguy52,

I installed OpenBLAS using SPACK for GCC 7.2.0 on my local machine as described below and it seems to work without generating any link warnings about libgfortran.so version incompatibilities.

Can you just run SPACK inside of the NFS-mounted drive under the atdm project area as I demonstrated below (see details) and then repeat the test build I show below? Let me know how that goes. If it works, then I will update the script Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh and SEMSDevEnv.cmake to use that version of BLAS and LAPACK. Then we can get a nightly build going for Trilinos with this version.
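
A minimal sketch of how the configure flags for such a BLAS/LAPACK could be derived from a SPACK install directory. The helper name `openblas_tpl_flags` is hypothetical; `TPL_BLAS_LIBRARIES` and `TPL_LAPACK_LIBRARIES` are the options used in the detailed notes, and the sketch assumes the install provides a single shared library matching `libopenblas*.so` that serves as both BLAS and LAPACK.

```shell
#!/bin/bash
# Sketch only: openblas_tpl_flags is a hypothetical helper.

# Usage: openblas_tpl_flags <openblas-lib-dir>
# Prints the two -D options pointing BLAS and LAPACK at the first
# libopenblas*.so found in the given directory.
openblas_tpl_flags() {
  local lib_dir="$1" lib
  for lib in "$lib_dir"/libopenblas*.so; do
    [ -e "$lib" ] || continue   # the glob did not match anything
    printf -- '-DTPL_BLAS_LIBRARIES=%s\n-DTPL_LAPACK_LIBRARIES=%s\n' "$lib" "$lib"
    return 0
  done
  echo "no libopenblas*.so found under $lib_dir" >&2
  return 1
}
```

With the env variable from the detailed notes, `./do-configure $(openblas_tpl_flags "$SPACK_LAPACK_LIB_DIR") ...` would pass both options, provided the path contains no spaces.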

I will go ahead and do a full nightly build of Trilinos and submit to CDash.

DETAILED NOTES

I will try to use SPACK to install openblas BLAS/LAPACK with GCC 7.2.0 and see if it resolves the link warnings with libgfortran.so. So on ceerws1113, I do:

```
$ cd /scratch/rabartl/
$ mkdir SPACK.base
$ cd SPACK.base
$ git clone git@github.com:spack/spack.git
$ cd spack/bin
$ source \
  /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
$ ./spack install lapack -j16
==> Installing openblas
==> Fetching http://github.com/xianyi/OpenBLAS/archive/v0.2.20.tar.gz
######################################################################## 100.0%
==> Staging archive: /scratch/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/v0.2.20.tar.gz
==> Created stage in /scratch/rabartl/SPACK.base/spack/var/spack/stage/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q
==> Applied patch make.patch
==> Building openblas [MakefilePackage]
==> Executing phase: 'edit'
==> Executing phase: 'build'
==> Executing phase: 'install'
==> Successfully installed openblas
  Fetch: 0.03s.  Build: 15m 21.55s.  Total: 15m 21.58s.
[+] /scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q
```

Now to configure Trilinos with the ATDM SEMS GCC 7.2.0 env but use that BLAS and LAPACK on ceerws1113. I will start with just Teuchos with:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT
$ module purge
$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
$ module list
Currently Loaded Modulefiles:
  1) sems-env                  8) atdm-zlib/1.2.8/atdm
  2) atdm-env                  9) atdm-hdf5/1.8.12/atdm
  3) sems-python/2.7.9        10) atdm-netcdf/4.4.1/atdm
  4) sems-git/2.10.1          11) atdm-parmetis/4.0.3/atdm
  5) atdm-gcc/7.2.0           12) atdm-scotch/6.0.3/atdm
  6) atdm-openmpi/1.6.5/atdm  13) atdm-superlu/4.3/atdm
  7) atdm-boost/1.63.0/atdm   14) sems-cmake/3.5.2
$ rm -r CMake*
$ export SPACK_LAPACK_LIB_DIR=/scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib
$ time ./do-configure -DTrilinos_ENABLE_Teuchos=ON \
  -DTPL_BLAS_LIBRARIES=$SPACK_LAPACK_LIB_DIR/libopenblas_sandybridge-r0.2.20.so \
  -DTPL_LAPACK_LIBRARIES=$SPACK_LAPACK_LIB_DIR/libopenblas_sandybridge-r0.2.20.so \
  &> configure.out

real    0m20.497s
user    0m10.358s
sys     0m4.477s
```

This showed in the configure output:

```
Processing enabled TPL: BLAS (enabled explicitly, disable with -DTPL_ENABLE_BLAS=OFF)
-- BLAS_LIBRARY_NAMES='blas blas_win32'
-- TPL_BLAS_LIBRARIES='/scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib/libopenblas_sandybridge-r0.2.20.so'
Processing enabled TPL: LAPACK (enabled explicitly, disable with -DTPL_ENABLE_LAPACK=OFF)
-- LAPACK_LIBRARY_NAMES='lapack lapack_win32'
-- TPL_LAPACK_LIBRARIES='/scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib/libopenblas_sandybridge-r0.2.20.so'
```

That should use the BLAS and LAPACK that I want. Now to build an executable with BLAS and LAPACK and see what happens:

```
$ cd packages/teuchos/numerics/test/LAPACK/
$ make VERBOSE=1 -j16 TeuchosNumerics_LAPACK_test
[...]
[100%] Linking CXX executable TeuchosNumerics_LAPACK_test.exe
cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/numerics/test/LAPACK && /projects/sems/install/rhel6-x86_64/sems/utility/cmake/3.5.2/bin/cmake -E cmake_link_script CMakeFiles/TeuchosNumerics_LAPACK_test.dir/link.txt --verbose=1
/projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/openmpi/1.6.5/bin/mpicxx -pedantic -Wall -Wno-long-long -Wwrite-strings -Wshadow -Woverloaded-virtual -g -std=c++11 -O3 -DNDEBUG CMakeFiles/TeuchosNumerics_LAPACK_test.dir/cxx_main.cpp.o -o TeuchosNumerics_LAPACK_test.exe -rdynamic ../../src/libteuchosnumerics.so.12.13 ../../../../../liblast_lib.a ../../../comm/src/libteuchoscomm.so.12.13 ../../../parameterlist/src/libteuchosparameterlist.so.12.13 ../../../parser/src/libteuchosparser.so.12.13 ../../../core/src/libteuchoscore.so.12.13 ../../../../kokkos/core/src/libkokkoscore.so.12.13 /usr/lib64/libdl.so /scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib/libopenblas_sandybridge-r0.2.20.so -lgomp -lgfortran -ldl -Wl,-rpath,/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/numerics/src:/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/comm/src:/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/parameterlist/src:/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/parser/src:/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/teuchos/core/src:/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/kokkos/core/src:/scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib
make[3]: Leaving directory `/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT'
[100%] Built target TeuchosNumerics_LAPACK_test
make[2]: Leaving directory `/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT'
/projects/sems/install/rhel6-x86_64/sems/utility/cmake/3.5.2/bin/cmake -E cmake_progress_start /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/CMakeFiles 0
make[1]: Leaving directory `/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT'
```

See, no link warnings! And the test runs and passes:

```
$ ./TeuchosNumerics_LAPACK_test.exe -v
Teuchos in Trilinos 12.13 (Dev)
GESV test ... passed!
LAPY2 test ... passed!
STEQR test ... Passed! ( Lambda min: expected 1, computed 1; Lambda max: expected 1031, computed 1031)
ILAENV test ... passed!
End Result: TEST PASSED
```

Yea for SPACK!
bartlettroscoe commented 6 years ago

Interestingly, when I updated Trilinos and did the full build again, I am already getting a build error in Kokkos with GCC 7.2.0 as shown in:

where the build error shows:

/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp: In function ‘int main(int, char**)’:
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: error: no match for ‘operator<<’ (operand types are ‘std::basic_ostream<char>::__ostream_type {aka std::basic_ostream<char>}’ and ‘std::ostringstream {aka std::__cxx11::basic_ostringstream<char>}’)
     cout << "FAILED:" << endl
     ~~~~~~~~~~~~~~~~~~~~~~~~~
          << "  Expected output:" << endl
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          << expectedOutput << endl
          ~~~~~~~~~~~~~~~~~~~~~~~~~
          << "  Actual output:" << endl
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          << hookOutput << endl;
          ^~~~~~~~~~~~~
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: note: candidate: operator<<(int, int) <built-in>
/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:136:10: note:   no known conversion for argument 2 from ‘std::ostringstream {aka std::__cxx11::basic_ostringstream<char>}’ to ‘int’
In file included from /projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/base/include/c++/7.2.0/iostream:39:0,
                 from /scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp:46:
/projects/sems/install/rhel6-x86_64/atdm/compiler/gcc/7.2.0/base/include/c++/7.2.0/ostream:108:7: note: candidate: std::basic_ostream<_CharT, _Traits>::__ostream_type& std::basic_ostream<_CharT, _Traits>::operator<<(std::basic_ostream<_CharT, _Traits>::__ostream_type& (*)(std::basic_ostream<_CharT, _Traits>::__ostream_type&)) [with _CharT = char; _Traits = std::char_traits<char>; std::basic_ostream<_CharT, _Traits>::__ostream_type = std::basic_ostream<char>]
       operator<<(__ostream_type& (*__pf)(__ostream_type&))
[...]

It looks like the breaking commit is:

0607bcd "Kokkos: Add Kokkos::push_finalize_hook function & tests (#2129)"
Author: Mark Hoemmen <mhoemmen@users.noreply.github.com>
Date:   Fri Jan 5 10:39:30 2018 -0700 (5 days ago)

M       packages/kokkos/core/src/Kokkos_Core.hpp
M       packages/kokkos/core/src/impl/Kokkos_Core.cpp
M       packages/kokkos/core/unit_test/CMakeLists.txt
M       packages/kokkos/core/unit_test/Makefile
A       packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook.cpp
A       packages/kokkos/core/unit_test/UnitTest_PushFinalizeHook_terminate.cpp

I am backing up to the earlier version of Trilinos that worked (shown above), which was rooted at commit:

38a2158 "Shylu/Tacho - hand made team blas for gpu"
Author: Kyungjoo Kim <kyukim@sandia.gov>
Date:   Wed Jan 3 14:46:51 2018 -0700 (6 days ago)

A       packages/shylu/shylu_node/tacho/src/TachoExp_Blas_Team.hpp
M       packages/shylu/shylu_node/tacho/src/TachoExp_Util.hpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_Test.hpp
A       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestDenseLinearAlgebra.hpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestOpenMP_double.cpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_dcomplex.cpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp

and posting to CDash at:

I tested this and it passed the Kokkos build so this should build and link just fine.

This is a good example of why we need this automated build and simple instructions that any SNL Trilinos developer can use to reproduce build problems.

mhoemmen commented 6 years ago

@bartlettroscoe That test passed perfectly fine on other platforms, but yes, I second the need for automated testing. Did you actually revert the commit or just disable the failing test?

bartlettroscoe commented 6 years ago

I ran the full build and test of Trilinos with GCC 7.2.0 with the SPACK-built OpenBLAS BLAS and LAPACK and -DEpetraExt_ENABLE_HDF5=OFF which submitted to:

This built and passed all of the tests (see details below) but we are still seeing link warnings:

/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4

I believe the reason for this is that some of the other existing ATDM SEMS TPLs (likely SuperLU, and perhaps others) are built against the default system BLAS and LAPACK instead of the new SPACK-built OpenBLAS BLAS and LAPACK implementations built with GCC 7.2.0. This means that to get rid of all of these link warnings we would need to rebuild all of the downstream TPLs that depend on BLAS and/or LAPACK.
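To see which libraries are dragging in the older libgfortran, the warnings can be pulled out of a saved build log. This is only a minimal sketch: the function name `list_gfortran_conflicts` is made up for illustration, and it assumes the warning text has exactly the `ld` format quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trilinos or TriBITS).
# Reads build output on stdin and prints, once each, every library that ld
# reports as needing a libgfortran that may conflict with the one in use.
list_gfortran_conflicts() {
  sed -n 's/^.*warning: libgfortran\.so\.[0-9]*, needed by \([^,]*\), may conflict with libgfortran\.so\.[0-9]*.*$/\1/p' \
    | sort -u
}
```

For example, `list_gfortran_conflicts < make.dashboard.out` would list the offending libraries found in that log. For the warning quoted above, that is `/usr/lib64/liblapack.so`.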

So the question is, should we work to fix these link warnings right now or just get this GCC 7.2.0 build up and going to start protecting Trilinos with a basic GCC 7.2.0 build? Given that the Trilinos build for GCC 7.2.0 was just broken by an update to Trilinos (as shown above), I think it makes sense to just get this build running for now with the link warnings. Then later we can rebuild the TPLs with BLAS and LAPACK and eliminate these link warnings.

P.S. The other thing is that we really need to be using CMake 3.10.0 so that we can use the all-at-once configure, build, test, and submit but partition the output on the new CDash site. That would make it much more readable.

DETAILED NOTES (Click to expand)

The rooted version of Trilinos was:

```
38a2158 "Shylu/Tacho - hand made team blas for gpu"
Author: Kyungjoo Kim
Date:   Wed Jan 3 14:46:51 2018 -0700 (7 days ago)

A       packages/shylu/shylu_node/tacho/src/TachoExp_Blas_Team.hpp
M       packages/shylu/shylu_node/tacho/src/TachoExp_Util.hpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_Test.hpp
A       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestDenseLinearAlgebra.hpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestOpenMP_double.cpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_dcomplex.cpp
M       packages/shylu/shylu_node/tacho/unit-test/Tacho_TestSerial_double.cpp
```

(newer versions of Trilinos fail to build as described [above](https://github.com/trilinos/Trilinos/issues/2028#issuecomment-356811082)).

I did the configure and submit to CDash using:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/
$ module purge
$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load_atdm_7.2_dev_env.sh
$ module list
Currently Loaded Modulefiles:
  1) sems-env                   2) atdm-env                   3) sems-python/2.7.9
  4) sems-git/2.10.1            5) atdm-gcc/7.2.0             6) atdm-openmpi/1.6.5/atdm
  7) atdm-boost/1.63.0/atdm     8) atdm-zlib/1.2.8/atdm       9) atdm-hdf5/1.8.12/atdm
 10) atdm-netcdf/4.4.1/atdm    11) atdm-parmetis/4.0.3/atdm  12) atdm-scotch/6.0.3/atdm
 13) atdm-superlu/4.3/atdm     14) sems-cmake/3.5.2
$ export SPACK_LAPACK_LIB_DIR=/scratch/rabartl/SPACK.base/spack/opt/spack/linux-rhel6-x86_64/gcc-7.2.0/openblas-0.2.20-ex5gwnefywt4wbfyhuehdeh3ds6kj63q/lib
$ time ./do-configure -DEpetraExt_ENABLE_HDF5=OFF -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DTPL_BLAS_LIBRARIES=$SPACK_LAPACK_LIB_DIR/libopenblas_sandybridge-r0.2.20.so \
  -DTPL_LAPACK_LIBRARIES=$SPACK_LAPACK_LIB_DIR/libopenblas_sandybridge-r0.2.20.so \
  &> configure.out

real    2m59.641s
user    2m21.178s
sys     0m27.080s

$ time env Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE make dashboard &> make.dashboard.out

real    237m46.915s
user    1642m37.715s
sys     83m58.528s
```

This posted results to:

* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180111-0336-Experimental

The good news is that this passed the build and all 2605 of the tests. The bad news is that we are still seeing link warnings at:

* https://testing-vm.sandia.gov/cdash/viewBuildError.php?type=1&buildid=3254827

showing:

```
/usr/bin/ld: warning: libgfortran.so.3, needed by /usr/lib64/liblapack.so, may conflict with libgfortran.so.4
```

I believe the reason for this is that some of the other existing ATDM SEMS TPLs (likely SuperLU, and perhaps others) are built against the default system BLAS and LAPACK instead of the new SPACK-built OpenBLAS BLAS and LAPACK implementations built with GCC 7.2.0. This means that for this to work, all of the other TPLs that depend on BLAS and LAPACK will need to be rebuilt.
mhoemmen commented 6 years ago

@bartlettroscoe wrote:

... [S]hould we work to fix these link warnings right now or just get this GCC 7.2.0 build up and going to start protecting Trilinos with a basic GCC 7.2.0 build?

The latter, please :-). Thanks Ross!

bartlettroscoe commented 6 years ago

Current feedback from SEMS is that they will need to take the issue of officially supporting builds of BLAS and LAPACK to the SEMS Stewards. Therefore, I think we should move ahead and just get this GCC 7.2.0 build going up to the Specialized track on the CDash site so we can get it cleaned up again (and then move it to Nightly?).

Longer term, we need to rebuild the TPLs from source against BLAS and LAPACK built with GCC 7.2.0. One option is to just use SPACK to build everything from GCC 7.2.0 on up for the TPLs (including BLAS and LAPACK) that we need and bypass the SEMS TPL installation process. We could put this under the ATDM project area on the mounted SEMS NFS drive. I think SPACK supports modules so that might be an easy solution and would have the added benefit that people could build these envs on non-SNL machines.

bartlettroscoe commented 6 years ago

I rebased the branch atdm-gcc-7.2.0-2028 on top of develop, added a new *.cmake file for the special configuration options for this build and disabled the build and run of the failing KokkosCore_UnitTest_PushFinalizeHook test and pushed to the remote:

To github.com:bartlettroscoe/Trilinos.git
 + c677baf...46cf207 atdm-gcc-7.2.0-2028 -> atdm-gcc-7.2.0-2028 (forced update)

I then tested the configure, build, and test of Kokkos, Teuchos, and EpetraExt and posted to CDash with make dashboard to:

This is now ready to use to build a CTest -S driver script then run it with Jenkins. That should be easy.

DETAILED NOTES (Click to expand)

Now I can get ready to create a CTest -S script. I want to put all settings into `*.cmake` files. The do-configure script is:

```
#!/bin/bash
cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/sems/atdm/SEMSATDMSettings.cmake,cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \
-DDART_TESTING_TIMEOUT:STRING=300.0 \
-DTrilinos_ENABLE_TESTS:BOOL=ON \
-DCTEST_BUILD_FLAGS=-j10 \
-DCTEST_PARALLEL_LEVEL=10 \
"$@" \
../../../Trilinos
```

I rebased the branch `atdm-gcc-7.2.0-2028` on top of the branch `develop`, which now has a broken build of a KokkosCore test, so I disabled that test in the file `Trilinos/cmake/std/sems/atdm/SEMSATDMSettings.cmake`.

The Trilinos version on the rebased branch `atdm-gcc-7.2.0-2028` is:

```
46cf207 "Disable KokkosCore_UnitTest_PushFinalizeHook build and run (#2028)"
Author: Roscoe A. Bartlett
Date:   Thu Jan 11 18:28:57 2018 -0700 (15 minutes ago)

M       cmake/std/sems/atdm/SEMSATDMSettings.cmake
```

The configure, build, and test of a few packages:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/
$ source /scratch/rabartl/Trilinos.base/Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh
$ rm -r CMake*
$ time ./do-configure \
  -DTrilinos_ENABLE_Teuchos=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  -DTrilinos_ENABLE_EpetraExt=ON \
  &> configure.out

real    0m27.325s
user    0m10.687s
sys     0m4.832s

$ time make -j16 &> make.out

real    2m0.551s
user    16m46.994s
sys     3m39.221s

$ time ctest -j16 &> ctest.out

real    0m18.262s
user    0m48.984s
sys     0m11.042s
```

And the tests passed and gave:

```
100% tests passed, 0 tests failed out of 169

Label Time Summary:
EpetraExt    =  15.87 sec (10 tests)
Kokkos       =  36.81 sec (22 tests)
Teuchos      =  66.96 sec (137 tests)

Total Test time (real) =  18.25 sec
```

So that looks like a good configuration of Trilinos!

Just for good measure, I also submitted to CDash with:

```
$ time make dashboard &> make.dashboard.out

real    1m45.500s
user    1m25.496s
sys     0m45.976s
```

This submitted to:

* https://testing.sandia.gov/cdash/index.php?project=Trilinos&showfilters=1&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180112-0147-Experimental

Now to turn this into a CTest -S driver script and test it out ...
bartlettroscoe commented 6 years ago

I created the basic driver scripts:

and created a new *.cmake file for extra configuration options:

I ran the script drive_linux_mpi_sems_atdm_7.2.0.sh locally (as shown in details below) and it submitted to:

Interestingly, there are three new failing Tempus tests that were not there before. I will create a new GitHub issue for that in a bit.

DETAILED NOTES (Click to expand)

I developed the CTest -S driver scripts and for the first test I do:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/
$ mkdir MOCK_GCC-7.2.0-MPI_RELEASE_ADTM_BASE
$ cd MOCK_GCC-7.2.0-MPI_RELEASE_ADTM_BASE/
$ mkdir MOCK_GCC-7.2.0-MPI_RELEASE_ADTM
$ cd MOCK_GCC-7.2.0-MPI_RELEASE_ADTM/
$ ln -s /scratch/rabartl/Trilinos.base/Trilinos .
$ env \
  CTEST_TEST_TYPE=Experimental \
  CTEST_DO_SUBMIT=OFF \
  CTEST_DO_UPDATES=OFF \
  CTEST_START_WITH_EMPTY_BINARY_DIRECTORY=TRUE \
  Trilinos_PACKAGES=Kokkos,Teuchos,EpetraExt \
  /scratch/rabartl/Trilinos.base/Trilinos/cmake/ctest/drivers/atdm/drive_linux_mpi_sems_atdm_7.2.0.sh \
  &> console.out
```

Here, I had to create the directory `MOCK_GCC-7.2.0-MPI_RELEASE_ADTM` first and symlink in the Trilinos source directory as per the instructions in Step "5. Test CTest -S driver scripts" at:

* https://tribits.org/doc/TribitsDevelopersGuide.html#how-to-submit-testing-results-to-a-cdash-site

After some debugging, I got this running with:

```
$ env \
  CTEST_TEST_TYPE=Experimental \
  CTEST_DO_SUBMIT=ON \
  CTEST_DO_UPDATES=OFF \
  CTEST_START_WITH_EMPTY_BINARY_DIRECTORY=TRUE \
  Trilinos_PACKAGES=Kokkos,Teuchos,EpetraExt \
  /scratch/rabartl/Trilinos.base/Trilinos/cmake/ctest/drivers/atdm/drive_linux_mpi_sems_atdm_7.2.0.sh \
  &> console.out
```

which submitted to:

* https://testing.sandia.gov/cdash/index.php?project=Trilinos&showfilters=1&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180112-1338-Experimental

So I think this is ready to clean up the commits and push to the main 'develop' branch. Then I can run the script for real and let it clone and do everything.

ToDo:

* Clean up commits, rebase on top of `develop`, and force push the branch [Done]
* Merge the branch into `develop`, test (if any testing is needed) and push [Done]

Now I will run the script the same way that a Jenkins job would run the script:

```
$ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MOCK_GCC-7.2.0-MPI_RELEASE_ADTM_BASE/
$ rm -r MOCK_GCC-7.2.0-MPI_RELEASE_ADTM
$ time env \
  /scratch/rabartl/Trilinos.base/Trilinos/cmake/ctest/drivers/atdm/drive_linux_mpi_sems_atdm_7.2.0.sh \
  &> console.out

real    366m0.895s
user    2089m45.734s
sys     146m52.689s
```

This posted to:

* https://testing.sandia.gov/cdash/index.php?project=Trilinos&showfilters=1&filtercount=3&showfilters=1&filtercombine=and&field1=site&compare1=61&value1=ceerws1113&field2=buildname&compare2=61&value2=Linux-GCC-7.2.0-MPI_RELEASE_ADTM&field3=buildstamp&compare3=61&value3=20180112-0400-Specialized
bartlettroscoe commented 6 years ago

@fryeguy52 set up the Jenkins job to drive this build on the SEMS SRN Build Farm. However, it resulted in all failed configures shown here:

The configure failures said that it was missing the file:

-- Reading in configuration options from cmake/std/sems/atdm/SEMSATDMSettings.cmake ...
CMake Error at /jenkins/slave/workspace/Trilinos_gcc-7-2-0_atdm/Trilinos/cmake/tribits/core/package_arch/TribitsGlobalMacros.cmake:167 (INCLUDE):
  INCLUDE could not find load file:

    cmake/std/sems/atdm/SEMSATDMSettings.cmake

Turns out the problem was that for some reason the inner driver cloned the "nightly" git repo software.sandia.gov:/space/git/nightly/Trilinos instead of the GitHub repo. Therefore, the develop branch was not yet up to date in that repo, so the new commits did not exist there.

I switched the Jenkins job Trilinos_gcc-7-2-0_atdm to also clone the "nightly" git repo for Trilinos so the outer and inner Trilinos repos always match up. Therefore, the build should run tonight and submit to CDash.
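A mismatch like this can be caught before the configure step by checking that every file named in `Trilinos_CONFIGURE_OPTIONS_FILE` actually exists in the cloned source tree. The following is a rough sketch only: `missing_options_files` is a hypothetical helper, not something in Trilinos or TriBITS, and it assumes the options files are given as paths relative to the source directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-configure check (not part of Trilinos or TriBITS).
# Usage: missing_options_files <source-dir> <comma-separated-file-list>
# Prints each listed file that does not exist under <source-dir>;
# prints nothing when all of them are present.
missing_options_files() {
  local src_dir="$1" list="$2" f
  local IFS=','            # split the list on commas only
  for f in $list; do
    [ -f "$src_dir/$f" ] || printf '%s\n' "$f"
  done
}
```

A driver could call this on the cloned tree and stop early, with a message naming the missing file, instead of failing later inside the CMake `INCLUDE`.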

bartlettroscoe commented 6 years ago

Looks like, as predicted above, the Jenkins job Trilinos_gcc-7-2-0_atdm ran just fine and posted correct output to CDash at:

This showed just two failing Tempus tests this time. (I will create a GitHub issue for those.)

The only problem is that I misspelled "ATDM" as "ADTM". I just fixed that with the commit 2ac807d so it should be spelled right tomorrow.

I will also create follow-on issues to get the TPLs rebuilt with BLAS and LAPACK (to eliminate link warnings with libgfortran) and then build newer versions of the TPLs that were requested by the ATDM APPs in https://software-srn.sandia.gov/jira/browse/CDOFA-24. Then I will descope this Issue and put it in review.

bartlettroscoe commented 6 years ago

Several Tempus tests are timing out as shown at:

I am wondering why this is occurring. How does Jenkins know how many cores are used in each build? Is that the Jenkins "Job Weight" parameter? If so, it is currently set to "6" for the Trilinos_gcc-7-2-0_atdm job. But I know that this build takes 10 cores in the build and running the tests (see the file Trilinos/cmake/ctest/drivers/atdm/ctest_linux_mpi_sems_atdm_7.2.0.cmake). Therefore, I will change "Job Weight" to "10". Looking at the chart:

it looks like the Jenkins slave machine gretel was getting fully loaded (20 cores) from about 21:49 to about 04:03. This must be the cause of the timeouts of the Tempus tests.

@fryeguy52 and @trilinos/framework,

How can we make sure that all of the Jenkins jobs that are running on these machines have the correct "Job Weight" property so that Jenkins does not overload its slave machines? The "Job Weight" value was set incorrectly for our job so how can we determine if it is being set correctly for other jobs as well?

bartlettroscoe commented 6 years ago

Looks like the machine hansel is getting overloaded for periods of time as well:

I sent the following email to see if we can investigate this issue more and see what can be done.

Otherwise, I will just increase the default timeout from 300s to 600s and see if it makes the timeouts go away.


From: trilinos-framework-bounces@software.sandia.gov [mailto:trilinos-framework-bounces@software.sandia.gov] On Behalf Of Bartlett, Roscoe A
Sent: Monday, January 15, 2018 10:12 AM
To: Frye, Joe; trilinos-framework@software.sandia.gov
Subject: [Trilinos-Framework] Jenkins jobs overloading slave machines?

Hello Joe and Trilinos Framework team members,

It seems that Jenkins is overloading Jenkins slave machines and causing timeouts (see https://github.com/trilinos/Trilinos/issues/2028#issuecomment-357706804). It seems there is a “Job Weight” setting that is supposed to tell Jenkins how many cores a job will use (kind of like the PROCESSORS CTest property). I think this caused a bunch of timeouts on the new GCC 7.2.0 build run on the machine Gretel. There are a bunch of other jobs that are being run there as well as shown at https://jenkins-srn.sandia.gov/computer/gretel/builds . Jenkins has to be set up to not fully load (or worse overload) a test machine or multi-process MPI jobs will take much longer to run and will cause timeouts like this.

How can we get to the bottom of this?

Thanks,

-Ross

bartlettroscoe commented 6 years ago

I discussed this with @jwillenbring and @fryeguy52 and one suggestion was to increase the "Job Weight" value to make sure that this job takes up an entire machine. My concern with doing that is that I am afraid that the job may not be scheduled at all. I think we need a better strategy to manage these Jenkins build machines. I will bring this up at the next CDOFA meeting.

bartlettroscoe commented 6 years ago

It looks like increasing the timeout limit from 300s to 600s (i.e. 10 minutes) fixed all of the timeouts we were seeing. The build Linux-GCC-7.2.0-MPI_RELEASE_ATDM today shown at:

has all passing tests. Digging deeper and looking at the test times shown here, you can see that the most expensive test was Tempus_DIRK_Combined_FSA_MPI_1 at 7m 43s 950ms. When I ran the full test suite on an older version of Trilinos on my machine ceerws1113 and posted results to:

the test times were much smaller as shown here and the same test Tempus_DIRK_Combined_FSA_MPI_1 has the time 5m 20s 410ms. That is roughly a 45% increase in the runtime for the test (463.95 s versus 320.41 s). My guess is that if the machine is overloaded, we will see a lot of fluctuation in these test times over the coming days.
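The slowdown between the two runs can be checked directly from the CDash durations. Measured against the faster run on ceerws1113, it comes out to roughly 45%. A small sketch, where `to_seconds` and `percent_increase` are hypothetical helper names and the duration format is assumed to be the "Xm Ys Zms" form shown on CDash:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing CDash test times.

# Convert a duration such as "7m 43s 950ms" to seconds (two decimals).
to_seconds() {
  LC_ALL=C awk -v d="$1" 'BEGIN {
    n = split(d, parts, " "); total = 0
    for (i = 1; i <= n; i++) {
      p = parts[i]
      if (p ~ /ms$/)     { sub(/ms$/, "", p); total += p / 1000 }  # milliseconds
      else if (p ~ /m$/) { sub(/m$/, "", p);  total += p * 60 }    # minutes
      else if (p ~ /s$/) { sub(/s$/, "", p);  total += p }         # seconds
    }
    printf "%.2f\n", total
  }'
}

# Percent increase of the second value relative to the first (baseline).
percent_increase() {
  LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

base="$(to_seconds '5m 20s 410ms')"   # run on ceerws1113
slow="$(to_seconds '7m 43s 950ms')"   # run on the Jenkins build
percent_increase "$base" "$slow"      # prints 44.8
```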

In any case, this build is now ready to elevate from Specialized to some CDash Track/Group that will send emails. As discussed in #1293, that can't be the Nightly group because that group does not send out any CDash emails. We could move this into the Clean group but I am not sure that is the right thing to do. I will suggest adding an ATDM group that will send out emails and then send it there.

bartlettroscoe commented 6 years ago

Our current GCC 7.2.0 build is all passing but it is not enabling or running with OpenMP. Should it be? See https://github.com/trilinos/Trilinos/issues/2130#issuecomment-358081448.

bartlettroscoe commented 6 years ago

Feedback from the EMPIRE ATDM APP lead is that we should be enabling OpenMP and testing with OMP_NUM_THREADS=4. I am trying that now.

bartlettroscoe commented 6 years ago

I tried the simple enable of Trilinos_ENABLE_OpenMP=ON in the trial commit 3b609f0ad2f2ff9d253ceec058bb982b5812b93c pushed to the branch 2028-enable-openmp. I did an all-at-once submit to CDash which is shown (on the trial CDash site that supports the new all-at-once method) at:

and details are shown below.

This trial build showed a single Panzer build failure, but the major problem was that almost all of the tests in downstream packages died on startup due to missing instantiations of Tpetra functions for the Serial node type (different tests showed different missing function definitions). It is not clear why the linker did not even warn about these missing symbols. In any case, the straightforward enable of OpenMP is not working at all. Getting an OpenMP build working should be a separate issue. Also, while other OpenMP builds are failing on CDash, we should likely wait until those builds are worked out before pushing on this with a GCC 7.2.0 build.
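To see at a glance which instantiations are missing, the mangled names can be collected from saved test output and passed through `c++filt` (from GNU binutils). A minimal sketch, assuming the output uses the loader's "undefined symbol:" message format. `undefined_symbols` is a hypothetical helper name and `test_output.log` a placeholder file name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each distinct mangled name that the dynamic
# loader reported as "undefined symbol" in the text read from stdin.
undefined_symbols() {
  sed -n 's/^.*undefined symbol: \([A-Za-z0-9_]*\).*$/\1/p' | sort -u
}

# Example use on a saved log (placeholder file name):
#   undefined_symbols < test_output.log | c++filt
```

Because each MPI rank prints the same error, `sort -u` collapses the repeats so that every missing function shows up only once.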

DETAILED NOTES (Click to expand) Enabling OpenMP with: ``` commit 3b609f0ad2f2ff9d253ceec058bb982b5812b93c Author: Roscoe A. Bartlett Date: Tue Jan 16 14:28:58 2018 -0700 Enable OpenMP for ATDM GCC 7.2.0 build (#2028) Word from Matt B. is that OpenMP should be enabled for this build. diff --git a/cmake/std/sems/atdm/SEMSATDMSettings.cmake b/cmake/std/sems/atdm/SEMSATDMSettings.cmake index 4f2a395..0a84f8c 100644 --- a/cmake/std/sems/atdm/SEMSATDMSettings.cmake +++ b/cmake/std/sems/atdm/SEMSATDMSettings.cmake @@ -2,6 +2,8 @@ # These are special setting for the ATDM configuration of Trilinos using the SEMS # +SET(${PROJECT_NAME}_ENABLE_OpenMP ON CACHE BOOL "Set in SEMSATDMSettings.cmake") + # ATDM builds of Trilinos don't need HDF5 support in EpetraExt and this avoids # a build error with GCC 7.2.0 (see #2080) SET(EpetraExt_ENABLE_HDF5 OFF CACHE BOOL "Set in SEMSATDMSettings.cmake") ``` Using the `do-configure` script: ``` #!/bin/bash cmake \ -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/sems/atdm/SEMSATDMSettings.cmake,cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake \ -DDART_TESTING_TIMEOUT:STRING=300.0 \ -DTrilinos_ENABLE_TESTS:BOOL=ON \ -DCTEST_BUILD_FLAGS=-j10 \ -DCTEST_PARALLEL_LEVEL=10 \ "$@" \ ../../../Trilinos ``` on ceerws1113: ``` $ cd /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/ $ . 
/scratch/rabartl/Trilinos.base/Trilinos/cmake/std/sems/atdm/load_atdm_7.2_dev_env.sh $ rm -r CMake* $ time ./do-configure -DTrilinos_ENABLE_Kokkos=ON &> configure.out real 0m14.559s user 0m6.877s sys 0m4.336s ``` This showed in the cmake STDOUT: ``` -- ****************** Kokkos Settings ****************** -- Execution Spaces -- Device Parallel: None -- Host Parallel: OpenMP -- Host Serial: Serial -- -- Architectures: -- None -- -- Enabled options -- KOKKOS_ENABLE_PROFILING -- -- Final kokkos settings variable: -- env;KOKKOS_SRC_PATH=/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos;KOKKOS_PATH=/scratch/rabartl/Trilinos.base/Trilinos/packages/kokkos;KOKKOS_INSTALL_PATH=/usr/local;KOKKOS_ARCH=None;KOKKOS_DEVICES=OpenMP,Serial;KOKKOS_DEBUG=no;KOKKOS_OPTIONS=disable_dualview_modify_check ``` Then I build and test with: ``` $ time make -j16 &> make.out real 1m44.436s user 20m57.566s sys 1m11.163s $ time env OMP_NUM_THREADS=4 ctest -j10 &> ctest.out real 0m21.606s user 2m2.438s sys 0m2.231s ``` That returned all passing tests: ``` 100% tests passed, 0 tests failed out of 24 Subproject Time Summary: Kokkos = 79.69 sec*proc (24 tests) Total Test time (real) = 21.21 sec ``` Now to try all of the Trilinos packages: ``` $ rm -r CMake* $ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out real 3m32.101s user 2m23.608s sys 0m48.344s $ time env OMP_NUM_THREADS=4 Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE make dashboard &> make.dashboard.out real 0m29.282s user 0m27.576s sys 0m1.092s ``` This is submitting to: * https://testing.sandia.gov/cdash/index.php?project=Trilinos&date=2018-01-16&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180116-2223-Experimental This is showing a strange build failure for Panzer for the file`packages/panzer/disc-fe/src/Panzer_BCStrategy_Dirichlet_DefaultImpl.cpp` at: * https://testing.sandia.gov/cdash/viewBuildError.php?buildid=3333831 which shows: ``` In file included from 
/scratch/rabartl/Trilinos.base/Trilinos/packages/panzer/disc-fe/src/Panzer_BCStrategy_Dirichlet_DefaultImpl_impl.hpp:68:0,
                 from /scratch/rabartl/Trilinos.base/Trilinos/packages/panzer/disc-fe/src/Panzer_BCStrategy_Dirichlet_DefaultImpl.cpp:48:
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/panzer/disc-fe/src/Panzer_DOF.hpp:1:10: fatal error: Panzer_DOF_decl.hpp: No such file or directory
 #include "Panzer_DOF_decl.hpp"
          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
```

That build generated 842 test failures and 11 not-run tests! There were failing tests in lots of packages. I will run this again with the new all-at-once features enabled so that we can see the results better partitioned out on CDash.

I looked at several of these failing tests and they die right away when trying to load the shared libs, with errors like this one for the test [Anasazi_MultiVecTraitsTest2_MPI_4](https://testing.sandia.gov/cdash/testDetails.php?test=44026018&build=3333831):

```
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: undefined symbol: _ZNK6Tpetra11MpiPlatformIN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE7getCommEv
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: undefined symbol: _ZNK6Tpetra11MpiPlatformIN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE7getCommEv
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/anasazi/tpetra/test/MVOPTester/Anasazi_Tpetra_MVOPTester.exe: undefined symbol: _ZNK6Tpetra11MpiPlatformIN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE7getCommEv
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ceerws1113 and rank 0!
***
*** Unit test suite ...
***
Sorting tests by group name then by the order they were added ... (time = 9.2e-05)
Running unit tests ...
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ceerws1113 and rank 1!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ceerws1113 and rank 2!
Teuchos::GlobalMPISession::GlobalMPISession(): started processor with name ceerws1113 and rank 3!
--------------------------------------------------------------------------
mpiexec has exited due to process rank 0 with PID 42859 on
node ceerws1113 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
```

And the test [MueLu_BlockedTransfer_Tpetra_MPI_4](https://testing.sandia.gov/cdash/testDetails.php?test=44024615&build=3333831) shows:

```
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/muelu/test/blockedtransfer/MueLu_BlockedTransfer.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/ifpack2/src/libifpack2.so.12: undefined symbol: _ZNK6Tpetra9RowMatrixIdiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE4packERKN7Teuchos9ArrayViewIKiEERNS8_5ArrayIcEERKNS9_ImEERmRNS_11DistributorE
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/muelu/test/blockedtransfer/MueLu_BlockedTransfer.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/ifpack2/src/libifpack2.so.12: undefined symbol: _ZNK6Tpetra9RowMatrixIdiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE4packERKN7Teuchos9ArrayViewIKiEERNS8_5ArrayIcEERKNS9_ImEERmRNS_11DistributorE
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/muelu/test/blockedtransfer/MueLu_BlockedTransfer.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/ifpack2/src/libifpack2.so.12: undefined symbol: _ZNK6Tpetra9RowMatrixIdiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE4packERKN7Teuchos9ArrayViewIKiEERNS8_5ArrayIcEERKNS9_ImEERmRNS_11DistributorE
/scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/muelu/test/blockedtransfer/MueLu_BlockedTransfer.exe: symbol lookup error: /scratch/rabartl/Trilinos.base/BUILDS/GCC-7.2.0/MPI_RELEASE_DEBUG_SHARED_PT/packages/ifpack2/src/libifpack2.so.12: undefined symbol: _ZNK6Tpetra9RowMatrixIdiiN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE4packERKN7Teuchos9ArrayViewIKiEERNS8_5ArrayIcEERKNS9_ImEERmRNS_11DistributorE
--------------------------------------------------------------------------
mpiexec noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
```

In each case I looked at, it seems there is something missing with the Serial instantiation.

I will go ahead and submit this with the full all-at-once features so we can see results broken out on the new trial CDash site:

```
$ time ./do-configure -DTrilinos_ENABLE_ALL_PACKAGES=ON &> configure.out

real    2m54.584s
user    2m8.361s
sys     0m40.380s

$ time env OMP_NUM_THREADS=4 \
  Trilinos_CTEST_DO_ALL_AT_ONCE=TRUE \
  Trilinos_CTEST_USE_NEW_AAO_FEATURES=ON \
  make dashboard &> make.dashboard.out

real    33m10.668s
user    77m29.702s
sys     17m27.000s
```

This is submitting to:

* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&date=2018-01-16&filtercount=1&showfilters=1&field1=buildstamp&compare1=61&value1=20180117-0022-Experimental

This time it is broken down nicely package-by-package:

* https://testing-vm.sandia.gov/cdash/index.php?project=Trilinos&parentid=3265070

Now we can really see the breakdown of which packages are failing, and they start with two failing Tpetra tests:

* https://testing-vm.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=3265120

These don't show the missing-symbol errors that the later package test failures show, but they do die with segfaults.

What is strange is that the MueLu build does not show any build warnings, yet we have all of these problems with missing symbols crashing the tests. Very strange.

This build with OpenMP enabled is just a bit of a mess. I think I will try one last thing and try to disable the Serial node and see what happens.

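With hundreds of failing tests, it can help to see how many distinct symbols are actually missing before opening individual test pages. The following is a minimal sketch, assuming the test output has been saved to a plain-text log; the function name `summarize_undefined_symbols` and the log file name in the usage note are hypothetical, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Summarize "undefined symbol" loader errors found in a saved test log.
# Prints one line per distinct mangled symbol, prefixed by its count,
# with the most frequent symbol first.
summarize_undefined_symbols() {
  log_file="$1"
  # Extract only the mangled name that follows "undefined symbol: ".
  # Requiring at least one symbol character skips lines where the
  # name was wrapped onto the next line of the log.
  grep -o 'undefined symbol: [A-Za-z0-9_][A-Za-z0-9_]*' "$log_file" \
    | sed 's/^undefined symbol: //' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

For example, `summarize_undefined_symbols make.dashboard.out | c++filt` would print the counts with demangled names where binutils is installed. If nearly all of the failures collapse to a handful of symbols on the Serial node type, that points at one missing instantiation rather than many unrelated problems.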
bartlettroscoe commented 6 years ago

FYI: I renamed the Jenkins build to Trilinos-atdm-sems-gcc-7-2-0. This is more consistent with the other ATDM builds we are setting up.

mhoemmen commented 6 years ago

@bartlettroscoe Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.

bartlettroscoe commented 6 years ago

> @bartlettroscoe Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.

Is this fixable? Are there any automated builds showing this problem on the Trilinos CDash site:

bartlettroscoe commented 6 years ago

> Kokkos changed its configuration recently in such a way as to disable all but one Tpetra execution space instantiation by default.

@mhoemmen,

Is this the reason for the OpenMP build failures that I reported above?

mhoemmen commented 6 years ago

@bartlettroscoe I think it could be, yes.

bartlettroscoe commented 6 years ago

The GCC 7.2.0 build Linux-GCC-7.2.0-MPI_RELEASE_ATDM had been running fine until this morning, when it ran on the machine "winstone" and failed to find BLAS:

Before that it ran on the machines 'hansel' and 'gretel' (cute). It looks like those machines both have the label "RHEL6", so I will add that label to the Jenkins job:

Hopefully this will keep this from happening again. It looks like that should fix this.

I fired off the build manually again so hopefully it will resubmit and show up clean now.
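Independently of the Jenkins label, a driver script could fail fast when it lands on a machine without the expected libraries, so the failure is obvious rather than surfacing in the middle of configure. This is only a sketch under stated assumptions: the helper name `have_library` and the directories in the usage note are hypothetical and are not taken from the actual ATDM driver scripts.

```shell
#!/bin/sh
# Return 0 if lib<name>.so or lib<name>.a exists in any of the given
# directories, and 1 otherwise.
# Usage: have_library <name> <dir> [<dir> ...]
have_library() {
  name="$1"
  shift
  for dir in "$@"; do
    for ext in so a; do
      if [ -e "$dir/lib$name.$ext" ]; then
        return 0
      fi
    done
  done
  return 1
}
```

A driver could then begin with something like `have_library blas /usr/lib64 /usr/lib || { echo "BLAS not found on $(hostname)" >&2; exit 1; }`, with the directory list adjusted to wherever BLAS is expected on the build machines.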

bartlettroscoe commented 6 years ago

I made the commit cb26a95ab884d4a3c7324a48d427cdc90f5ad1b6:

commit cb26a95ab884d4a3c7324a48d427cdc90f5ad1b6
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Jan 29 19:25:03 2018 -0700

    Moved GCC 7.2.0 files into subdir and other changes (#2028)

    * Make the CDash build name the same as the Jenkins name

    * Moved files to subdir to be consistent with other ATDM driver files and not
      clutter the base dir.

    * Remove extra repos

    * Change CDash build name to same as Jenkins name and to be more consistent
      with other ATDM build names

    * Send to ATDM track

R090    cmake/ctest/drivers/atdm/ctest_linux_mpi_sems_atdm_7.2.0.cmake  cmake/ctest/drivers/atdm/sems_gcc-7.2.0/ctest_linux_mpi_sems_atdm_7.2.0.cmake
R091    cmake/ctest/drivers/atdm/drive_linux_mpi_sems_atdm_7.2.0.sh     cmake/ctest/drivers/atdm/sems_gcc-7.2.0/drive_linux_mpi_sems_atdm_7.2.0.sh

Now I need to watch CDash tomorrow morning to make sure that the build shows up correctly. I did some testing locally so I have high hopes that this will work.

bartlettroscoe commented 6 years ago

The updated GCC 7.2.0 build showed up correctly now under the same name as the :