trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Stokhos build failure for CUDA 8.0 Debug build on white/ride #3069

Closed bartlettroscoe closed 6 years ago

bartlettroscoe commented 6 years ago

CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Lead)

Next Action Status

PR #3100 merged on 7/13/2018 resulted in 100% clean build (but not tests), including Stokhos, on 7/14/2018.

Description

The creation of the Stokhos library libstokhos_muelu.a fails in the CUDA 8.0 debug build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once on 'white' and 'ride'. The output from this morning's build on 'white' shows:

/usr/bin/ar: packages/stokhos/src/libstokhos_muelu.a: File truncated

Steps to reproduce

This build error can be reproduced on 'white' or 'ride' as described in the document:

The specific instructions for 'white' or 'ride' are given at:

The one difference is that this build enables all of the Primary Tested Trilinos packages (which includes more packages than the ATDM APPs currently use). It does not exclude any Trilinos packages and it tweaks a few other settings, so it uses the file Trilinos/cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake instead of the file ATDMDevEnv.cmake.

After cloning Trilinos, the following commands should reproduce the build failure:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  $TRILINOS_DIR

$ make NP=16

I (@bartlettroscoe) just tried this on 'white' and was able to reproduce the build failure shown on CDash above.

mhoemmen commented 6 years ago

Is this one of those situations that calls for -D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON?

bartlettroscoe commented 6 years ago

@mhoemmen said:

Is this one of those situations that calls for -D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON?

Perhaps worth trying. This library has 441 object files totaling 53854 chars.
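Numbers like these can be gathered with a small helper. The following is a minimal sketch, not something taken from this build: the function name `object_list_stats` is hypothetical, and it assumes that "total chars" means the length of the space-separated list of object file paths.

```shell
# Hypothetical helper: count the *.o files under a directory and the total
# length of the space-separated list of their paths.
object_list_stats() {
  dir="$1"
  find "$dir" -type f -name '*.o' | awk '
    { n += 1; chars += length($0) }
    END {
      # Add one separator character between consecutive paths.
      if (n > 1) chars += n - 1
      printf "%d objects, %d chars\n", n, chars
    }'
}
```

For example, `object_list_stats packages/stokhos/src/CMakeFiles/stokhos_muelu.dir` (run from the build directory) would report the list passed to `ar`. The result can be compared against the system limit printed by `getconf ARG_MAX`.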

bartlettroscoe commented 6 years ago

FYI: After some experimenting and investigation (and learning several things about CMake that I did not know), it looks like we can't use response files (*.rsp) to handle long lists of object files for this build unless we ditch both Ninja and static libs. I am running the full build now, but it looks like if we use Makefiles and shared libs, we might be able to use an *.rsp file for the creation of these libraries. We will see in a few hours whether this fixes this CUDA debug Stokhos build. However, I am finding that moving to a shared lib build is causing build errors for some reason in a few packages like Pamgen and Shards. I don't know if it makes sense to bother with CUDA builds for packages like that. I am disabling non-critical packages as we go.

bartlettroscoe commented 6 years ago

So there is good news and bad news. The good news is that you can make this (and any other) Stokhos build failure go away if you set -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON. The bad news is that you can't use Ninja, you can't use static libs, and you can't build with the Primary Tested packages Pamgen or Shards enabled. See details below.

You can't use Ninja because, in order to get CMake to use *.rsp response files, you have to set the option -DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON, which also results in *.rsp files being used to pass include directories to the compiler. The problem is that the nvcc that comes with CUDA 8.0.44, which is used in this CUDA build on white, does not seem to support the *.rsp response files that kokkos/bin/nvcc_wrapper passes through. With the current CMake, you can't turn off the usage of *.rsp files for compilation if you want to use them for object files. (But we can ask Kitware to fix this. I see no reason they can't.) Therefore, you have to use the built-in CMake Makefile generator. (This results in slower, less parallel builds, slower dependency analysis, and other disadvantages.)

You can't use static libraries (i.e. -D BUILD_SHARED_LIBS=OFF) because CMake does not use *.rsp response files when creating them: CMake uses the ar program to create static libs, and it would seem that ar does not support *.rsp response files. (But I suspect that Kitware could add support to CMake to build a static lib incrementally using multiple calls to ar.)
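The idea of building a static lib with multiple calls to ar can be sketched in shell. This is only an illustration of the batching idea, not what CMake does: the function name `archive_in_batches` is hypothetical, and it assumes object paths arrive one per line on stdin with no embedded whitespace. The archiver can be overridden through the `AR` environment variable.

```shell
# Hypothetical helper: append objects to a static archive in fixed-size
# batches so that no single `ar` command line grows too long.
# Usage: <list of object paths> | archive_in_batches ARCHIVE BATCH_SIZE
archive_in_batches() {
  archive="$1"
  batch_size="$2"
  # `ar qc` appends members (creating the archive quietly if needed);
  # `xargs -n` caps how many paths are passed to each invocation.
  xargs -n "$batch_size" "${AR:-ar}" qc "$archive"
}
```

A call such as `find packages/stokhos/src/CMakeFiles/stokhos_muelu.dir -name '*.o' | archive_in_batches libstokhos_muelu.a 100` would then issue several shorter `ar` commands. Depending on the platform, the archive index may still need to be refreshed afterwards (for example with `ranlib`).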

So, if we want to build Stokhos on this platform with a CUDA 8.0 debug build, we need to use Makefiles and shared libs with the packages Pamgen and Shards disabled.

I am now running a fuller build of Trilinos to see how that goes ...

DETAILED NOTES:

Trying the Stokhos build again but this time with `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON`:

```
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ rm -r CMake*

$ time cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

real 1m23.777s
user 0m53.432s
sys 0m11.293s

$ time ninja -j16 &> make.out

real 3m38.491s
user 19m15.804s
sys 2m24.839s
```

I tried that and the build failed in the same way. It looks like the option `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` does **not** result in response files being used and instead we still see the object files explicitly listed out as:

```
/usr/bin/ar qc packages/stokhos/src/libstokhos_muelu.a packages/stokhos/src/CMakeFiles/stokhos_muelu.dir/Stokhos_Dummy.cpp.o packages/stokhos/src/CMakeFiles/stokhos_muelu.dir/sacado/kokkos/vector/muelu/MueLu_AMGXOperator_MP_Vector_Serial.cpp.o packages/stokhos/src/CMakeFiles/stokhos_muelu.dir/sacado/kokkos/vector/muelu/MueLu_AdaptiveSaMLParameterListInterpreter_MP_Vector_Serial.cpp.o packages/stokhos/src/CMakeFiles/stokhos_muelu.dir/sacado/kokkos/vector/muelu/MueLu_AggregationExportFactory_MP_Vector_Serial.cpp.o packages/stokhos/src/CMakeFiles/stokhos_muelu.dir/sacado/kokkos/vector/muelu/MueLu_AlgebraicPermutationStrategy_MP_Vector_Serial.cpp.o ...
/usr/bin/ar: packages/stokhos/src/libstokhos_muelu.a: File truncated
```

What about the link of other executables with object files? Does it use response files for those? I tested this with:

```
$ make VERBOSE=1 TeuchosCore_Range1D_UnitTest
ninja -C ../../../../.. -v TeuchosCore_Range1D_UnitTest
ninja: Entering directory `../../../../..'
...
[3/3] : && /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -g -O0 -rdynamic packages/teuchos/core/test/UnitTest/CMakeFiles/TeuchosCore_Range1D_UnitTest.dir/Range1D_UnitTests.cpp.o packages/teuchos/core/test/UnitTest/CMakeFiles/TeuchosCore_Range1D_UnitTest.dir/Teuchos_StandardUnitTestMain.cpp.o -o packages/teuchos/core/test/UnitTest/TeuchosCore_Range1D_UnitTest.exe packages/teuchos/core/src/libteuchoscore.a packages/kokkos/core/src/libkokkoscore.a -ldl /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcudart.so /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcublas.so /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcufft.so && :
```

So that directly lists object files, so the option `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` does not seem to be causing this build to use `*.rsp` files. I grepped the `CMakeCache.txt` file and it shows:

```
$ grep CMAKE_CXX_USE_RESPONSE CMakeCache.txt
CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS:UNINITIALIZED=ON
```

So what is up with this? I tried using the Makefile generator and it did not result in using `*.rsp` files either.

I went over to my machine 'crf450' and did a rhel6 gnu-debug-openmp build from there with shared libs with:

```
$ cd /Trilinos.base/BUILDS/ATDM/GNU/GNU_DEBUG_OPENMP/

$ . load-env.sh
Hostname 'crf450.srn.sandia.gov' matches known ATDM host 'sems-rhel6' and system 'rhel6'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='gnu-debug-openmp'
Using SEMS RHEL6 compiler stack GNU to build DEBUG code with Kokkos node type OPENMP

$ rm -r CMake*

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

$ cd /packages/teuchos/core/src/

$ make -j16 VERBOSE=1
```

and it showed:

```
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/6.1.0/openmpi/1.10.1/bin/mpicxx -fPIC --std=c++11 -g -fopenmp -g -O0 -shared -Wl,-soname,libteuchoscore.so.12 -o libteuchoscore.so.12.13 @CMakeFiles/teuchoscore.dir/objects1.rsp -Wl,-rpath,/ascldap/users/rabartl/Trilinos.base/BUILDS/ATDM/GNU/GNU_DEBUG_OPENMP/packages/kokkos/core/src: ../../../kokkos/core/src/libkokkoscore.so.12.13 -ldl
```

That showed the `*.rsp` file:

```
libteuchoscore.so.12.13 @CMakeFiles/teuchoscore.dir/objects1.rsp
```

getting used.

But if I go back and build with static libs with:

```
$ rm -r CMake*

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=OFF \
  ~/Trilinos.base/Trilinos \
  &> configure.out

$ cd /packages/teuchos/core/src/

$ make -j16 VERBOSE=1
```

and it showed:

```
/usr/bin/ar qc libteuchoscore.a CMakeFiles/teuchoscore.dir/Teuchos_ArrayView.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_CWrapperSupport.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_CommandLineProcessor.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Describable.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Details_Allocator.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_GlobalMPISession.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_HashUtils.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_LabeledObject.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_PrintDouble.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Ptr.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_RCPNode.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Range1D.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_ScalarTraits.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_StrUtils.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_TabularOutputter.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_TestForException.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_TestingHelpers.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Time.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_TypeNameTraits.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_UnitTestBase.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_UnitTestRepository.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Utils.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_VerboseObject.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_VerbosityLevel.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_Workspace.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o CMakeFiles/teuchoscore.dir/Teuchos_stacktrace.cpp.o
```

So it seems that the option `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` only works with shared libraries, not static libraries. That is a problem for ATDM builds of Trilinos.

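Checking a verbose build log by eye for `@...rsp` arguments gets tedious. As a rough sketch, a helper like the following could count the commands that used one. The name `count_rsp_commands` is hypothetical, and it assumes the verbose output was captured to a file (for example `make.out`).

```shell
# Hypothetical helper: count the lines of a verbose build log that pass a
# response file, i.e. an argument of the form @some/path.rsp.
count_rsp_commands() {
  log="$1"
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so `|| true` keeps the helper usable under `set -e`.
  grep -c -E '(^|[[:space:]])@[^[:space:]]+\.rsp([[:space:]]|$)' "$log" || true
}
```

A result of `0` for a log that contains library creation commands would indicate that the object files were listed explicitly, as in the `ar` command above.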
But what about Ninja builds?

```
$ rm -r CMake*

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

$ cd /packages/teuchos/core/src/

$ make NP=16 VERBOSE=1
```

That showed:

```
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/6.1.0/openmpi/1.10.1/bin/mpicxx -fPIC --std=c++11 -g -fopenmp -g -O0 -shared -Wl,-soname,libteuchoscore.so.12 -o packages/teuchos/core/src/libteuchoscore.so.12.13 packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_ArrayView.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_CWrapperSupport.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_CommandLineProcessor.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Describable.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Details_Allocator.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_GlobalMPISession.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_HashUtils.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_LabeledObject.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_PrintDouble.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Ptr.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_RCPNode.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Range1D.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_ScalarTraits.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_StrUtils.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_TabularOutputter.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_TestForException.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_TestingHelpers.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Time.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_TypeNameTraits.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_UnitTestBase.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_UnitTestRepository.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Utils.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_VerboseObject.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_VerbosityLevel.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_Workspace.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_stacktrace.cpp.o -Wl,-rpath,/ascldap/users/rabartl/Trilinos.base/BUILDS/ATDM/GNU/GNU_DEBUG_OPENMP/packages/kokkos/core/src: packages/kokkos/core/src/libkokkoscore.so.12.13 -ldl
```

Okay, so the option `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` does not work with Ninja. But Brad King at Kitware says to try `-DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON` with Ninja.

So I shall:

```
$ rm -r CMake*

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Teuchos=ON \
  -DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

$ cd /packages/teuchos/core/

$ make NP=16 VERBOSE=1
```

That produced the library creation:

```
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/6.1.0/openmpi/1.10.1/bin/mpicxx -fPIC --std=c++11 -g -fopenmp -g -O0 -shared -Wl,-soname,libteuchoscore.so.12 -o packages/teuchos/core/src/libteuchoscore.so.12.13 @CMakeFiles/teuchoscore.rsp
```

I also saw compile lines like:

```
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/6.1.0/openmpi/1.10.1/bin/mpicxx @packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o.rsp -MD -MT packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o -MF packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o.d -o packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/teuchos/core/src/Teuchos_dyn_cast.cpp
```

and executable links like:

```
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/6.1.0/openmpi/1.10.1/bin/mpicxx --std=c++11 -g -fopenmp -g -O0 -rdynamic @CMakeFiles/TeuchosCore_GlobalMPISessionUnitTests.rsp -o packages/teuchos/core/test/UnitTest/TeuchosCore_GlobalMPISessionUnitTests.exe
```

But the configure output above shows:

```
CMake Warning:
  Manually-specified variables were not used by the project:

    CMAKE_NINJA_FORCE_RESPONSE_FILE
```

What the heck? That variable is clearly having an impact so why is it coming up as non-read? I will likely need to put in a TriBITS hack to read that variable to remove that warning or something.

So the compile shows the usage of an `*.rsp` file for the include directories, the (shared) library creation shows the usage of an `*.rsp` file for the object files, and the executable link shows usage of an `*.rsp` file for both object files and the list of libraries. Therefore, the option `-DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON` with Ninja is equivalent to setting `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_INCLUDES=ON`, `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON`, and `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_LIBRARIES=ON` when using Makefiles. Why is that? So when you use Ninja, you completely lose the flexibility to set the three different usages of response files separately.

I went back to 'white' and tried:

```
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ rm -r CMake*

$ rm -r packages/*

$ . load-env.sh
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA

$ time cmake -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  -DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos &> configure.out

real 0m33.611s
user 0m21.175s
sys 0m6.286s

$ make NP=16
```

That fails to build with:

```
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -MD -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -MF packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp
/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp:44:25: fatal error: gtest/gtest.h: No such file or directory
```

The file:

```
./packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp
```

has the contents:

```
-I. -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/tpls/gtest -Ipackages/kokkos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -I/ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/include -Ipackages/kokkos/core/unit_test -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -DGTEST_HAS_PTHREAD=0 -g -O0
```

It looks like that is the correct list of include directories. I can manually run this command as

```
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -MD -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -MF packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp
/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp:44:25: fatal error: gtest/gtest.h: No such file or directory
```

and with the explicit directory path to the `*.rsp` file:

```
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -MD -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -MF packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp
/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp:44:25: fatal error: gtest/gtest.h: No such file or directory
```

What if I manually add back the include directories?

```
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx -I. -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/tpls/gtest -Ipackages/kokkos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -I/ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/include -Ipackages/kokkos/core/unit_test -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -DGTEST_HAS_PTHREAD=0 -g -O0 -MD -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -MF packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp
```

That built the object file just fine. So my guess is that the `nvcc_wrapper` script is not handling `*.rsp` files correctly on this system. The script being used is:

```
$ set | grep nvcc_wrapper
OMPI_CXX=/home/rabartl/Trilinos.base/Trilinos/packages/kokkos/bin/nvcc_wrapper
```

To see if the `nvcc_wrapper` was parsing the command line correctly, I added debug print statements to the invocation of the script.

I pushed this as the commit 10db5d9 to the branch `3069-white-cuda-debug-stokhos-build-error`. On 'white' I then ran:

```
env NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN=1 /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -MD -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -MF packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp
nvcc -ccbin g++ -arch=sm_35 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -pthread -Xlinker @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp -c -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o && nvcc -ccbin g++ -arch=sm_35 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -pthread -Xlinker @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp -M -MT packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.d
/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp:44:25: fatal error: gtest/gtest.h: No such file or directory
compilation terminated.
```

That shows the `nvcc` command:

```
nvcc -ccbin g++ -arch=sm_35 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -pthread -Xlinker @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp -c -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o
```

Running just that command:

```
$ nvcc -ccbin g++ -arch=sm_35 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -pthread -Xlinker @packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o.rsp -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp -c -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o
/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp:44:25: fatal error: gtest/gtest.h: No such file or directory
compilation terminated.
```

Now what if I directly pass in the include directories?

```
$ nvcc -ccbin g++ -arch=sm_35 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -pthread -Xlinker -I. -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/tpls/gtest -Ipackages/kokkos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -I/ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/include -Ipackages/kokkos/core/unit_test -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/unit_test/UnitTestMain.cpp -c -o packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/UnitTestMain.cpp.o
```

That shows that the problem is that, at least with this version of `nvcc`, which is:

```
$ which nvcc
/home/projects/pwr8-rhel73-lsf/cuda/8.0.44/bin/nvcc

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sat_Sep__3_19:09:38_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
```

you can't use response files. Therefore, this is not an issue with CMake or Ninja or even the nvcc_wrapper.

Now what if we tried this with Makefiles with only the option `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON`? That should avoid using response files for compilation and should only use them for creating libraries. Would that work? Let's try that:

```
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ rm -r CMake*

$ rm -r packages/*

$ . load-env.sh
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA

$ time cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Kokkos=ON \
  -DTrilinos_ENABLE_Teuchos=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos &> configure.out

real 1m3.835s
user 0m35.307s
sys 0m10.871s

$ make -j16
```

That seems to be building fine and using response files only for the creation of libraries, as shown by:

```
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx -fPIC --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -g -O0 -shared -Wl,-soname,libteuchoscore.so.12 -o libteuchoscore.so.12.13 @CMakeFiles/teuchoscore.dir/objects1.rsp -Wl,-rpath,/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/kokkos/core/src: ../../../kokkos/core/src/libkokkoscore.so.12.13 -ldl /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcudart.so /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcublas.so /ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/lib64/libcufft.so
```

And the compile with nvcc wrapper does not use `*.rsp` files:

```
env NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN=1 make VERBOSE=1 Teuchos_dyn_cast.o
cd /ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG && make -f packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/build.make packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o
make[1]: Entering directory `/home/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG'
Building CXX object packages/teuchos/core/src/CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o
cd /ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/teuchos/core/src && /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx -DTEUCHOSCORE_LIB_EXPORTS_MODE -Dteuchoscore_EXPORTS -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/teuchos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/teuchos/core/src -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/kokkos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -I/ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/include -I/ascldap/users/projects/pwr8-rhel73-lsf/boost/1.60.0/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -g -O0 -fPIC -o CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o -c /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/teuchos/core/src/Teuchos_dyn_cast.cpp
nvcc -expt-extended-lambda -arch=sm_37 -ccbin g++ -DTEUCHOSCORE_LIB_EXPORTS_MODE -Dteuchoscore_EXPORTS -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/teuchos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/teuchos/core/src -I/ascldap/users/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/packages/kokkos/core/src -I/ascldap/users/rabartl/Trilinos.base/Trilinos/packages/kokkos/core/src -I/ascldap/users/projects/pwr8-rhel73-lsf/cuda/8.0.44/include -I/ascldap/users/projects/pwr8-rhel73-lsf/boost/1.60.0/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include --std=c++11 -lineinfo -g -g -O0 -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -Xcompiler -fPIC,-pthread -x cu /ascldap/users/rabartl/Trilinos.base/Trilinos/packages/teuchos/core/src/Teuchos_dyn_cast.cpp -c -o CMakeFiles/teuchoscore.dir/Teuchos_dyn_cast.cpp.o
make[1]: Leaving directory `/home/rabartl/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG'
```

See, `nvcc` does **not** get a `*.rsp` file so it works just fine.

Now to try the full build of Stokhos using Makefiles and just `-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` and with shared libs, so that we can use response files:

```
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ rm -r CMake*

$ time cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvSettings.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_Stokhos=ON \
  -DTrilinos_ENABLE_Pamgen=OFF \
  -DTrilinos_ENABLE_Shards=OFF \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

real 1m48.224s
user 1m1.815s
sys 0m18.163s

$ time make -j16 &> make.out

real 91m32.072s
user 944m36.887s
sys 92m48.480s

$ bsub -x -Is -q rhel7F -n 16 ctest -j16 &> ctest.out
```

(NOTE: Above, I added the disables for Pamgen and Shards since those yielded build errors from `nvcc` lacking input files.)

That passed the build but the tests returned:

```
20% tests passed, 67 tests failed out of 84

Subproject Time Summary:
Stokhos = 3441.23 sec*proc (84 tests)

Total Test time (real) = 757.93 sec
```

A bunch of these tests are segfaulting. I will try the full build of all the PT packages and see how that goes.

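To track the pass/fail counts across repeated runs, the ctest summary line can be pulled out of the captured output. This is a small sketch under the assumption that the output was saved to a file such as `ctest.out`, as in the `bsub` command above. The function name `ctest_summary_counts` is hypothetical.

```shell
# Hypothetical helper: extract the counts from a ctest summary line such as
#   "20% tests passed, 67 tests failed out of 84"
# and print them as passed/failed/total.
ctest_summary_counts() {
  sed -n -E 's/^.*tests passed, ([0-9]+) tests failed out of ([0-9]+).*$/\1 \2/p' "$1" |
  while read -r failed total; do
    echo "passed=$((total - failed)) failed=$failed total=$total"
  done
}
```

For the summary shown above, this would print `passed=17 failed=67 total=84`.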
mhoemmen commented 6 years ago

@micahahoward practically speaking this means we will never get a CUDA debug build until we can significantly reduce build sizes and/or support dynamic libraries :(

bartlettroscoe commented 6 years ago

When trying to build all of the Primary Tested packages with:

$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh cuda-debug
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA

$ time cmake  \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

real    6m55.471s
user    3m24.884s
sys     1m18.600s

$ time make -j16

we get build errors like:

$ env NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN=1 make VERBOSE=1 gtest
...
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx -fPIC     --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37   -g -O0  -shared -Wl,-soname,libgtest.so.12 -o libgtest.so.12.13 @CMakeFiles/gtest.dir/objects1.rsp -Wl,-rpath,:::::::::::::: 
nvcc  -expt-extended-lambda -arch=sm_37 -ccbin g++  --std=c++11 -lineinfo -g -g -O0 -shared -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -L/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib -lmpi_cxx -lmpi  -Xlinker -soname,libgtest.so.12 -Xlinker -rpath,:::::::::::::: -Xlinker -rpath -Xlinker /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib -Xlinker --enable-new-dtags  -Xcompiler -fPIC,-pthread  -Xlinker @CMakeFiles/gtest.dir/objects1.rsp   -o libgtest.so.12.13
nvcc fatal   : No input files specified; use option --help for more information

The problem is that, again, nvcc wrapper can't handle *.rsp files.
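
To illustrate what handling a `*.rsp` file would involve, here is a minimal shell sketch that expands each `@file` argument into the tokens stored in that file. The function name `expand_rsp_args` is hypothetical and is not part of the nvcc wrapper, and the sketch assumes the response file holds whitespace-separated paths with no quoting.

```sh
#!/bin/sh
# Hypothetical helper (not part of nvcc_wrapper): print every argument on its
# own line, replacing each "@file" argument with the tokens found in that file.
# Assumption: tokens in the response file contain no spaces or quotes.
expand_rsp_args() {
  for arg in "$@"; do
    case "$arg" in
      @*)
        file=${arg#@}
        if [ -f "$file" ]; then
          # Turn runs of spaces, tabs and newlines into single newlines,
          # then drop any empty line left by leading whitespace.
          tr -s ' \t\n' '\n' < "$file" | sed '/^$/d'
        else
          # Not an existing file: pass the argument through unchanged.
          printf '%s\n' "$arg"
        fi
        ;;
      *)
        printf '%s\n' "$arg"
        ;;
    esac
  done
}
```

For example, `expand_rsp_args -shared @CMakeFiles/gtest.dir/objects1.rsp` would print `-shared` followed by one line per object file listed in `objects1.rsp`, so the object files become ordinary inputs instead of a single `@` argument.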

I will try hacking Stokhos to only use CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON for just the creation of that one library and see what happens.

bartlettroscoe commented 6 years ago

Note that the Experimental build Linux-gcc-5.3.0-OPENMPI-1.8.7_RELEASE_KOKKOS-REFACTOR_EXPERIMENTAL_CUDA-8.0.44 shown on CDash such as at:

already shows these library link failures of the packages Gtest, Shards, and Pamgen with nvcc.

bartlettroscoe commented 6 years ago

So my quest to get this build working has hit an impasse. I got the stokhos_muelu library to link using BUILD_SHARED_LIBS=ON and CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON, but other links fail because some of the TPLs (Boost) are not built with -fPIC on this system and therefore can't be used with the downstream shared libs we are trying to build in Trilinos.

One way to fix this might be to split stokhos_muelu into multiple static libraries and then link them all to stokhos_muelu. I think we should give that a try.

But really we should get Kitware to extend CMake to fix this problem with CMake itself. They could extend CMake to call ar multiple times to incrementally create a static library, passing in subsets of object files each time.
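
The incremental approach can be sketched by hand in shell, assuming the object file paths are listed one per line in a file. The function name `archive_in_batches` and the `AR` and `BATCH` variables are hypothetical; `ar qc` (quick append, creating the archive if needed) and `ar s` (write the symbol index) are the operations documented for GNU `ar`.

```sh
#!/bin/sh
# Hypothetical sketch: build a static library by appending object files in
# batches, so that no single "ar" command line grows too long.
# Usage: archive_in_batches <archive> <file listing one object path per line>
# Assumption: the object paths contain no spaces, quotes or backslashes.
archive_in_batches() {
  archive=$1
  object_list=$2
  rm -f "$archive"
  # xargs starts a new "ar qc <archive> ..." call for every BATCH paths.
  xargs -n "${BATCH:-100}" "${AR:-ar}" qc "$archive" < "$object_list"
  # Write (or refresh) the symbol index once all members are in place.
  "${AR:-ar}" s "$archive"
}
```

A call such as `BATCH=500 archive_in_batches libstokhos_muelu.a objects.txt` would then issue one `ar` call per 500 objects, where `objects.txt` is an assumed list file.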

We also need to start a conversation in ATDM about supporting shared library builds. CMake just works better with shared libraries than static libraries and has more tools for the former.

DETAILED NOTES:

Trying it again but this time set `CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` for just the `Stokhos` libraries. If this works, then I may add a TriBITS option to allow you to set this option on a library-by-library basis.

I added the commit:

```
commit be911a8a0353ba4f814556320d8066502fea8fef
Author: Roscoe A. Bartlett
Date: Sat Jul 7 10:55:22 2018 -0600

    Set CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON just for Stokhos libs (#3069)

    See the detailed comment in the file for the justification. This addresses
    the "Truncation" error with the creation of the stokhos_muelu library
    described in #3069.

M packages/stokhos/src/CMakeLists.txt
```

Now trying to build again, this time without setting `CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON` at the global level:

```
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/

$ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh cuda-debug
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA

$ time cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_ENABLE_ALL_PACKAGES=ON \
  -DBUILD_SHARED_LIBS=ON \
  ~/Trilinos.base/Trilinos \
  &> configure.out

real    6m40.927s
user    2m45.112s
sys     1m5.615s

$ make -j16
```

That fails with the build failure:

```
Linking CXX shared library libstk_util_env.so
/usr/bin/ld: /home/projects/pwr8-rhel73-lsf/boost/1.60.0/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib/libboost_program_options.a(value_semantic.o): In function `boost::program_options::multiple_occurrences::multiple_occurrences()':
value_semantic.cpp:(.text._ZN5boost15program_options20multiple_occurrencesC2Ev[_ZN5boost15program_options20multiple_occurrencesC5Ev]+0x9c): call to `boost::program_options::error_with_option_name::error_with_option_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)' lacks nop, can't restore toc; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
make[2]: *** [packages/stk/stk_util/stk_util/environment/libstk_util_env.so.12.13] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/all] Error 2
```

It looks like the TPLs are not built with -fPIC and therefore can't be used to build shared libs in Trilinos. So it looks like we can't use shared libraries to try to get this build to pass.

micahahoward commented 6 years ago

For our builds with CUDA we use dynamically linked libs/exe on the application side and build Trilinos static with -fPIC. But there’s no reason we can’t build Trilinos dynamic libs (for CUDA) and link against those. In fact, we already do this for at least one non-CUDA build.

Static libs are still desired and preferred for performance. @nmhamster can comment more on that. We haven’t been doing this for our CUDA builds because of our own library sizes, but we’re reducing those for multiple reasons (one of which is to be able to use static libs/exe again). Not being precluded by Trilinos would be nice, so I vote to break up the stokhos_muelu lib.

bartlettroscoe commented 6 years ago

@micahahoward, on what platform are you building shared libs with SPARC? The issue is that, at least on 'white'/'ride', we can't build shared libs in Trilinos because it appears that some of the static TPLs were not built with -fPIC. For example, I saw the build error:

Linking CXX shared library libstk_util_env.so
/usr/bin/ld: /home/projects/pwr8-rhel73-lsf/boost/1.60.0/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib/libboost_program_options.a(value_semantic.o): In function `boost::program_options::multiple_occurrences::multiple_occurrences()':
value_semantic.cpp:(.text._ZN5boost15program_options20multiple_occurrencesC2Ev[_ZN5boost15program_options20multiple_occurrencesC5Ev]+0x9c): call to `boost::program_options::error_with_option_name::error_with_option_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)' lacks nop, can't restore toc; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
make[2]: *** [packages/stk/stk_util/stk_util/environment/libstk_util_env.so.12.13] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/all] Error 2

I can try breaking the stokhos_muelu lib into multiple static libraries. @etphipp, what do you think about that?

etphipp commented 6 years ago

That's perfectly fine with me. A natural way to break things apart would be to put each execution space instantiation (Cuda, Serial, OpenMP, ...) into a separate library, although I am not sure if that will be sufficient to resolve the issue (is anything other than Cuda being instantiated in this build?). The Sacado::UQ::PCE and Sacado::MP::Vector scalar type instantiations can also be easily put in separate libraries with just some minor CMake modifications. There are actually multiple instantiations for Sacado::UQ::PCE for different ensemble sizes that could be separated, but this will require changing the instantiation logic in the source files.

bartlettroscoe commented 6 years ago

Before we break up the stokhos_muelu library, @bradking at Kitware is going to look into this error. I created the work item:

to drive this.

According to @bradking, CMake has a feature where it should automatically call ar multiple times if the command line is too long for one call. But this feature is only implemented in the Makefile generator, not the Ninja generator. We could try the Makefile generator to see if the same error occurs. I will go ahead and give that a try just to see if it fixes this. In any case, it will provide more info. But since we prefer to use the Ninja generator for many reasons, we want that to work as well.
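
One rough way to check whether command-line length is even a plausible cause is to compare the size of the object list against the limit reported by `getconf ARG_MAX`. The sketch below is only an estimate under stated assumptions: the function name `fits_in_one_command` and the list file are hypothetical, and the real limit is also consumed by the environment and by the other arguments to `ar`.

```sh
#!/bin/sh
# Hypothetical check: would all object paths fit on one command line?
# Usage: fits_in_one_command <file listing one object path per line> [limit]
# Each path costs its length plus one terminator byte, which matches the
# byte count of a file holding one newline-terminated path per line.
fits_in_one_command() {
  object_list=$1
  # Default to the system limit for exec arguments plus environment.
  limit=${2:-$(getconf ARG_MAX)}
  needed=$(wc -c < "$object_list" | tr -d ' ')
  [ "$needed" -lt "$limit" ]
}
```

If `fits_in_one_command objects.txt` succeeds with a wide margin, the length of the `ar` command line is less likely to be the culprit and other causes are worth investigating.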

bartlettroscoe commented 6 years ago

FYI: @bradking diagnosed the problem and the fix is an updated BinUtils module load in PR #3100. After that PR is merged and we get results on CDash, we can close this issue.

bartlettroscoe commented 6 years ago

FYI: PR #3100 was just merged. Therefore, we should see this build failure clear up in the build on 7/14/2018. Putting this "in review".

bartlettroscoe commented 6 years ago

After the merge of PR #3100 yesterday, the full Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once build (but not tests) on 'white' and 'ride' is now passing, including the Stokhos build, as shown today here. (Note the -1 subscript by the 0 under "Build Error" for the Stokhos package.)

NOTE: The 'bsub' command crashed early on both 'white' and 'ride' so we did not get any test results on CDash. Hopefully tomorrow we will.

I am closing as complete.