Closed bartlettroscoe closed 6 years ago
Is this one of those situations that calls for -D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON?
@mhoemmen said:
Is this one of those situations that calls for -D CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON?
Perhaps worth trying. This library has 441 object files totaling 53854 chars.
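As a rough way to sanity-check whether a link or archive line like this one is approaching an OS limit, the sketch below adds up the argument lengths and compares them with `SC_ARG_MAX`. The function name `command_line_budget` is hypothetical, and it is only an assumption that command-line length is what matters for this failure. The limit reported by `os.sysconf` also has to cover the environment, so the comparison is approximate.

```python
import os


def command_line_budget(object_paths, fixed_args=()):
    """Return (chars_needed, limit) for passing all objects on one command line.

    `limit` is -1 when the platform does not report SC_ARG_MAX.
    """
    args = list(fixed_args) + list(object_paths)
    # Each argument costs its own length plus one separator character.
    needed = sum(len(a) + 1 for a in args)
    try:
        limit = os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError, AttributeError):
        # Assumption: fall back to "unknown" rather than guessing a limit.
        limit = -1
    return needed, limit


if __name__ == "__main__":
    needed, limit = command_line_budget(["a.o", "bb.o"], ["ar", "qc", "lib.a"])
    print(f"need {needed} chars, SC_ARG_MAX reports {limit}")
```

Feeding it the real object list from the failing target (for example, read from the generated build files) would show how close the command is to the reported limit on a given machine.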
FYI: After some experimenting and investigation (and learning several things about CMake that I did not know), it looks like we can't use response files (*.rsp) to handle long lists of object files for this build unless we ditch both Ninja and static libs. I am running the full build now, but it looks like if we use Makefiles and shared libs, we might be able to use an *.rsp file for the creation of these libraries. We will see in a few hours whether this fixes this CUDA debug Stokhos build. However, I am finding that moving to a shared lib build causes build errors with a few packages like Pamgen and Shards, for reasons I have not yet determined. I don't know if it makes sense to bother with CUDA builds for packages like that. I am disabling non-critical packages as we go.
So there is good news and bad news. The good news is that you can make this (and any other) Stokhos build failure go away if you set -DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON. The bad news is that you can't use Ninja, you can't use static libs, and you can't build with the Primary Tested packages Pamgen or Shards enabled. See details below.
You can't use Ninja because in order to get CMake to use *.rsp response files, you have to use the option -DCMAKE_NINJA_FORCE_RESPONSE_FILE=ON, which also results in *.rsp files being used to pass include directories to the compiler. The problem is that the nvcc that comes with CUDA 8.0.44, which is used in this CUDA build on white, does not seem to support these *.rsp response files, which are passed through by kokkos/bin/nvcc_wrapper. With the current CMake, you can't turn off the usage of *.rsp files for include directories if you want to use them for object files. (But we can ask Kitware to fix this. I see no reason they can't.) Therefore, you have to use the built-in CMake Makefile generator. (This results in slower, less parallel builds, slower dependency analysis, and other disadvantages.)
You can't use static libraries (i.e. -D BUILD_SHARED_LIBS=OFF) because CMake does not use *.rsp response files when creating them: CMake uses the ar program to create static libs, and it would seem that ar does not support *.rsp response files. (But I suspect that Kitware could add support to CMake to build a static lib incrementally using multiple calls to ar.)
So, if we want to build Stokhos on this platform with a CUDA 8.0 debug build, we need to use Makefiles and shared libs with the packages Pamgen and Shards disabled.
I am now running a fuller build of Trilinos to see how that goes ...
@micahahoward practically speaking this means we will never get a CUDA debug build until we can significantly reduce build sizes and/or support dynamic libraries :(
When trying to build all of the Primary tested packages with:
$ cd ~/Trilinos.base/BUILD/WHITE/CUDA/CUDA-DEBUG/
$ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh cuda-debug
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
ATDM_CONFIG_TRILNOS_DIR = /home/rabartl/Trilinos.base/Trilinos
Setting default compiler and build options for JOB_NAME='cuda-debug'
Using white/ride compiler stack CUDA to build DEBUG code with Kokkos node type CUDA
$ time cmake \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DCMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON \
-DBUILD_SHARED_LIBS=ON \
~/Trilinos.base/Trilinos \
&> configure.out
real 6m55.471s
user 3m24.884s
sys 1m18.600s
$ time make -j16
we get build errors like:
$ env NVCC_WRAPPER_SHOW_COMMANDS_BEING_RUN=1 make VERBOSE=1 gtest
...
/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/bin/mpicxx -fPIC --std=c++11 -lineinfo -g -expt-extended-lambda -arch=sm_37 -g -O0 -shared -Wl,-soname,libgtest.so.12 -o libgtest.so.12.13 @CMakeFiles/gtest.dir/objects1.rsp -Wl,-rpath,::::::::::::::
nvcc -expt-extended-lambda -arch=sm_37 -ccbin g++ --std=c++11 -lineinfo -g -g -O0 -shared -I/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/include -L/home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib -lmpi_cxx -lmpi -Xlinker -soname,libgtest.so.12 -Xlinker -rpath,:::::::::::::: -Xlinker -rpath -Xlinker /home/projects/pwr8-rhel73-lsf/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib -Xlinker --enable-new-dtags -Xcompiler -fPIC,-pthread -Xlinker @CMakeFiles/gtest.dir/objects1.rsp -o libgtest.so.12.13
nvcc fatal : No input files specified; use option --help for more information
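The problem is that, again, the nvcc wrapper can't handle *.rsp files. In the command above it forwards @CMakeFiles/gtest.dir/objects1.rsp to the linker via -Xlinker, so nvcc itself sees no input files.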
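To make concrete what "handling" a response file means, here is a minimal sketch of writing an object list to an *.rsp file and expanding `@file` arguments back into a flat argument list before invoking a tool that does not understand them. This is not how kokkos/bin/nvcc_wrapper works. The function names are hypothetical, and using shell-style quoting inside the file is an assumption (real tools each define their own quoting rules).

```python
import shlex
from pathlib import Path


def write_response_file(path, args):
    """Write one shell-quoted argument per line (quoting style is an assumption)."""
    Path(path).write_text("\n".join(shlex.quote(a) for a in args) + "\n")


def expand_response_args(argv):
    """Replace each '@file' argument with the arguments read from that file.

    Arguments that start with '@' but do not name an existing file are
    passed through unchanged.
    """
    expanded = []
    for arg in argv:
        candidate = Path(arg[1:]) if arg.startswith("@") else None
        if candidate is not None and candidate.is_file():
            expanded.extend(shlex.split(candidate.read_text()))
        else:
            expanded.append(arg)
    return expanded


if __name__ == "__main__":
    write_response_file("objects_demo.rsp", ["a.o", "dir with space/b.o"])
    print(expand_response_args(["-shared", "@objects_demo.rsp", "-o", "libdemo.so"]))
    Path("objects_demo.rsp").unlink()
```

Note that expanding the file in a wrapper puts the full object list back on the command line of the wrapped tool, so this only helps when the limit being hit is upstream of the wrapper.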
I will try hacking Stokhos to use CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON for just the creation of that one library and see what happens.
Note that the Experimental build Linux-gcc-5.3.0-OPENMPI-1.8.7_RELEASE_KOKKOS-REFACTOR_EXPERIMENTAL_CUDA-8.0.44
shown on CDash such as at:
already shows these library link failures of the packages Gtest, Shards, and Pamgen with nvcc.
So my quest to get this build working has hit an impasse. I got the stokhos_muelu library to link using BUILD_SHARED_LIBS=ON and CMAKE_CXX_USE_RESPONSE_FILE_FOR_OBJECTS=ON, but other links fail because some of the TPLs (Boost) are not built with -fPIC on this system and therefore can't be used with the downstream shared libs we are trying to build in Trilinos.
One way to fix this might be to split stokhos_muelu into multiple static libraries and then link them all to stokhos_muelu. I think we should give that a try.
But really we should get Kitware to fix this problem in CMake itself. They could extend CMake to call ar multiple times to create a static library incrementally, passing in a subset of the object files each time.
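The idea of building one archive through several ar calls can be sketched as follows. This is an illustration under stated assumptions, not CMake's implementation: the function name `chunked_ar_commands` and the `max_chars` budget are hypothetical. The sketch only generates the commands (create with `ar qc`, append with `ar q`, index with `ranlib`) so they can be inspected before anything is run.

```python
def chunked_ar_commands(archive, objects, max_chars, ar="ar", ranlib="ranlib"):
    """Return argv lists that build `archive` from `objects` in several ar calls.

    Objects are packed greedily so that the object portion of each call
    stays within `max_chars` (each object costs its length plus one separator).
    """
    batches, batch, batch_len = [], [], 0
    for obj in objects:
        cost = len(obj) + 1
        if batch and batch_len + cost > max_chars:
            batches.append(batch)
            batch, batch_len = [], 0
        batch.append(obj)
        batch_len += cost
    if batch:
        batches.append(batch)

    commands = []
    for i, objs in enumerate(batches):
        # First call creates the archive quietly ("qc"); later calls append ("q").
        op = "qc" if i == 0 else "q"
        commands.append([ar, op, archive] + objs)
    # Build the symbol index once, after all members are in place.
    commands.append([ranlib, archive])
    return commands


if __name__ == "__main__":
    for cmd in chunked_ar_commands("libdemo.a", ["a.o", "b.o", "c.o"], max_chars=8):
        print(" ".join(cmd))
```

To run the commands for real, each list could be passed to `subprocess.run(cmd, check=True)`. Because `ar q` appends to an existing archive, a stale archive from a previous build should be deleted first.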
We also need to start a conversation in ATDM about supporting shared library builds. CMake just works better with shared libraries than with static libraries and has more tools for the former.
For our builds with CUDA, we use dynamically linked libs/exe on the application side and build Trilinos static with -fPIC. But there’s no reason we can’t build Trilinos dynamic libs (for CUDA) and link against those. In fact, we already do this for at least one non-CUDA build.
Static libs are still desired and preferred for performance. @nmhamster can comment more on that. We haven’t been doing this for our CUDA builds because of our own library sizes, but we’re reducing those for multiple reasons (one of which is to be able to use static libs/exe again). Not being precluded by Trilinos would be nice, so I vote to break up the stokhos_muelu lib.
@micahahoward, on what platform are you building shared libs with SPARC? The issue is that, at least on 'white'/'ride', we can't build shared libs in Trilinos because it appears that some of the static TPLs were not built with -fPIC. For example, I saw the build error:
Linking CXX shared library libstk_util_env.so
/usr/bin/ld: /home/projects/pwr8-rhel73-lsf/boost/1.60.0/openmpi/1.10.4/gcc/5.4.0/cuda/8.0.44/lib/libboost_program_options.a(value_semantic.o): In function `boost::program_options::multiple_occurrences::multiple_occurrences()':
value_semantic.cpp:(.text._ZN5boost15program_options20multiple_occurrencesC2Ev[_ZN5boost15program_options20multiple_occurrencesC5Ev]+0x9c): call to `boost::program_options::error_with_option_name::error_with_option_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)' lacks nop, can't restore toc; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
make[2]: *** [packages/stk/stk_util/stk_util/environment/libstk_util_env.so.12.13] Error 1
make[1]: *** [packages/stk/stk_util/stk_util/environment/CMakeFiles/stk_util_env.dir/all] Error 2
I can try breaking the stokhos_muelu lib into multiple static libraries. @etphipp, what do you think about that?
That's perfectly fine with me. A natural way to break things apart would be to put each execution space instantiation (Cuda, Serial, OpenMP, ...) into a separate library, although I am not sure if that will be sufficient to resolve the issue (is anything other than Cuda being instantiated in this build?). The Sacado::UQ::PCE and Sacado::MP::Vector scalar type instantiations can also easily be put in separate libraries with just some minor CMake modifications. There are actually multiple instantiations of Sacado::UQ::PCE for different ensemble sizes that could be separated, but this will require changing the instantiation logic in the source files.
Before we break up the stokhos_muelu library, @bradking at Kitware is going to look into this error. I created the work item:
to drive this.
According to @bradking, CMake has a feature where it should automatically call ar multiple times if the command line is too long for one call. But this feature is only implemented in the Makefile generator, not the Ninja generator. We could try the Makefile generator to see if the same error occurs. I will go ahead and give that a try just to see if it fixes this. In any case, it will provide more info. But since we prefer to use the Ninja generator for many reasons, we want that to work as well.
FYI: @bradking diagnosed the problem and the fix is an updated BinUtils module load in PR #3100. After that PR is merged and we get results on CDash, we can close this issue.
FYI: PR #3100 was just merged. Therefore, we should see this build failure clear up in the build on 7/14/2018. Putting this "in review".
After the merge of PR #3100 yesterday, the full Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once build (but not tests) on 'white' and 'ride' is now passing, including the Stokhos build, as shown today here. (Note the -1 subscript by the 0 under "Build Error" for the Stokhos package.)
NOTE: The 'bsub' command crashed early on both 'white' and 'ride', so we did not get any test results on CDash. Hopefully tomorrow we will.
I am closing as complete.
CC: @trilinos/stokhos, @rppawlo (Trilinos Nonlinear Solvers Product Lead)
Next Action Status
PR #3100, merged on 7/13/2018, resulted in a 100% clean build (but not tests), including Stokhos, on 7/14/2018.
Description
The creation of the Stokhos library libstokhos_muelu.a fails in the CUDA 8.0 debug build Trilinos-atdm-white-ride-cuda-debug-pt-all-at-once on 'white' and 'ride'. The build error output for the build this morning shown here on 'white' shows:
Steps to reproduce
This build error can be reproduced on 'white' or 'ride' as described in the document:
The specific instructions for 'white' or 'ride' are given at:
The one difference is that this build of all of the Primary Tested Trilinos packages (which includes more packages than are currently being used by ATDM APPs) does not exclude any Trilinos packages and tweaks a few other settings, so it uses the file Trilinos/cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake instead of the file ATDMDevEnv.cmake.
After cloning Trilinos, the following commands should reproduce the build failure:
I (@bartlettroscoe) just tried this on 'white' and was able to reproduce the same build failure shown on CDash above.