trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Build and test failures in ATDM RDC builds on white and waterman #4502

Closed fryeguy52 closed 5 years ago

fryeguy52 commented 5 years ago

CC: Trilinos Product area leads: @jwillenbring, @rppawlo, @kddevin, @mperego, @srajama1

Other CC: @bartlettroscoe @fryeguy52

Next Action Status

Next: Waiting for PR #4761 to get tested, approved, and merged ...

Description

As shown here, there are several failing tests and build errors in the builds:

all test failures

build error output

These are builds that enable CUDA relocatable device code (RDC). Most of the errors look something like:

nvlink error   : Undefined reference to '_ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE' in 'packages/intrepid2/unit-test/Discretization/Basis/HDIV_HEX_In_FEM/Serial/CMakeFiles/Intrepid2_unit-test_Discretization_Basis_HDIV_HEX_In_FEM_Serial_Test_01_SLFadDouble.dir/test_01_SLFadDouble.cpp.o'

or

nvlink warning : Stack size for entry function '_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_000160f8_00000000_6_Kokkos_Cuda_Task_cpp1_ii_b2872e7123cuda_task_queue_executeEPNS0_9TaskQueueINS_4CudaEEEi' cannot be statically determined
collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.

Current Status on CDash

Steps to Reproduce

One should be able to reproduce this failure on ride or white as described in:

More specifically, the commands given for ride or white are provided at:

For the "-pt" builds <build-name>:

and for <Package> = Kokkos, KokkosKernels, Belos, etc., the commands to reproduce the build and test failures should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16

For the other builds with <build-name>:

and for <Package> = Kokkos, KokkosKernels, Belos, etc., the commands to reproduce the build and test failures on 'white' or 'ride' should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16

To reproduce the build and test failures for the 'waterman' builds with <build-name>:

one uses:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -n 20 ctest -j16

etphipp commented 5 years ago

It appears to me that none of this is related to Stokhos. All of the build errors are of the form:

nvlink warning : Stack size for entry function '_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_0000fa17_00000000_6_Kokkos_Cuda_Task_cpp1_ii_b2872e7123cuda_task_queue_executeEPNS0_9TaskQueueINS_4CudaEEEi' cannot be statically determined
collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.

which looks to be some issue with Kokkos. Furthermore, all of the test failures are of the form

--------------------------------------------------------------------------
mpiexec was unable to launch the specified application as it could not access
or execute an executable:

Executable: /home/jenkins/white/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/BUILD/packages/stokhos/test/UnitTest/Stokhos_ConstantExpansionUnitTest.exe
Node: white22

while attempting to start process rank 0.
--------------------------------------------------------------------------

I don't know if that is a filesystem issue, or the executables were never created.
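
One way to tell the two cases apart would be to check the path from the mpiexec message directly on a node that sees the build directory. A minimal sketch, assuming a POSIX shell; the function name `check_test_exe` is made up for illustration and is not part of Trilinos:

```shell
# Hypothetical helper (not part of Trilinos): classify a test executable path.
# Prints "missing" if nothing exists at the path (for example, because the
# link step failed), "not-executable" if something exists there but cannot be
# run as a program by the current user, and "ok" otherwise.
check_test_exe() {
  exe="$1"
  if [ ! -e "$exe" ]; then
    echo "missing"
  elif [ ! -x "$exe" ] || [ -d "$exe" ]; then
    echo "not-executable"
  else
    echo "ok"
  fi
}

# Example (path relative to the build directory, taken from the message above):
# check_test_exe packages/stokhos/test/UnitTest/Stokhos_ConstantExpansionUnitTest.exe
```

A result of "missing" would suggest the executable was never linked, which would be consistent with the `ld terminated with signal 9` errors, though that is only a guess until someone checks.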

bartlettroscoe commented 5 years ago

FYI: This is essentially the Trilinos PR CUDA build, so this error would mean that we either can't turn on RDC for the CUDA PR build or would need to exclude Stokhos from the CUDA PR build.

rppawlo commented 5 years ago

This is the exact error all the panzer tests report when building with RDC too.

etphipp commented 5 years ago

I haven't had a chance to run this build myself, but I suspect the issue is due to going over the 4GB file limit for linking. Building shared libraries instead of static will help this substantially. Breaking libraries apart might help.

bartlettroscoe commented 5 years ago

@etphipp, I will set up some shared CUDA RDC builds as well. SPARC does shared CUDA builds, so we have to support shared CUDA builds anyway.

bartlettroscoe commented 5 years ago

Relates to #4501 too ...

On the branch 2598-tril-262-atdm-cuda-rdc-shared:

To git@github.com:bartlettroscoe/Trilinos.git
 * [new branch]      2598-tril-262-atdm-cuda-rdc-shared -> 2598-tril-262-atdm-cuda-rdc-shared

on 'ride' I ran:

$ env Trilinos_PACKAGES=Stokhos,Panzer,TrilinosCouplings ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt

***
*** ./ctest-s-local-test-driver.sh  cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt
***

ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos'

Load some env to get python, cmake, etc ...

Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'default'
Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8

Running builds: cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt

Running Jenkins driver Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt.sh ...

real    127m56.932s
user    0m0.705s
sys     0m0.306s

This posted results to:

This showed two build errors in creating shared libs, one in KokkosKernels and one in STK. But these link errors resulted in no tests being built so still not good.

I will go ahead and run the full set of package builds and see what happens.

bartlettroscoe commented 5 years ago

On the branch 2598-tril-262-atdm-cuda-rdc-shared:

To git@github.com:bartlettroscoe/Trilinos.git
 * [new branch]      2598-tril-262-atdm-cuda-rdc-shared -> 2598-tril-262-atdm-cuda-rdc-shared

on 'ride' I ran:

$ env ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt

***
*** ./ctest-s-local-test-driver.sh  cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt
***

ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos'

Load some env to get python, cmake, etc ...

Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'default'
Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8

Running builds: cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt

Running Jenkins driver Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt.sh ...

real    217m17.182s
user    0m0.887s
sys     0m0.389s

This posted results to:

This showed build errors in the packages Kokkos, KokkosKernels, Sacado, ShyLU_Node, Intrepid, and Intrepid2.

fryeguy52 commented 5 years ago

@etphipp @rppawlo @bartlettroscoe FYI I updated the description of this issue to include the failures in all the packages in these rdc builds and closed #4501

etphipp commented 5 years ago

So I tried the static RDC build shown above in the initial comment for this issue on ride, and didn't get any build or test failures. Did something change?

bartlettroscoe commented 5 years ago

@etphipp asked:

So I tried the static RDC build shown above in the initial comment for this issue on ride, and didn't get any build or test failures. Did something change?

Was this for Stokhos on 'ride' using:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16

?

etphipp commented 5 years ago

Yes. Only difference was I used make instead of ninja. Here were my exact commands:

TRILINOS_DIR=$HOME/Trilinos/Trilinos
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
cmake   \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON   $TRILINOS_DIR
make -j16 -k
bsub -x -Is -q rhel7F -n 16 ctest -j16

bartlettroscoe commented 5 years ago

@etphipp said:

Yes. Only difference was I used make instead of ninja

Okay, I will try to reproduce the Stokhos build error using this command.

However, note that I was able to manually reproduce Sacado build errors for the cuda-9.2-rdc-shared-release-debug build on 'ride' in my account using:

$ bsub -x -Is -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-9.2-rdc-shared-release-debug \
  --enable-packages=Sacado --local-do-all

which produced the first build error:

[25/739] Linking CXX executable packages/sacado/test/UnitTests/Sacado_FadKokkosTests_Cuda_Hierarchical_DFad.exe
FAILED: packages/sacado/test/UnitTests/Sacado_FadKokkosTests_Cuda_Hierarchical_DFad.exe 
...
nvlink error   : Undefined reference to '_ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE' in 'packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_NoViewSpec_Cuda.dir/Fad_KokkosTests_NoViewSpec_Cuda.cpp.o'
nvlink error   : Undefined reference to '_ZN6Kokkos4Impl25g_device_cuda_lock_arraysE' in 'packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadKokkosTests_NoViewSpec_Cuda.dir/Fad_KokkosTests_NoViewSpec_Cuda.cpp.o'

I also tried the more direct approach on 'ride' using:

$ cd cuda-9.2-rdc-shared-release-debug/

$ source /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \
   cuda-9.2-rdc-shared-release-debug

Hostname 'ride6' matches known ATDM host 'ride' and system 'ride'
Setting compiler and build options for buld name 'cuda-9.2-rdc-shared-release-debug'
Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37

$ cmake -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Sacado=ON \
   /home/rabartl/Trilinos.base/Trilinos &> configure.out

$ ninja -j16 &> make.out

and the first build error it produced was:

[170/1033] Linking CXX executable packages/sacado/test/UnitTests/Sacado_FadFadKokkosTests_Serial.exe
FAILED: packages/sacado/test/UnitTests/Sacado_FadFadKokkosTests_Serial.exe 
...
nvlink error   : Undefined reference to '_ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE' in 'packages/sacado/test/UnitTests/CMakeFiles/Sacado_FadFadKokkosTests_Serial.dir/Fad_Fad_KokkosTests_Serial.cpp.o'
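
With many failing link steps in one log, it can help to reduce the log to the distinct symbols nvlink complains about. A minimal sketch, assuming the `nvlink error   : Undefined reference to '...'` line format shown above; the function name is made up for illustration:

```shell
# Hypothetical helper: read a build log on stdin and print each distinct
# mangled symbol that nvlink reported as an undefined reference.
nvlink_undefined_symbols() {
  sed -n "s/.*nvlink error *: Undefined reference to '\([^']*\)'.*/\1/p" |
    LC_ALL=C sort -u
}

# Example, using the log file written by the ninja command above:
# nvlink_undefined_symbols < make.out
# The output can optionally be piped through c++filt (from binutils, if it is
# installed) to demangle the names.
```

For the Sacado errors quoted in this comment, this should print just two names: the Kokkos lock-arrays symbol and the Sacado memory-pool symbol.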

@etphipp, can you see if you can reproduce Sacado failures for the 'rdc-shared' builds?

NOTE: SPARC uses shared lib builds for CUDA, so these 'rdc-shared' builds might be higher priority to fix than the 'rdc-static' builds ('static' is the default). I suspect that if EMPIRE had to use 'shared' builds, they would not mind if those worked and the 'static' builds did not.

etphipp commented 5 years ago

OK. I am trying a ninja build for stokhos, to see if that matters. And I will try a shared-rdc build for Sacado. Note that people have been building Sacado with RDC for a while now, so I am surprised there is an issue.

Also, @rppawlo told me that EMPIRE observes some huge speedup for PIC when using static builds instead of shared, so they will very likely want static.

etphipp commented 5 years ago

Two questions @bartlettroscoe :

bartlettroscoe commented 5 years ago

@etphipp asked:

What does "pt" in the build name mean? Should I be doing pt or non-pt?

It stands for "Primary Tested". This is a slight variation of the ATDM Trilinos build that allows enabling all "Primary Tested" Trilinos packages. That is what the current Trilinos CUDA PR build is based on (except they did not quite translate the configuration completely and they disable SEACAS and STK). So we care about that configuration in case we want to enable RDC in a Trilinos CUDA PR build.

The default ATDM Trilinos configuration pulled in with the file ATDMDevEnv.cmake disables all of the packages and subpackages that the ATDM APPs don't use. For ATDM, we really care about the non "pt" builds. But

In the build names at the top of this issue, they start with "Trilinos-atdm-white-ride", but you don't seem to be including that in the name. Does that matter?

In this case, no. "Trilinos-atdm-white-ride" is ignored. The only env on 'ride' is the 'ride' env, and 'Trilinos' and 'atdm' are not recognized build-name keywords. The current set of recognized keywords is described at:

Note that people have been building Sacado with RDC for a while now, so I am surprised there is an issue.

All of the build errors seem to be in Sacado examples and tests. My guess is that current customers (assuming that is Aria) are not building Sacado tests in their CUDA builds.

Also, @rppawlo told me that EMPIRE observes some huge speedup for PIC when using static builds instead of shared, so they will very likely want static.

Okay, then we need to get both 'static' and 'shared' CUDA 'rdc' builds to work then :-)

etphipp commented 5 years ago

I can reproduce the Sacado link errors with the shared build:

nvlink error   : Undefined reference to '_ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE' in 'CMakeFiles/Sacado_ELRCacheFadCommTests.dir/ELRCacheFad_CommTests.cpp.o'

However I am completely at a loss on how to fix it, since the symbol does appear to be defined in libsacado:

nm libsacado.so.12.13 | grep _ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE
0000000000040110 B _ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE

Looking at the source code in detail, both Sacado and Kokkos are using the same procedure to handle RDC when declaring a device global variable, and that process generates link errors for Kokkos as well. Hence it must not be correct, at least for Power.

@rppawlo, @vbrunini: Aren't you guys building with RDC enabled? Do you see any of these link errors with Sacado and/or Kokkos?

vbrunini commented 5 years ago

We are building with RDC, yes. We are also not seeing any of these link errors with either Sacado or Kokkos on either x86 + Cuda builds or Power + Cuda builds. We have only done static library builds though, not shared as far as I know.

bartlettroscoe commented 5 years ago

@vbrunini said:

We are building with RDC, yes. We are also not seeing any of these link errors with either Sacado or Kokkos on either x86 + Cuda builds or Power + Cuda builds. We have only done static library builds though, not shared as far as I know.

NOTE: Sacado and Kokkos build and test fine with static libs with CUDA RDC. It is other packages that have problems with static libs. All of this can be seen on CDash (click the link in the "Current Status on CDash" section above).

etphipp commented 5 years ago

I did a static CUDA RDC build for Stokhos with ninja and still can't reproduce the build failures. However I suspect the issue is the size of the libstokhos_muelu.a library, which is over 8 GB. I am going to break this library apart and also eliminate some of the default explicit instantiations in this library to make the libraries smaller (and also build faster).
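
To see which static libraries in a build tree are large enough to be suspect, something like the following could be used. A minimal sketch; the function name is made up for illustration, and the 4 GiB threshold in the example is only an assumption based on the 4GB limit suspected earlier in this thread:

```shell
# Hypothetical helper: list static libraries under a directory that are
# larger than a given size in bytes, one path per line.
find_large_archives() {
  dir="$1"
  min_bytes="$2"
  find "$dir" -type f -name '*.a' -size +"${min_bytes}c"
}

# Example: archives larger than 4 GiB (4294967296 bytes) in the current
# build tree.
# find_large_archives . 4294967296
```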

rppawlo commented 5 years ago

I am pretty sure it was working with RDC+shared, but it's been probably 6 months since I last checked. The RDC requirement in Phalanx was optional. For that work, I was only enabling Kokkos, Sacado, Teuchos, and Phalanx. There were issues in enabling more packages that I think had to do with library sizes, as @etphipp mentioned. Disabling some packages allowed the compile to get further.

bartlettroscoe commented 5 years ago

@rppawlo @kddevin @mperego @srajama1,

All of these RDC builds are up and submitting to CDash (see links in "Current Status on CDash") and there are reproducibility instructions for each set of builds. Therefore, the ball is in your court now to decide on the priority to clean these up. The build errors need to be addressed first, obviously. Please consult with the ATDM PIs as to the priority and urgency of getting CUDA RDC builds working.

srajama1 commented 5 years ago

@bartlettroscoe Can you please point me to the request for adding this build in a private e-mail ?

bartlettroscoe commented 5 years ago

Trilinos Product area leads: @jwillenbring, @rppawlo, @kddevin, @mperego, @srajama1

To give some of the context and motivation for enabling CUDA RDC from the ATDM APPs:

For EMPIRE, the issue is that there is some code that fails unless they enable RDC. Therefore, their need to enable RDC seems more urgent. (You could talk with @bathmatt more about that.)

For SPARC, the main motivation to switch to support CUDA RDC is to reduce build times. Currently, to work with RDC disabled, they have to inline a lot of code with static polymorphism. This may lead to fast code, but it makes the CUDA build times very high, which negatively impacts SPARC developer productivity. What they would like to do is to enable RDC and then have a mode where many functions could be made virtual, thereby reducing the number of different template instantiations and reducing build times. They would use performance profiling to experiment with which functions could be made virtual without impacting performance too much and which would need to support either inlining (for high performance but slow build times) or virtual functions with RDC enabled (for slower runtime performance but much faster builds). Therefore, SPARC developers would mostly use RDC builds with faster build times, but SPARC users would use more inlined functions for best performance on production machines. (You could get more details from @micahahoward.)

Also, the SNL ASC IC code Sierra (not the LLNL machine Sierra) enables CUDA RDC in its builds of Trilinos, and there are some STK tests that require RDC to be enabled in order to pass. (I think these must be disabled in the Trilinos CMake build currently. You could ask @alanw0 about that.)

Hope this helps.

mperego commented 5 years ago

@kyungjoo-kim , if you have some time, can you help with this issue?

kyungjoo-kim commented 5 years ago

@mperego okay, no problem. I will take care of the Panzer RDC build. I do not see failures from Intrepid2.

mperego commented 5 years ago

@kyungjoo-kim thanks!! Here is the one involving Intrepid2 https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=4648609&filtercount=4&showfilters=1&field1=buildstarttime&compare1=83&value1=2019-03-03&field2=buildstarttime&compare2=84&value2=2019-03-04&filtercombine=and

kyungjoo-kim commented 5 years ago

@mperego Oh.... intrepid2 has build errors. love rdc.

mhoemmen commented 5 years ago

@bartlettroscoe Aria needs RDC, since it needs to call virtual methods on device.

srajama1 commented 5 years ago

How come SD/SM don't like RDC then ? Is there a common process for all of Sierra or is it product specific within Sierra ?

srajama1 commented 5 years ago

I don't see any failures on Kokkos, Kokkos Kernels, and ShyLU in the above link ? Is there a reason why they are mentioned ?

bartlettroscoe commented 5 years ago

@srajama1 asked:

I don't see any failures on Kokkos, Kokkos Kernels, and ShyLU in the above link ? Is there a reason why they are mentioned ?

You can see build errors in Kokkos, KokkosKernels, and ShyLU_Node in this one build:

and you see Kokkos and KokkosKernels build failures in the build:

Different errors occur in different builds. (That is why there are different builds.)

Any other questions about this?

kyungjoo-kim commented 5 years ago

@bartlettroscoe Is there any way that I can see the CMakeCache output from the Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt build ?

I first tried to reproduce the build with the above environment and found that Kokkos_ENABLE_Cuda_Relocatable_Device_Code:BOOL=OFF. I also tested with Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt and again saw that Kokkos RDC is OFF. Do I need to set it explicitly in the test, or is it supposed to work without the enabling flag?
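
A quick way to check this after configuring is to read the option straight out of the build directory's CMakeCache.txt. A minimal sketch; the function name is made up for illustration, and it only looks at the `Kokkos_ENABLE_Cuda_Relocatable_Device_Code` cache entry named above:

```shell
# Hypothetical helper: report the value of the Kokkos RDC option recorded in
# a CMakeCache.txt file. Prints the stored value (for example ON or OFF), or
# "unset" if the file has no such entry.
kokkos_rdc_setting() {
  cache="$1"
  value=$(sed -n 's/^Kokkos_ENABLE_Cuda_Relocatable_Device_Code:BOOL=//p' "$cache" | head -n 1)
  echo "${value:-unset}"
}

# Example, run from the build directory after configuring:
# kokkos_rdc_setting CMakeCache.txt
```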

ndellingwood commented 5 years ago

More info to add - I tested Kokkos' master branch (mirroring most closely that in Trilinos) with gcc/7.2.0 and cuda/9.2.88 on White using the Pascal queue (Kepler queue looks like it may be backed up until tomorrow...) and I was unable to reproduce.

I tested Kokkos as follows: Log onto White and load modules.

cd kokkos
git checkout master
git pull
mkdir testing/White-Cuda9.2_GCC7.2-rdc-dbg
cd testing/White-Cuda9.2_GCC7.2-rdc-dbg
../../generate_makefile.bash --with-cuda --arch="Power8,Pascal60" --compiler=${HOME}/kokkos/bin/nvcc_wrapper --with-cuda-options=rdc --debug
bsub -Is -n 1 -q rhel7G bash
make unit-tests-only -j16

No build issues.

etphipp commented 5 years ago

@ndellingwood, does this build Kokkos with shared libraries? The Kokkos linking errors are specific to the shared+RDC builds. Would it be possible to do a shared+RDC build with Kokkos to see if you can reproduce it?

@bartlettroscoe, is it possible to do a shared+RDC Trilinos build on a non-Power platform? I've looked at the symbols in the Kokkos and Sacado libraries, and the symbol the linker is complaining about is present. So I am coming to the conclusion this is a toolchain issue (with either the nvidia linker or compiler), and the question is, is it specific to the power platform (which has been problematic from day one).

etphipp commented 5 years ago

Btw, I think it would be beneficial to separate this issue into two: one for shared and one for static. The issues for each package are different. Furthermore, until Kokkos can successfully link with shared+RDC, there is no point in having any other package look into it.

kyungjoo-kim commented 5 years ago

@bartlettroscoe

I can compile and test Intrepid2 with the following command.

 cmake  -GNinja  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Intrepid2=ON  -D Kokkos_ENABLE_Cuda_Relocatable_Device_Code:BOOL=ON $TRILINOS_DIR

We cannot remove the "Stack size for entry function" warning, as Kokkos tasking uses some recursion (I believe). This warning won't affect Trilinos.

nvlink warning : Stack size for entry function '_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_00018012_00000000_6_Kokkos_Cuda_Task_cpp1_ii_b2872e7123cuda_task_queue_executeEPNS0_9TaskQueueINS_4CudaEEEi'

srajama1 commented 5 years ago

@kyungjoo-kim @ndellingwood : There is a Tacho failure and a Kokkos Kernels failure in the link @bartlettroscoe posted. Can you see if you can reproduce those in your builds ?

kyungjoo-kim commented 5 years ago

Having tested with Intrepid2, I am pretty sure that the problem will go away after enabling Kokkos_ENABLE_Cuda_Relocatable_Device_Code. I will let you know after testing both codes.

ndellingwood commented 5 years ago

@etphipp you're right, my build was static, so it didn't expose the issue. It doesn't look to me that there is an option for shared libs through the makefile build system in Kokkos, will need to go the CMake route.

etphipp commented 5 years ago

I did this myself with a non-Tribits-based CMake build of Kokkos in another software project, and got the same Kokkos linking error:

[ 54%] Linking CXX shared library lib/libgentenlib.so
nvlink error   : Undefined reference to '_ZN6Kokkos4Impl25g_device_cuda_lock_arraysE' in 'CMakeFiles/gentenlib.dir/src/Genten_FacMatrix.cpp.o'

This is also on an x86 platform, so it isn't Power-related (as I hoped).

srajama1 commented 5 years ago

Folks, we just checked with the ATDM stakeholders. This feature is a "nice to have" this FY and a "must have" by Q1 of next FY. It is not a fire-drill, so we have time to clean it up.

etphipp commented 5 years ago

Frankly, I feel this must be a toolchain issue, since the symbol is present:

etphipp@elbert kokkos $ nm libkokkos.so | grep _ZN6Kokkos4Impl25g_device_cuda_lock_arraysE
0000000000366c40 B _ZN6Kokkos4Impl25g_device_cuda_lock_arraysE

and we probably need to get someone from Nvidia involved to resolve it.
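
For repeating this check across several libraries, the type letter can be pulled out of the `nm` output with a small filter. A minimal sketch; the function name is made up for illustration. In `nm` notation, `B` marks a global symbol in the host BSS (uninitialized data) section, so this only confirms a host-side definition and says nothing by itself about what nvlink can see when resolving device references. One possibly relevant detail that nobody in this thread has confirmed: the nvcc documentation on separate compilation describes the device linker as reading static libraries and ignoring dynamic ones.

```shell
# Hypothetical helper: given `nm` output on stdin, print the symbol-type
# letter that nm reports for one symbol name (prints nothing if the symbol
# does not appear in the input).
nm_symbol_type() {
  awk -v sym="$1" '$NF == sym { print $(NF - 1) }'
}

# Example with the library from the comment above:
# nm libkokkos.so | nm_symbol_type _ZN6Kokkos4Impl25g_device_cuda_lock_arraysE
```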

bartlettroscoe commented 5 years ago

@kyungjoo-kim, still have issues reproducing the failures? All evidence on CDash is that Kokkos_ENABLE_Cuda_Relocatable_Device_Code is being set to ON looking at, for example:

cmake STDOUT for Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt:

--   KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE

CMakeCache.txt file for Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-shared-release-debug-pt:

KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE:BOOL=ON
...
Kokkos_ENABLE_Cuda_Relocatable_Device_Code:BOOL=ON

bartlettroscoe commented 5 years ago

@etphipp said:

we probably need to get someone from Nvidia involved to resolve it.

Who is our contact with NVIDIA?

Note that this is also an issue on 'waterman', as shown here for example. So there is a good chance we would see this on the production machines too.

kyungjoo-kim commented 5 years ago

@bartlettroscoe When I tested

[kyukim @white11] tmp >  source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt
Hostname 'white11' matches known ATDM host 'white' and system 'ride'
Setting compiler and build options for buld name 'Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt'
Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37
[kyukim @white11] tmp >  cmake  -GNinja  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Intrepid2=ON  $TRILINOS_DIR
-- ****************** Kokkos Settings ******************
-- Execution Spaces
--   Device Parallel: Cuda
--     Host Parallel: None
--       Host Serial: Serial
-- 
-- Architectures:
--     Power8,Kepler37
-- 
-- Enabled options
--   KOKKOS_ENABLE_SERIAL
--   KOKKOS_ENABLE_CUDA
--   KOKKOS_ENABLE_CUDA_LAMBDA
--   KOKKOS_ENABLE_CUDA_UVM
--   KOKKOS_ENABLE_DEBUG
--   KOKKOS_ENABLE_DEPRECATED_CODE

This is what I see. It is okay though. Now I know that the CDash tests enable the Kokkos RDC flag, and I can just test the code by enabling the flag myself.

srajama1 commented 5 years ago

@bartlettroscoe : It is @nmhamster

@nmhamster : Another one for your list of things you can nag vendors about ....

bartlettroscoe commented 5 years ago

@kyungjoo-kim, what version of Trilinos are you testing here? (You can attach the generated TrilinosRepoVersion.txt file with that info.)

kyungjoo-kim commented 5 years ago

@bartlettroscoe the atdm environment sets the following version.

[kyukim @white11] tmp > cmake --version
cmake version 3.11.2

bartlettroscoe commented 5 years ago

@kyungjoo-kim, sorry, my bad. I meant what version of Trilinos are you trying to configure and build?

bartlettroscoe commented 5 years ago

@etphipp asked:

@bartlettroscoe, is it possible to do a shared+RDC Trilinos build on a non-Power platform?

Yes, if you have access to the CEE LAN, you could try logging into 'ascicgpu15' and using a build named:

and give that a try as per:

You should be able to do this build on any RHEL7 machine with a GPU that has the SEMS env loaded. (The only machine like that that I can get to is 'ascicgpu15'.)