Build and test failures in ATDM RDC builds on white and waterman #4502

Closed fryeguy52 closed 5 years ago

fryeguy52 commented 5 years ago

CC: Trilinos Product areas leads: @jwillenbring, @rppawlo, @kddevin, @mperego, @srajama1

Other CC: @bartlettroscoe @fryeguy52

Next Action Status

Next: Waiting for PR #4761 to get tested, approved, and merged ...

Description

As shown here, there are several failing tests and build errors in the builds:

all test failures

build error output

These are builds that enable CUDA relocatable device code (RDC). Most of the errors look something like:

nvlink error   : Undefined reference to '_ZN6Sacado4Impl40global_sacado_cuda_memory_pool_on_deviceE' in 'packages/intrepid2/unit-test/Discretization/Basis/HDIV_HEX_In_FEM/Serial/CMakeFiles/Intrepid2_unit-test_Discretization_Basis_HDIV_HEX_In_FEM_Serial_Test_01_SLFadDouble.dir/test_01_SLFadDouble.cpp.o'

or

nvlink warning : Stack size for entry function '_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_000160f8_00000000_6_Kokkos_Cuda_Task_cpp1_ii_b2872e7123cuda_task_queue_executeEPNS0_9TaskQueueINS_4CudaEEEi' cannot be statically determined
collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.
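
(For context: RDC is nvcc's separable-compilation mode, --relocatable-device-code=true, a.k.a. -rdc=true, which defers resolution of device symbols to a separate nvlink step at the end of the build. A minimal sketch of the per-target knob using CMake's native CUDA support is below; the target name is hypothetical, and the actual ATDM builds enable RDC through the Kokkos/ATDM configuration instead.)

# Sketch only (assumes a project() with the CUDA language enabled):
# request RDC for a hypothetical target; this adds -rdc=true to the
# nvcc compile lines for that target's sources.
ADD_LIBRARY(my_cuda_kernels STATIC kernels.cu)
SET_TARGET_PROPERTIES(my_cuda_kernels PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON)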

Current Status on CDash

Steps to Reproduce

One should be able to reproduce this failure on ride or white as described in:

More specifically, the commands given for ride or white are provided at:

For the "-pt" builds <build-name>:

and for <Package> = Kokkos, KokkosKernels, Belos, etc., the commands to reproduce the build and test failures should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16

For the other builds with <build-name>:

and for <Package> = Kokkos, KokkosKernels, Belos, etc., the commands to reproduce the build and test failures on 'white' or 'ride' should be:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -q rhel7F -n 16 ctest -j16

To reproduce the build and test failures for the 'waterman' builds with <build-name>:

one uses:

$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
 $TRILINOS_DIR
$ ninja -j16
$ bsub -x -Is -n 20  ctest -j16

kyungjoo-kim commented 5 years ago

@bartlettroscoe I synced my branch with upstream and it has the right enable flag for RDC.

etphipp commented 5 years ago

You should be able to do this build on any RHEL7 machine with a GPU that has the SEMS env loaded. (The only such machine that I can get to is 'ascicgpu15'.)

Thanks @bartlettroscoe. I already tried a build on an x86 machine and get the same linking errors.

kyungjoo-kim commented 5 years ago

@ndellingwood When I am debugging intrepid2, I see the following error:

error: The closure type for a lambda ("lambda [](int)->void"

The lambda is somehow missing __device__. Any idea?

kyungjoo-kim commented 5 years ago

@ndellingwood my bad... there is a typo in my configuration script.

etphipp commented 5 years ago

I think I have just determined that NVCC doesn't support shared+RDC. From the NVCC documentation:

6.3. Libraries The device linker has the ability to read the static host library formats (.a on Linux and Mac OS X, .lib on Windows). It ignores any dynamic (.so or .dll) libraries.
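
In CMake terms, the constraint amounts to something like the following sketch (the option name here follows the Kokkos spelling of the time and may differ from the actual ATDM configuration):

# Sketch: nvlink's device-link step reads only static archives (.a), so a
# build that enables RDC cannot also use shared Trilinos libraries.
IF (BUILD_SHARED_LIBS AND Kokkos_ENABLE_Cuda_Relocatable_Device_Code)
  MESSAGE(FATAL_ERROR "CUDA RDC requires static libs: nvlink ignores .so files")
ENDIF()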

kyungjoo-kim commented 5 years ago

Thanks very, very much @etphipp. If we exclude errors from shared libraries, intrepid2, kokkoskernels, kokkos, and tacho do not have build errors. @mperego @srajama1 @ndellingwood

@bartlettroscoe Maybe we can remove the package labels?

srajama1 commented 5 years ago

@etphipp @kyungjoo-kim: Thank you both for looking at this right away.

bartlettroscoe commented 5 years ago

@etphipp said:

I think I have just determined that NVCC doesn't support shared+RDC

Yea, that seems to be the case:

But is that universally true? As shown here, how is it that some Kokkos test executables build and run just fine with shared libs while only 3 executables fail to link, as shown here?

If CUDA does not support shared+RDC, what does this mean for SPARC, which has to build with shared libs with CUDA on some systems? Can SPARC just never use RDC?

Could the RDC-related object code be split off and put into a static lib that the final executable links in along with the shared libs? Is that possible? Otherwise, we could mix shared and static libs if we are careful.

ibaned commented 5 years ago

But is that universally true? As shown here, how is it that some Kokkos test executables build and run just fine with shared libs while only 3 executables fail to link, as shown here?

luck

If CUDA does not support shared+RDC, what does this mean for SPARC, which has to build with shared libs with CUDA on some systems? Can SPARC just never use RDC?

If shared libraries are a requirement, RDC is out.

Could the RDC-related object code be split off and put into a static lib that the final executable links in along with the shared libs? Is that possible? Otherwise, we could mix shared and static libs if we are careful.

I think the possibility of cheating our way out of this is low, but anyone is welcome to try. If RDC is enabled, any code that does anything remotely Kokkos-related needs to be compiled with RDC.

bartlettroscoe commented 5 years ago

@ibaned said:

If shared libraries are a requirement, RDC is out.

So SPARC is out of luck on using RDC because it requires shared libs when CUDA is enabled? So we just tell SPARC developers "sorry, no RDC for you"?

etphipp commented 5 years ago

Before worrying about potential work-arounds, I think we need to better understand the requirements for shared libraries and RDC, i.e., is this a real requirement, or is it just something people would like to have. If it is in fact a real requirement, the best path forward is to get Nvidia to support it. Trying to cobble together shared and static libraries is a recipe for disaster. Furthermore, we likely only see a few linking errors because currently very little of the CUDA code in Trilinos (whether through Kokkos or something else) actually uses RDC. If it were used pervasively, I seriously doubt working around the issue would even be possible.

bartlettroscoe commented 5 years ago

@etphipp said:

Before worrying about potential work-arounds, I think we need to better understand the requirements for shared libraries and RDC, i.e., is this a real requirement, or is it just something people would like to have.

We should talk about this with Micah at the next ATDM PI meeting.

Seems lame to have a system build tool that does not support shared libraries in 2019 (just another example of how bad this is). Perhaps native CUDA/GPU support in Clang will address this if NVIDIA can't?

nmhamster commented 5 years ago

@bartlettroscoe - FYI - I reached out to NVIDIA this morning to get some information/guidance on this particular issue. Will report back to you all shortly.

etphipp commented 5 years ago

@bartlettroscoe said:

Perhaps native CUDA/GPU support in Clang will address this if NVIDIA can't?

My impression is Clang uses parts of the Nvidia toolchain behind the scenes, so I suspect not, but perhaps @crtrott or @nmhamster know.

Replacing our CUDA builds with Clang is something we should pursue though. The first step is to get SEMS to add Clang 7.0 to the SEMS environments.

ibaned commented 5 years ago

Seems lame to have a system build tool that does not support shared libraries in 2019

I think it is RDC that is underdeveloped; shared libraries work fine without RDC. I would recommend applications take some time to reflect on whether relying on RDC as a requirement is such a great idea.

My impression is Clang uses parts of the Nvidia toolchain behind the scenes, so I suspect not

That's my understanding too; RDC support in Clang has lagged NVCC.

etphipp commented 5 years ago

Regarding the static-RDC builds, I still cannot reproduce any failures. I have run the Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt build (using the scripts described in the initial comment) several times on both white and ride, and still can't get any failures.

Does the build environment differ in any way between how the nightlies are run and these ATDM scripts? All of the Stokhos tests fail with something like

collect2: fatal error: ld terminated with signal 9 [Killed]
compilation terminated.

which sounds like something is killing the linker. I am guessing either memory exhaustion or CPU overload.

etphipp commented 5 years ago

Out of curiosity, are these builds done on the head node or a compute node? It doesn't seem to matter for me, as I have tried both and don't get any failures. However, I am wondering if several other builds going on at the same time are overloading the machine.

mhoemmen commented 5 years ago

@ibaned wrote:

I would recommend applications take some time to reflect on whether relying on RDC as a requirement is such a great idea.

SPARC could perhaps go without, but Aria right now really needs virtual method calls on device and would need a big redesign to go without.

ibaned commented 5 years ago

In that case we should collectively put real pressure on NVIDIA to complete the RDC implementation.

mhoemmen commented 5 years ago

@vbrunini or @rrdrake might like to comment here about Aria -- if I remember right, usersubs require dlopen, even if we build static.

bartlettroscoe commented 5 years ago

@etphipp asked:

Out of curiosity, are these builds done on the head node or a compute node?

Compute nodes.

etphipp commented 5 years ago

Then I have no clue. There must be some difference in the environment between the automated tests and my builds because I can't get the builds to fail.

bartlettroscoe commented 5 years ago

@etphipp said:

I can't get the builds to fail.

Can you copy and paste your exact commands after logging onto 'white' or 'ride' so I can try to reproduce?

etphipp commented 5 years ago

On white:

TRILINOS_DIR=$HOME/project/atdm/atdm_trilinos/Trilinos

source $TRILINOS_DIR/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt

cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnvAllPtPackages.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_Stokhos=ON \
  $TRILINOS_DIR

make -j16 -k

hcedwar commented 5 years ago

Citation for what Roscoe noted: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda

says: 6.3. Libraries The device linker has the ability to read the static host library formats (.a on Linux and Mac OS X, .lib on Windows). It ignores any dynamic (.so or .dll) libraries. The --library and --library-path options can be used to pass libraries to both the device and host linker. The library name is specified without the library file extension when the --library option is used.

crtrott commented 5 years ago

@etphipp when RDC support in Clang is required, you need to link with nvcc.

bartlettroscoe commented 5 years ago

@hcedwar,

Any plans/hope for supporting RDC with shared libs in a future version of CUDA?

hcedwar commented 5 years ago

I have no info on this.

bartlettroscoe commented 5 years ago

@etphipp

Using ninja -j64 (which is what the automated builds use) I was able to reproduce the Stokhos link failures (see details below). When I backed off to ninja -j16, Stokhos built and ran fine (see details below).

So it seems the problem is that building with RDC is more sensitive to a higher build parallelism level. Is this due to exhausting memory on the system? Is RDC binary code larger?

In any case, I will try running a full build with a lower parallel level (I will try 32) and see what happens. But if backing off the parallel build level fixes these types of problems then that is what we will do (and suffer a longer wall-clock time to build).

Build and test results summary (click to expand) To try to reproduce Stokhos build errors on ride: ``` $ bsub -x -Is -q rhel7F -n 16 \ ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-rdc-release-debug \ --enable-packages=Stokhos --local-do-all ***Forced exclusive execution Job <854691> is submitted to queue . <> <> *** *** ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-rdc-release-debug --enable-packages=Stokhos --local-do-all *** ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/../../..' Load some env to get python, cmake, etc ... Hostname 'ride14' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'default' Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8 File local-checkin-test-defaults.py already exists, leaving it! Running configure, build, and/or testing for 1 builds: cuda-9.2-gnu-7.2.0-rdc-release-debug *** *** 0) Process build case cuda-9.2-gnu-7.2.0-rdc-release-debug *** Hostname 'ride14' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'cuda-9.2-gnu-7.2.0-rdc-release-debug' Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37 Running: checkin-test.py --st-extra-builds=cuda-9.2-gnu-7.2.0-rdc-release-debug ... ==> See output file checkin-test.cuda-9.2-gnu-7.2.0-rdc-release-debug.out + /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/../../../cmake/tribits/ci_support/checkin-test.py '--make-options=-j 64' '--ctest-options=-j 8' --st-extra-builds=cuda-9.2-gnu-7.2.0-rdc-release-debug --default-builds= --allow-no-pull --send-email-to= --test-categories=NIGHTLY --ctest-timeout=600 --enable-packages=Stokhos --local-do-all --use-ninja --log-file=checkin-test.cuda-9.2-gnu-7.2.0-rdc-release-debug.out + ATDM_CHT_SINGLE_RETURN_CODE=1 + set +x cuda-9.2-gnu-7.2.0-rdc-release-debug: FAILED! Collect and report final results: ==> See output file checkin-test.final.out -------------------------------- FAILED (NOT READY TO PUSH): Trilinos: ride14 Thu Mar 14 12:23:17 MDT 2019 Enabled Packages: Stokhos Build test results: ------------------- 0) MPI_RELEASE_DEBUG_SHARED_PT_OPENMP => Test case MPI_RELEASE_DEBUG_SHARED_PT_OPENMP was not run! => Does not affect push readiness! (-1.00 min) 1) cuda-9.2-gnu-7.2.0-rdc-release-debug => FAILED: build failed => Not ready to push! (58.83 min) REQUESTED ACTIONS: FAILED ``` A little more detail in the file `checkin-test.cuda-9.2-gnu-7.2.0-rdc-release-debug.out`: ``` FAILED: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug: build failed Thu Mar 14 12:23:10 MDT 2019 Enabled Packages: Stokhos Hostname: ride14 Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../.. 
Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=600.0 -GNinja -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_Stokhos:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=OFF Make Options: -j 64 CTest Options: -j 8 Pull: Not Performed Configure: Passed (1.68 min) Build: FAILED (57.16 min) Test: FAILED (-1.00 min) ``` The build errors were like: ``` FAILED: packages/stokhos/test/UnitTest/Stokhos_QuadraturePseudoSpectralExpansionUnitTest.exe ... nvlink warning : Stack size for entry function '_ZN6Kokkos4Impl75_GLOBAL__N__51_tmpxft_0000e504_00000000_6_Kokkos_Cuda_Task_cpp1_ii_b2872e7123cuda_task_queue_executeEPNS0_9TaskQueueINS_4CudaEEEi' cannot be statically determined collect2: fatal error: ld terminated with signal 9 [Killed] compilation terminated. ``` I have a suspicion that this might be due to having too high of a build level. Trying it again but this time with `-j 16': ``` $ bsub -x -Is -q rhel7F -n 16 \ ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-rdc-release-debug \ --enable-packages=Stokhos --make-options=-j16 --local-do-all ***Forced exclusive execution Job <854692> is submitted to queue . <> <> *** *** ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-rdc-release-debug --enable-packages=Stokhos --make-options=-j16 --local-do-all *** ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/../../..' Load some env to get python, cmake, etc ... Hostname 'ride12' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'default' Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8 File local-checkin-test-defaults.py already exists, leaving it! Running configure, build, and/or testing for 1 builds: cuda-9.2-gnu-7.2.0-rdc-release-debug *** *** 0) Process build case cuda-9.2-gnu-7.2.0-rdc-release-debug *** Hostname 'ride12' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'cuda-9.2-gnu-7.2.0-rdc-release-debug' Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37 Running: checkin-test.py --st-extra-builds=cuda-9.2-gnu-7.2.0-rdc-release-debug ... ==> See output file checkin-test.cuda-9.2-gnu-7.2.0-rdc-release-debug.out + /home/rabartl/Trilinos.base/Trilinos/cmake/std/atdm/../../../cmake/tribits/ci_support/checkin-test.py '--make-options=-j 64' '--ctest-options=-j 8' --st-extra-builds=cuda-9.2-gnu-7.2.0-rdc-release-debug --default-builds= --allow-no-pull --send-email-to= --test-categories=NIGHTLY --ctest-timeout=600 --enable-packages=Stokhos --make-options=-j16 --local-do-all --use-ninja --log-file=checkin-test.cuda-9.2-gnu-7.2.0-rdc-release-debug.out + ATDM_CHT_SINGLE_RETURN_CODE=0 + set +x cuda-9.2-gnu-7.2.0-rdc-release-debug: PASSED! 
Collect and report final results: ==> See output file checkin-test.final.out -------------------------------- PASSED (NOT READY TO PUSH): Trilinos: ride12 Thu Mar 14 13:11:56 MDT 2019 Enabled Packages: Stokhos Build test results: ------------------- 0) MPI_RELEASE_DEBUG_SHARED_PT_OPENMP => Test case MPI_RELEASE_DEBUG_SHARED_PT_OPENMP was not run! => Does not affect push readiness! (-1.00 min) 1) cuda-9.2-gnu-7.2.0-rdc-release-debug => passed: passed=84,notpassed=0 (32.98 min) A current successful pull does *not* exist => Not ready for final push! Explanation: In order to safely push, the local working directory needs to be up-to-date with the global repo or a full integration has not been performed! REQUESTED ACTIONS: PASSED ``` with details: ``` passed: Trilinos/cuda-9.2-gnu-7.2.0-rdc-release-debug: passed=84,notpassed=0 Thu Mar 14 13:11:51 MDT 2019 Enabled Packages: Stokhos Hostname: ride12 Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../.. Build Dir: /home/rabartl/Trilinos.base/BUILDS/RIDE/CHECKIN/cuda-9.2-gnu-7.2.0-rdc-release-debug CMake Cache Varibles: -GNinja -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=NIGHTLY -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=600.0 -GNinja -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake -DTrilinos_TRACE_ADD_TEST=ON -DTrilinos_ENABLE_Stokhos:BOOL=ON -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=OFF Make Options: -j16 CTest Options: -j 8 Pull: Not Performed Configure: Passed (1.49 min) Build: Passed (27.25 min) Test: Passed (4.24 min) 100% tests passed, 0 tests failed out of 84 Subproject Time Summary: Stokhos = 1230.82 sec*proc (84 tests) Total Test time (real) = 254.23 sec Total time for cuda-9.2-gnu-7.2.0-rdc-release-debug = 32.98 min ``` So it seems the problem is that building with RDC is more sensitive to building with a higher build parallelism level. Is this due to exhausting memory on the system? Is RDC binary code larger?

lucbv commented 5 years ago

@bartlettroscoe was this a static build?

bartlettroscoe commented 5 years ago

@lucbv asked:

@bartlettroscoe was this a static build?

https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#shared_static

lucbv commented 5 years ago

@bartlettroscoe if you deduce the configuration of a build from its name, you will quickly get bad surprises; sorry for checking instead of making assumptions...

bartlettroscoe commented 5 years ago

@lucbv said

if you deduce the configuration of a build based on its name you will quickly have bad surprises

What do you mean?

etphipp commented 5 years ago

Thanks @bartlettroscoe. I agree that it is likely too high a parallelism level in the build, and the OS is killing the linker. That was my suspicion all along.

Btw, this could have been sorted out much sooner had the instructions in the initial comment for this issue been consistent with what the automated builds actually use.

bartlettroscoe commented 5 years ago

@etphipp said:

Btw, this could have been sorted out much sooner had the instructions in the initial comment for this issue been consistent with what the automated builds actually use.

Agreed. This was basically a copy and paste error.

bartlettroscoe commented 5 years ago

@lucbv asked:

@bartlettroscoe was this a static build?

Sorry, I misread that the first time. Static libs are the default, so if you leave them out of the build name, that is what you get, as described in:

Just spending too much time in ATDM where 90% of the builds are static :-)

bartlettroscoe commented 5 years ago

FYI: I will provide more details here in a bit, but it appears that, at least on 'ride', if you reduce the parallel build level, the entire (implicitly) static Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt build completes successfully (passing all Stokhos and Panzer tests, for example) except for 29 build errors for TrilinosCouplings (which occur even with build level ninja -j1). I am now experimenting with the parallel link level (using the CMake Ninja generator "JOB POOL" feature) to see how high we can go and still get successful builds without increasing our build wall-clock time by more than necessary (which will especially be an issue for a future CUDA PR build).

In the coming days we will:

Sorry for the hassle these RDC builds have caused. Hopefully we will soon have clean cuda+rdc+static builds cleaned up and promoted so that we can maintain them and have them in place when the apps need them. (Given that the ASC IC app Sierra is already relying on RDC being turned on, Trilinos needs to support RDC builds right now, independent of what ATDM needs or does not need.)

bartlettroscoe commented 5 years ago

Some initial experiments with just linking the 2623 *.exe targets show that ninja -j16 was able to link all the targets in the build Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt on ride (minus TrilinosCouplings, which does not build even with ninja -j1). (See details below.)

Therefore, after I implement the TriBITS option to limit the number of link processes (see https://github.com/TriBITSPub/TriBITS/issues/281) I will be able to run experiments on the full build of objects and libraries to see if we can use 64 *.o build processes but only 16 *.exe link processes at the same time and get everything to build. (If not, I will test ramping down.)
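
For reference, here is a minimal sketch of the Ninja job-pool mechanism in plain CMake (ahead of whatever form the TriBITS option ends up taking):

# Sketch: allow up to 64 concurrent compile jobs but only 16 concurrent
# link jobs, so memory-hungry nvlink/ld invocations are not OOM-killed.
SET_PROPERTY(GLOBAL PROPERTY JOB_POOLS compile_pool=64 link_pool=16)
SET(CMAKE_JOB_POOL_COMPILE compile_pool)
SET(CMAKE_JOB_POOL_LINK link_pool)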

Build and test results summary (click to expand) **(3/15/2019)** Run the full `cuda-9.2-gnu-7.2.0-rdc-release-debug` build with only 32 cores: ``` $ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/ $ ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-release-debug *** *** ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-release-debug *** ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos' Load some env to get python, cmake, etc ... Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'default' Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8 Running builds: cuda-9.2-gnu-7.2.0-rdc-release-debug Running Jenkins driver Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug.sh ... Creating directory: Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug Creating directory: SRC_AND_BUILD *** *** ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-release-debug-pt *** ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos' Load some env to get python, cmake, etc ... Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'default' Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8 Running builds: cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Running Jenkins driver Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt.sh ... Creating directory: Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Creating directory: SRC_AND_BUILD real 480m16.583s user 0m0.887s sys 0m0.510s ``` This submitted to CDash at: * [Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt-exp](https://testing.sandia.gov/cdash-dev-view/index.php?project=Trilinos&parentid=4719063) Looking at the reuslts, one can see that the Stokhos build failures went away but there are still 40 total build failures for Panzer and TrilinosCouplings. As shown [here](https://testing.sandia.gov/cdash-dev-view/viewBuildError.php?buildid=4719063), these build failures all appear to occur when linking executbles. To verify that these failures are due to too high of a parallel link level, I ran this again with: ``` $ cd /home/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/ $ env \ CTEST_DO_CONFIGURE=OFF \ ATDM_CONFIG_BUILD_COUNT_OVERRIDE=8 \ CTEST_START_WITH_EMPTY_BINARY_DIRECTORY=FALSE \ ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-release-debug-pt *** *** ./ctest-s-local-test-driver.sh cuda-9.2-gnu-7.2.0-rdc-release-debug-pt *** ATDM_TRILINOS_DIR = '/home/rabartl/Trilinos.base/Trilinos' Load some env to get python, cmake, etc ... Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'default' Using white/ride compiler stack GNU-7.2.0 to build DEBUG code with Kokkos node type SERIAL and KOKKOS_ARCH=Power8 Running builds: cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Running Jenkins driver Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt.sh ... real 456m30.775s user 0m0.730s sys 0m0.413s ``` That submited to CDash: * [Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt-exp](https://testing.sandia.gov/cdash-dev-view/index.php?project=Trilinos&parentid=4724347) and this time only showed 29 build errors in TrilinosCouplings. There were no reported errors in Panzer. 
Now to manually try to build the rest of the TeuchosCouplings targets and see what happens: ``` $ cd Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/BUILD/ $ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt' Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37 $ bsub -x -Is -q rhel7F -n 16 bash $ cd packages/trilinoscouplings/ $ time make NP=16 &> make.out real 6m28.966s user 61m51.821s sys 6m1.172s ``` That still failed. Shoot, let's try less: ``` $ time make NP=8 &> make.out real 4m50.703s user 27m20.803s sys 1m53.646s ``` Darn, that failed too. What if we build with just one process (if that does not build, then it is not a parallel build level problem): ``` $ time make NP=1 &> make.out ``` That still failed with: ``` /ascldap/users/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/stk/stk_unit_test_utils/unit_main_lib/UnitTestMain.cpp:44: multiple definition of `main' packages/trilinoscouplings/examples/scaling/CMakeFiles/TrilinosCouplings_HybridIntrepidPoisson2D_Pamgen_Tpetra.dir/HybridIntrepidPoisson2D_Pamgen_Tpetra_main.cpp.o:/ascldap/users/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/trilinoscouplings/examples/scaling/HybridIntrepidPoisson2D_Pamgen_Tpetra_main.cpp:80: first defined here collect2: error: ld returned 1 exit status ninja: build stopped: subcommand failed. make: *** [all] Error 1 ``` Let's build all of the targets and ignore any that don't build: ``` $ time ninja -C ../.. 
-j 1 -k 999999 packages/trilinoscouplings/all &> make.out real 112m56.795s user 97m46.066s sys 6m52.699s ``` That last -j1 build showed 29 build errors: ``` $ grep "FAILED" make.out | wc -l 29 ``` which were: ``` $ grep "FAILED" make.out FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_HybridIntrepidPoisson2D_Pamgen_Tpetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson2D_p2_tpetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson2D_p2.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Darcy.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_IntrepidPoisson_Pamgen_Tpetra.exe FAILED: packages/trilinoscouplings/examples/fenl/TrilinosCouplings_fenl_ensemble.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson2D_pn_tpetra.exe FAILED: packages/trilinoscouplings/examples/fenl/TrilinosCouplings_fenl_pce.exe FAILED: packages/trilinoscouplings/examples/ml/NonlinML/TrilinosCouplings_ml_nox_1Delasticity_example.exe FAILED: packages/trilinoscouplings/examples/epetraext/TrilinosCouplings_EpetraExt_Isorropia_LPTrans_Ex.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_IntrepidPoisson_Pamgen_EpetraAztecOO.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_epetraTpetraImportBenchmark.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_StructuredIntrepidPoisson_Pamgen_Tpetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_HybridIntrepidPoisson3D_Pamgen_Tpetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_StabilizatedADR.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_GradDiv.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Maxwell.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson2D_pn.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_NoFE_Epetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson2D.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_BlockMaterials.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_NoFE.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_DivLSFEM.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_CurlLSFEM.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_CVFEM.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_IntrepidPoisson_Pamgen_Epetra.exe FAILED: packages/trilinoscouplings/examples/scaling/TrilinosCouplings_Example_Poisson_NoFE_Tpetra.exe FAILED: packages/trilinoscouplings/examples/fenl/TrilinosCouplings_fenl.exe ``` That suggests that these 29 build errors were not due to running too many parallel link jobs but are fundamental errors. Therefore, in future experiments, I will disable TrilinosCouplings to have a good baseline. 
**(3/16/2019)** Now to get a sense of build parallelism, let's look at the number of *.o, *.a, and *.exe files in the full buil `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt`: ``` $ cd rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/BUILD/ $ find . -type f -name "*.o" | wc -l 9585 $ find . -type f -name "*.a" | wc -l 189 $ find . -type f -name "*.exe" | wc -l 2498 ``` So the number of executables dominates the number of libraries so lumping in libraries with execuables in a lower Ninja "JOB POOL" will not slow down the wall-clock time much over just limiting the linking of executabless. But the ratio of object files to executables is a bit surprising as it is only about 4:1. Therefore, given that links are very expensive limiting the parallel link level will have a big (negative) impact on the total build wall-clock time. CMake and Ninja are flexible enough to allow creating different job pools to limit link parallelism differently in different packages (e.g. using less parallel link processes in Stokhos and Panzer) but that would add more complexity to TriBITS and would require a lot of experimentation to figure out good levels for the code. (And changes to Trilinos will increase this anyway.) Therefore, I will just focus on finding a single parallel link level that should work. **(3/18/2019)** I can do some experimentation before implementing the job pool options. To do this, I will delete all of the *.exe files and then build with a given parallel level and see if everything links. I will start high and go low. First, configure without TrilinosCouplings: ``` $ bsub -x -Is -q rhel7F -n 16 bash Job <854796> is submitted to queue . <> <> $ cd $ . bash_profile $ cd rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/BUILD/ $ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \ Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt' Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37 $ time cmake -DTrilinos_ENABLE_TrilinosCouplings=OFF . &> configure.out real 6m30.749s user 2m14.448s sys 0m17.051s $ grep TrilinosCouplings configure.out ... Final set of non-enabled packages: Pliris Claps Trios Komplex TriKota Moertel PyTrilinos NewPackage TrilinosCouplings 9 ... $ time find . -type f -name "*.exe" -exec rm {} \; real 4m25.131s user 0m6.984s sys 0m5.203s $ time ninja -j32 &> make.out real 372m45.265s user 8837m22.025s sys 726m45.582s $ grep "^FAILED: " make.out | wc -l 4 ``` Okay, so that gave link failures: ``` $ grep "^FAILED: " make.out FAILED: packages/panzer/adapters-stk/test/stk_interface_test/PanzerAdaptersSTK_tCubeTetMeshFactory.exe FAILED: packages/panzer/adapters-stk/test/stk_interface_test/PanzerAdaptersSTK_tSTKInterface.exe FAILED: packages/panzer/adapters-stk/test/stk_interface_test/PanzerAdaptersSTK_tCubeHexMeshFactory.exe FAILED: packages/panzer/adapters-stk/test/panzer_workset_builder/PanzerAdaptersSTK_d_workset_builder.exe ``` The build time of 374m (6h 14m) is excessively long. 
It seems that, for some reason, a bunch of object files were getting built too as shown from: ``` $ head make.out [1/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o [2/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostBarrier.cpp.o [3/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Spinwait.cpp.o [4/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Profiling_Interface.cpp.o [5/26484] Building CXX object packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_DefaultInit_7.dir/UnitTestMain.cpp.o [6/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_MemoryPool.cpp.o [7/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_Error.cpp.o [8/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostThreadTeam.cpp.o [9/26484] Building CXX object packages/kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_HostSpace.cpp.o [10/26484] Building CXX object packages/kokkos/core/unit_test/CMakeFiles/KokkosCore_UnitTest_HostBarrier.dir/TestHostBarrier.cpp.o ``` Note sure what triggered that. But I will watch for that carefully next time. **(3/19/2019)** So we need to turn down the parallel link level. let's try `ninja -j16`: ``` $ bsub -x -Is -q rhel7F -n 16 bash Job <854796> is submitted to queue . <> <> $ cd $ . bash_profile $ cd rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/BUILD/ $ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh \ Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt Hostname 'ride6' matches known ATDM host 'ride' and system 'ride' Setting compiler and build options for buld name 'Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt' Using white/ride compiler stack CUDA-9.2_GNU-7.2.0 to build RELEASE-DEBUG code with Kokkos node type CUDA and KOKKOS_ARCH=Power8,Kepler37 $ time rm `find . -type f -name "*.exe"` real 1m20.337s user 0m0.495s sys 0m1.344s $ time ninja -j16 &> make.out real 210m16.348s user 2334m38.127s sys 165m1.595s $ grep "^FAILED: " make.out | wc -l 0 ``` Okay, that seemed to work. So if I build again, it should build nothing, right? Trying: ``` $ time ninja -j1 ninja: no work to do. real 0m17.367s user 0m1.883s sys 0m1.122s ``` Okay, so `ninja -j16` seems to allow the cuda+rdc+static builds to work. This is not building any object files but hopefully those would not impact the links too much. In any case, we can experiment with different object and link parallel levels once I extend TriBITS for this. But wow, that is 210m to link 2623 `*.exe targets`. That is an average of 4.8s per *.exe link target! 
But note that as shown in [this query](https://testing.sandia.gov/cdash-dev-view/index.php?project=Trilinos&date=2019-03-19&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt&field2=buildstarttime&compare2=83&value2=2019-03-10) the full RDC build `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt` takes over 5h most days on 'white' and 'ride' while the non-RDC build `Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt` only takes about 2h 50m on most days as shown in [this query](https://testing.sandia.gov/cdash-dev-view/index.php?project=Trilinos&date=2019-03-19&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug-pt&field2=buildstarttime&compare2=83&value2=2019-03-10). Why is the RDC build so much more expensive than the non-RDC build? And with needing to reduce the parallel link level, it may increase the build wallclock time even more for the RDC build.

etphipp commented 5 years ago

It looks like at least some of the TrilinosCouplings link errors are due to multiple definitions of main(). For example, the FENL example link error is:

packages/stk/stk_unit_test_utils/libstk_unit_main.a(UnitTestMain.cpp.o): In function `main':
/ascldap/users/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/stk/stk_unit_test_utils/unit_main_lib/UnitTestMain.cpp:44: multiple definition of `main'
packages/trilinoscouplings/examples/fenl/CMakeFiles/TrilinosCouplings_fenl_ensemble.dir/main_ensemble.cpp.o:/ascldap/users/rabartl/Trilinos.base/BUILDS/RIDE/CTEST_S/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug-pt/SRC_AND_BUILD/Trilinos/packages/trilinoscouplings/examples/fenl/main_ensemble.cpp:978: first defined here
collect2: error: ld returned 1 exit status

It looks like stk/stk_unit_test_utils/libstk_unit_main.a is being included in the link when it shouldn't be. This example has no dependency on STK, so it must be coming from somewhere higher up in the dependency chain. Is there something in TriBITS to find out which target is inserting that dependency?

bartlettroscoe commented 5 years ago

@etphipp asked:

Is there somewhat in Tribits to find out which target is inserting that dependency?

The easiest thing to do is to enable the downstream package whose upstream dependencies you want to know and then watch the CMake STDOUT to see what gets enabled and why. For example, you would configure with:

-D Trilinos_ENABLE_TESTS=ON \
-D Trilinos_ENABLE_TrilinosCouplings=ON \

and then watch cmake STDOUT (redirected to a file).

bartlettroscoe commented 5 years ago

Actually, TrilinosCouplings defines a dependency on all of STK, as you can see here:

which shows:

SET(LIB_OPTIONAL_DEP_PACKAGES
  EpetraExt Isorropia Amesos AztecOO Belos Ifpack ML MueLu NOX Zoltan STK Stokhos)

A better question is: what is a main() function doing inside a package's library? That makes no sense.

etphipp commented 5 years ago

Yeah, I just saw that myself.

The main() is coming from the Unit_test_utils subpackage, which is probably just meant for unit testing, so it probably wasn't expected that anyone would include it as a dependency without needing it.

So I guess whoever owns the examples that use STK needs to redo the dependencies to include only the relevant subpackages, so that Unit_test_utils isn't pulled in.

etphipp commented 5 years ago

Actually, there is only one example in TrilinosCouplings that declares it uses STK, and it is this one in examples/scaling:

IF(${PACKAGE_NAME}_ENABLE_Epetra AND ${PACKAGE_NAME}_ENABLE_EpetraExt AND
   ${PACKAGE_NAME}_ENABLE_Amesos AND ${PACKAGE_NAME}_ENABLE_AztecOO AND
   ${PACKAGE_NAME}_ENABLE_Intrepid AND ${PACKAGE_NAME}_ENABLE_ML AND
   ${PACKAGE_NAME}_ENABLE_SEACAS AND ${PACKAGE_NAME}_ENABLE_STK
   AND ${PACKAGE_NAME}_ENABLE_STKIO AND ${PACKAGE_NAME}_ENABLE_STKMesh)
  TRIBITS_ADD_EXECUTABLE(
    Example_Poisson_STK
    SOURCES example_Poisson_stk.cpp
    )
  TRIBITS_COPY_FILES_TO_BINARY_DIR(CopyMeshFilesSTK
  SOURCE_FILES unit_cube_10int_hex.exo
               unit_cube_5int_tet.exo
  SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}
  DEST_DIR ${CMAKE_CURRENT_BINARY_DIR}
  )
ENDIF()

If that logic is actually correct, it only needs to depend on STKIO and STKMesh, and they should be optional test dependencies and not library dependencies.
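
In TriBITS terms, that suggestion would be a change to the TrilinosCouplings Dependencies.cmake along these lines (a sketch of the proposed fix, not the actual file):

# Sketch: make the needed STK subpackages optional *test* dependencies
# rather than library dependencies, so libstk_unit_main.a is no longer
# pulled into the link line of every TrilinosCouplings executable.
SET(LIB_OPTIONAL_DEP_PACKAGES
  EpetraExt Isorropia Amesos AztecOO Belos Ifpack ML MueLu NOX Zoltan Stokhos)
SET(TEST_OPTIONAL_DEP_PACKAGES
  STKIO STKMesh)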

etphipp commented 5 years ago

According to git-blame, @jhux2 created that example in this commit.

alanw0 commented 5 years ago

Yes, we've run into trouble before with that "main library". At one point it seemed to make sense in Sierra's bjam system to put a main in a library so that multiple unit-test executables could reuse it. But in the Trilinos system it seems to cause more trouble than it's worth...

bartlettroscoe commented 5 years ago

@etphipp said:

The main() is coming from the Unit_test_utils subpackage, which is probably just meant for unit testing, and so it probably isn't expected anyone would include it as a dependency that didn't need it.

Then it should be a test-only library, not a regular library. See TESTONLY and TESTONLYLIBS.

If that logic is actually correct, it only needs to depend on STKIO and STKMesh, and they should be optional test dependencies and not library dependencies.

Sounds like a 2-line change in the Dependencies.cmake to fix that. Can someone fix this?
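
For reference, the TESTONLY pattern looks roughly like this (a sketch reusing the stk_unit_main library named above; the downstream executable name is hypothetical):

# In the test-utility package: declare the library that holds main() as
# TESTONLY so it is never installed or linked into regular libraries.
TRIBITS_ADD_LIBRARY(
  stk_unit_main
  SOURCES UnitTestMain.cpp
  TESTONLY
  )

# In a downstream test: link it in explicitly via TESTONLYLIBS.
TRIBITS_ADD_EXECUTABLE(
  SomeUnitTest
  SOURCES SomeUnitTest.cpp
  TESTONLYLIBS stk_unit_main
  )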

alanw0 commented 5 years ago

Sounds perfect, I didn't know about that. Is TESTONLY a TriBITS thing or a CMake thing? In any case, I'll try to add that ASAP.

bartlettroscoe commented 5 years ago

@alanw0 asked:

Is TESTONLY a tribits thing or a cmake thing?

It is a TriBITS thing. Since TriBITS installs regular libraries and links them to downstream libs and executables by default, there had to be a way to opt out of that. The only purpose of a library that is not installed and not linked into production libraries and executables is to use it in tests and examples in the build tree, hence "TESTONLY".

If you grep:

$ cd Trilinos/
$ find . -name "CMakeLists.txt" -exec grep -nH TESTONLY {} \;

you should see several examples of this usage.

There is even a tested example in TribitsExampleProject.

Let me know if you have any problems or questions with this.

alanw0 commented 5 years ago

Thanks Ross @bartlettroscoe, I've got a build going now, and if it looks good I'll put in a pull request.

bartlettroscoe commented 5 years ago

@alanw0 said:

I've got a build going now, and if it looks good I'll put in a pull request.

Are you only adding 'TESTONLY' to some STK libs or are you also fixing the dependencies in TrilinosCouplings? If you are not fixing TrilinosCouplings dependencies on STK, I will go ahead and fix that, test it locally, and put in a separate PR.