trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

seacas/applications/explore/explore failing to build in Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug starting 2019-09-18 #6008

Closed bartlettroscoe closed 2 years ago

bartlettroscoe commented 4 years ago

CC: @trilinos/seacas, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52

## Next Action Status

## Description

As shown in [this query](https://testing.sandia.gov/cdash/index.php?project=Trilinos&begin=2019-09-01&end=2019-09-30&filtercount=1&showfilters=1&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug) the SEACAS executable:

* `seacas/applications/explore/explore`

started failing to build on testing day 2019-09-18 in the 'waterman' build:

* Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug

showing the build error (for example [here](https://testing.sandia.gov/cdash/viewBuildError.php?buildid=5650870)):

```
packages/seacas/libraries/suplib/libsuplib.a(convert.C.o): In function `__sti____cudaRegisterAll()':
tmpxft_00000d2a_00000000-5_convert.cudafe1.stub.c:11: undefined reference to `__cudaRegisterLinkedBinary_42_tmpxft_00000d2a_00000000_6_convert_cpp1_ii_convert_'
collect2: error: ld returned 1 exit status
```

The new commits that were pulled the day that these failures started are shown, for example, [here](https://testing.sandia.gov/cdash/viewNotes.php?buildid=5650870#!#note6). From looking over that set of commits, it seems likely that the merged PR #5920 is the cause.

## Current Status on CDash

The status of the SEACAS build on this system can be seen on CDash in:

* [`SEACAS` package status in `Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug` build over last 10 days](https://testing.sandia.gov/cdash/index.php?project=Trilinos&begin=10%20days%20ago&end=now&filtercount=2&showfilters=1&filtercombine=and&field1=buildname&compare1=61&value1=Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug&field2=subprojects&compare2=93&value2=SEACAS)

## Steps to Reproduce

One should be able to reproduce this failure on the machine `waterman` as described in:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md

More specifically, the commands given for the system are provided at:

* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#waterman

The exact commands to reproduce this issue should be:

```
$ cd /

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
  Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug

$ cmake \
  -GNinja \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_SEACAS=ON \
  $TRILINOS_DIR

$ ninja -j 20
```

bartlettroscoe commented 4 years ago

@gsjaardema,

NOTE: SEACAS is not having build problems on any other CUDA build or even the CUDA+RDC build on 'ride' as shown in this query which includes the builds:

  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
  • Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-static-release-debug 
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-release-debug
  • Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
  • Trilinos-atdm-waterman_cuda-9.2_shared_opt
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug

Also note that the failure to build the executable `seacas/applications/explore/explore` does not seem to trigger any SEACAS test failures. This suggests that this executable is not being tested.

Do the ATDM APPs (or any SNL customer) use this `seacas/applications/explore/explore` executable? If so, is it a risk that it could be broken and one would not know it because it is not tested?

gsjaardema commented 4 years ago

Yes, the explore application is used by SNL and other customers. It is probably a risk that it is not tested, but typically if it builds it works; not the best option, but it has been sufficient for a couple of decades.

I'm not sure why explore is not building on waterman. The change added a call to a C++ routine from the Fortran code, and it is the C++ routine (convert.C) which seems to be somehow triggering a call to a CUDA routine. There are a couple of references to CUDA or nvcc in the fmt/format.h include file in convert.C, but they do not call any CUDA routines and are just turning on or off some template code that is not supported on certain compilers.

My guess is that this is being compiled with NVCC, but since it is linked into a Fortran code there is a missing library that is normally added for C and C++ links. Is there a way to disable the nvcc compilation since this will never be used on the GPU? Something to add to one or more CMakeLists.txt files...?
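
One way to test this guess is to look for unresolved `__cudaRegisterLinkedBinary` symbols in the archive. Objects compiled by nvcc with relocatable device code reference a registration symbol that is normally supplied by a separate device-link step, so an object that still lists it as undefined has not been through that step. The sketch below assumes GNU `nm`; the helper name `rdc_unresolved_objects` is made up for illustration, and the archive path is the one from the error message above.

```shell
# Hypothetical helper: read `nm -A -u` output on stdin and print, once each,
# the "archive:member:" prefixes that still have an undefined RDC
# registration symbol.
rdc_unresolved_objects() {
  awk '/__cudaRegisterLinkedBinary/ { print $1 }' | sort -u
}

# Possible usage from the build directory (path taken from the error message):
#   nm -A -u packages/seacas/libraries/suplib/libsuplib.a | rdc_unresolved_objects
```

If only the `convert.C.o` member shows up, that would be consistent with the new C++ routine being the only nvcc-compiled object that ends up in a Fortran link.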

gsjaardema commented 4 years ago
bartlettroscoe commented 4 years ago

@gsjaardema asked:

> Is there a way to disable the nvcc compilation since this will never be used on the GPU? Something to add to one or more CMakeLists.txt files...?

Don't know.

@trilinos/kokkos-kernels, @trilinos/tpetra

Is there a way to tell nvcc_wrapper to not build certain files with nvcc but only use the host compiler? Looking at:

can this be done by adding `--host-only`?
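
Since `nvcc_wrapper` is a shell script, whether the installed copy mentions `--host-only` at all can be checked before trying it. This is only a sketch: the helper name `wrapper_mentions_flag` is made up for illustration, and finding the flag text in the script says nothing about how the wrapper behaves when the flag is passed.

```shell
# Hypothetical helper: succeed if the wrapper script at $1 contains the
# literal flag text given as $2 (fixed-string match, so the leading dashes
# are not treated as grep options).
wrapper_mentions_flag() {
  grep -qF -e "$2" "$1"
}

# Possible usage, assuming nvcc_wrapper is on the PATH:
#   wrapper_mentions_flag "$(command -v nvcc_wrapper)" --host-only \
#     && echo "--host-only appears in nvcc_wrapper"
```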

bartlettroscoe commented 4 years ago

It seems likely that the ATDM APPs are not using this executable, at least not on 'waterman' (or we would have heard about it).

@gsjaardema, can we disable the build of this executable for now in our testing on just this one RDC build? Now that we are expecting to see a build error in this configuration, I fear that it will obscure the emergence of a new build error for this configuration.

bartlettroscoe commented 4 years ago

@gsjaardema,

I looked in all of the EMPIRE sources with:

$ cd EMPIRE/
$ find . -type f -exec grep -nH explore {} \; | grep -v "[.]git/"

and I could not find any usage of this SEACAS 'explore' executable in the production or test code.

I searched all of the SPARC sources with:

$ find . -type f -exec grep -nH explore {} \; | grep -v "[.]git/"

and there is mention of an explore_diff.py which looks like it depends on a program called explore in the shell path. This looks to be used in the SPARC verification test suite.
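
The same kind of search can be written as a single recursive `grep` that skips `.git` directories instead of filtering them out of the results afterwards. This is a sketch under the assumption that the installed `grep` supports `-r` and `--exclude-dir` (GNU grep does); the function name `search_sources` is made up for illustration.

```shell
# Hypothetical helper: recursively search directory $2 (default: current
# directory) for pattern $1, printing file name and line number for each
# match and never descending into .git directories.
search_sources() {
  grep -rnH --exclude-dir=.git -e "$1" "${2:-.}"
}

# Possible usage from the top of a source tree:
#   search_sources explore
```

Unlike the `find` pipeline above, this does not drop matches in files whose path merely contains `.git/` for some other reason, since only directories named `.git` are skipped.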

Therefore, SPARC might depend on this SEACAS 'explore' executable. But SPARC is not yet (if ever) using CUDA+RDC so this failing build does not impact the ATDM customers.

Is it okay if I disable the build of this executable in just this CUDA+RDC build?

gsjaardema commented 4 years ago

@bartlettroscoe Yes, disabling this in the CUDA+RDC build would be good.

bartlettroscoe commented 4 years ago

This is disabled in PR #6121 and I manually merged to 'atdm-nightly' in the commit 1fe27b5.

Putting this in review until we get confirmation from CDash tomorrow.

bartlettroscoe commented 4 years ago

FYI: The SEACAS PR https://github.com/gsjaardema/seacas/pull/154 was merged. Now we are just waiting on the merge of PR #6121 (being held up due to broken Trilinos PR tester).

bartlettroscoe commented 4 years ago

FYI: This executable has been disabled for a long time and there do not seem to be any problems reported by any ATDM customers (likely because they are not using CUDA+RDC builds). Therefore, I will add the "Stalled" label to get this off of our main list of issues.