trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/
Other
1.19k stars 565 forks source link

ROL - Cuda compiler error with test/function/test_03.cpp #2797

Closed ndellingwood closed 6 years ago

ndellingwood commented 6 years ago

Compiling with gcc/5.4.0 + cuda/8.0.44 on the White testbed results in the following compiler error: packages/rol/test/function/test_03.cpp(79): error: the type in a dynamic_cast must be a pointer or reference to a complete class type, or void *

@trilinos/rol

Motivation and Context

This error occurred while building Trilinos during kokkos integration testing with the current develop branch of Trilinos SHA 8bce2ccedbf7311887567badebbe93fa0ca34b5f

Definition of Done

Error goes away when compiling.

Steps to Reproduce

On the White testbed, load the modules described below, run configure script linked below, build Trilinos.

Your Environment

export JENKINS_ARCH="Power8;Kepler37"

export JENKINS_ARCH_CXX_FLAG="-mcpu=power8 -arch=sm_37 --expt-extended-lambda --std=c++11"

export JENKINS_ARCH_C_FLAG="-mcpu=power8" export BLAS_LIBRARIES="${BLAS_ROOT}/lib/libblas.a;gfortran;gomp" export LAPACK_LIBRARIES="${LAPACK_ROOT}/lib/liblapack.a;gfortran;gomp"

export JENKINS_DO_TESTS=ON export JENKINS_DO_EXAMPLES=ON

export TRILINOS_PATH="$(cd ../trilinos-update;pwd)" export OMPI_CXX=${TRILINOS_PATH}/packages/kokkos/bin/nvcc_wrapper export CUDA_LAUNCH_BLOCKING=1 export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1



Issue reported/discussed in 
kokkos/kokkos#1620 and #2735 
ndellingwood commented 6 years ago

Note the TRILINOS_PATH variable should be modified according to users setup.

dridzal commented 6 years ago

Is this the only error? Is this a new platform? Why is it showing now and not earlier? Please provide some history.

ndellingwood commented 6 years ago

@dridzal White is a Power8 testbed with Kepler/Pascal GPUs depending on the queue. This is not new, I noticed on March 7 as part of integration testing for kokkos 2.6.00 in PR #2351. This occurred both with/without any kokkos changes snapshot into Trilinos (if Kokkos changes had caused the break it would have been addressed then). I thought I had filed an issue but did not.

Based on comments from 2.5.00 integration testing (PR #2078) on Dec 17 there were no errors reported building on White with Cuda.

I looked at the scripts for running the integration testing on White and the same devpack has been used to load modules since at least Aug 2017 and the same queue (rhel7F - Kepler GPUs) for configuring/building Trilinos, so the environment has been stable. The cause of the issue must have surfaced between Dec 17, 2017 and Mar 7, 2018.

Let me know if you need more info.

dridzal commented 6 years ago

@gregvw Line 79 in this test looks like your work (considering that Drew and I don't use typedefs). Could you take a look at it and explain the dynamic and const casts? I see double &&, and don't quite get it.

dridzal commented 6 years ago

@gregvw Here is the git history that may be relevant:

commit addd40ea740d4984b567d0d9d5eadc7d281d70c5
Author: gvonwin <gvonwin@sandia.gov>
Date:   Mon May 14 14:08:47 2018 -0600

    Replaced Teuchos::oblackholestream with ROL::nullstream

commit 363804c924d0611cfea090fd3e17a8792b486eef
Author: Greg von Winckel <gvonwin@sandia.gov>
Date:   Thu Nov 30 16:58:59 2017 -0700

    Changed naming convention to ROL::Ptr

commit 7efdca66fb7da251c145562f2503a69a422f1660
Author: Greg von Winckel <gregvw@gmail.com>
Date:   Sun Nov 26 15:58:38 2017 -0700

    - Removed explicit usage of Teuchos::RCP, rcp, dyn_cast, const_rcp_cast, rcpFromRef, etc
    - Added type ROL::SharedPointer and methods ROL::makeShared and ROL::makeSharedFromRef,
      ROL::staticPointerCast, ROL::dynamicPointerCast, ROL::constPointerCast.

One of these commits did it.

gregvw commented 6 years ago

@dridzal - This is a weird line of code. While I do alias long types, I can't imagine intentionally using the rvalue reference declarator or const_cast here. Neither is necessary. I have replaced this line and the preceding with

auto vec = eb.get(k); auto vp = ROL::dynamicPtrCast(vec)->getVector();

@ndellingwood - Can I reproduce this error on a cee machine? I do not have access to the aforementioned module. The one gcc/cuda/openmpi module I do have access to can not compile CMake's simple test program when I run the configure script.

dridzal commented 6 years ago

@gregvw - another option is to simply merge your fix and hope the test will pass on the White testbed.

@ndellingwood - what is your recommendation?

ndellingwood commented 6 years ago

@gregvw @dridzal Thanks for looking into this quickly!

Can I reproduce this error on a cee machine

I have not tested this on a cee machine (I do not have an account), not sure if its reproduceable there. I still have my build on White, so I will make the changes you posted above and see if the compilation error goes away.

another option is to simply merge your fix and hope the test will pass on the White testbed

This is fine, assuming it passes your tests of course :)

In either case, I'll make the change in my White build and post results of the build. If Greg's change above is bundled with the other ROL install PR please don't wait for my results, since Cuda builds can be slow.

ndellingwood commented 6 years ago

@gregvw @dridzal rebuilding didn't take long at all, the changes suggested worked nicely!

gregvw commented 6 years ago

@dridzal - This was the only instance of either a const_cast or rvalue reference declarator being used anywhere in rol/test. I've pushed my change.

dridzal commented 6 years ago

@ndellingwood -- we'll make it part of the next PR. Should come later this afternoon or tomorrow morning.

dridzal commented 6 years ago

The bug fix is part of the recently merged PR https://github.com/trilinos/Trilinos/pull/2818. @ndellingwood -- please let us know if the cuda build works after this change.

ndellingwood commented 6 years ago

@dridzal the cuda build works after the change - I copy/pasted the change above into Cuda code I was testing and confirmed it works. Closing the issue, thanks for the fix!