Closed ndellingwood closed 6 years ago
Note the TRILINOS_PATH
variable should be modified according to users setup.
Is this the only error? Is this a new platform? Why is it showing now and not earlier? Please provide some history.
@dridzal White is a Power8 testbed with Kepler/Pascal GPUs depending on the queue. This is not new, I noticed on March 7 as part of integration testing for kokkos 2.6.00 in PR #2351. This occurred both with/without any kokkos changes snapshot into Trilinos (if Kokkos changes had caused the break it would have been addressed then). I thought I had filed an issue but did not.
Based on comments from 2.5.00 integration testing (PR #2078) on Dec 17 there were no errors reported building on White with Cuda.
I looked at the scripts for running the integration testing on White and the same devpack has been used to load modules since at least Aug 2017 and the same queue (rhel7F - Kepler GPUs) for configuring/building Trilinos, so the environment has been stable. The cause of the issue must have surfaced between Dec 17, 2017 and Mar 7, 2018.
Let me know if you need more info.
@gregvw Line 79 in this test looks like your work (considering that Drew and I don't use typedefs). Could you take a look at it and explain the dynamic and const casts? I see double &&, and don't quite get it.
@gregvw Here is the git history that may be relevant:
commit addd40ea740d4984b567d0d9d5eadc7d281d70c5
Author: gvonwin <gvonwin@sandia.gov>
Date: Mon May 14 14:08:47 2018 -0600
Replaced Teuchos::oblackholestream with ROL::nullstream
commit 363804c924d0611cfea090fd3e17a8792b486eef
Author: Greg von Winckel <gvonwin@sandia.gov>
Date: Thu Nov 30 16:58:59 2017 -0700
Changed naming convention to ROL::Ptr
commit 7efdca66fb7da251c145562f2503a69a422f1660
Author: Greg von Winckel <gregvw@gmail.com>
Date: Sun Nov 26 15:58:38 2017 -0700
- Removed explicit usage of Teuchos::RCP, rcp, dyn_cast, const_rcp_cast, rcpFromRef, etc
- Added type ROL::SharedPointer and methods ROL::makeShared and ROL::makeSharedFromRef,
ROL::staticPointerCast, ROL::dynamicPointerCast, ROL::constPointerCast.
One of these commits did it.
@dridzal - This is a weird line of code. While I do alias long types, I can't imagine intentionally using the rvalue reference declarator or const_cast here. Neither is necessary. I have replaced this line and the preceding with
auto vec = eb.get(k);
auto vp = ROL::dynamicPtrCast
@ndellingwood - Can I reproduce this error on a cee machine? I do not have access to the aforementioned module. The one gcc/cuda/openmpi module I do have access to can not compile CMake's simple test program when I run the configure script.
@gregvw - another option is to simply merge your fix and hope the test will pass on the White testbed.
@ndellingwood - what is your recommendation?
@gregvw @dridzal Thanks for looking into this quickly!
Can I reproduce this error on a cee machine
I have not tested this on a cee machine (I do not have an account), not sure if its reproduceable there. I still have my build on White, so I will make the changes you posted above and see if the compilation error goes away.
another option is to simply merge your fix and hope the test will pass on the White testbed
This is fine, assuming it passes your tests of course :)
In either case, I'll make the change in my White build and post results of the build. If Greg's change above is bundled with the other ROL install PR please don't wait for my results, since Cuda builds can be slow.
@gregvw @dridzal rebuilding didn't take long at all, the changes suggested worked nicely!
@dridzal - This was the only instance of either a const_cast or rvalue reference declarator being used anywhere in rol/test. I've pushed my change.
@ndellingwood -- we'll make it part of the next PR. Should come later this afternoon or tomorrow morning.
The bug fix is part of the recently merged PR https://github.com/trilinos/Trilinos/pull/2818. @ndellingwood -- please let us know if the cuda build works after this change.
@dridzal the cuda build works after the change - I copy/pasted the change above into Cuda code I was testing and confirmed it works. Closing the issue, thanks for the fix!
Compiling with gcc/5.4.0 + cuda/8.0.44 on the White testbed results in the following compiler error:
packages/rol/test/function/test_03.cpp(79): error: the type in a dynamic_cast must be a pointer or reference to a complete class type, or void *
@trilinos/rol
Motivation and Context
This error occurred while building Trilinos during kokkos integration testing with the current develop branch of Trilinos SHA 8bce2ccedbf7311887567badebbe93fa0ca34b5f
Definition of Done
Error goes away when compiling.
Steps to Reproduce
On the White testbed, load the modules described below, run configure script linked below, build Trilinos.
Your Environment
export JENKINS_ARCH="Power8;Kepler37"
export JENKINS_ARCH_CXX_FLAG="-mcpu=power8 -arch=sm_37 --expt-extended-lambda --std=c++11"
export JENKINS_ARCH_C_FLAG="-mcpu=power8" export BLAS_LIBRARIES="${BLAS_ROOT}/lib/libblas.a;gfortran;gomp" export LAPACK_LIBRARIES="${LAPACK_ROOT}/lib/liblapack.a;gfortran;gomp"
export JENKINS_DO_TESTS=ON export JENKINS_DO_EXAMPLES=ON
export TRILINOS_PATH="$(cd ../trilinos-update;pwd)" export OMPI_CXX=${TRILINOS_PATH}/packages/kokkos/bin/nvcc_wrapper export CUDA_LAUNCH_BLOCKING=1 export CUDA_MANAGED_FORCE_DEVICE_ALLOC=1