Zoltan2: Compilation errors with cuda/10.1, UVM build in colorInterior routine

ndellingwood commented 2 years ago

Bug Report

@trilinos/zoltan2

Description

The following build error occurs in the colorInterior routine of packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp with cuda/10.1.105 builds on the Weaver testbed (Power9, Volta70) with UVM enabled:

/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/kokkos/core/src/Kokkos_View.hpp(1437): error: static assertion failed with "Incompatible View copy construction"
          detected during:
            instantiation of "Kokkos::View<DataType, Properties...>::View(const Kokkos::View<RT, RP...> &, std::enable_if<Kokkos::Impl::ViewMapping<Kokkos::View<DataType, Properties...>::traits, Kokkos::View<RT, RP...>::traits, Kokkos::ViewTraits<DataType, Properties...>::specialize>::is_assignable_data_type, void>::type *) [with DataType=int *, Properties=<Kokkos::CudaUVMSpace::memory_space>, RT=int *, RP=<Kokkos::LayoutLeft, Kokkos::Device<Kokkos::HostSpace::execution_space, Kokkos::HostSpace::memory_space>, Kokkos::MemoryTraits<0U>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(132): here
            instantiation of "void Zoltan2::AlgDistance1<Adapter>::colorInterior(size_t, Kokkos::View<Zoltan2::AlgDistance1<Adapter>::lno_t *, Kokkos::Device<ExecutionSpace, MemorySpace>>, Kokkos::View<Zoltan2::AlgDistance1<Adapter>::offset_t *, Kokkos::Device<ExecutionSpace, MemorySpace>>, Teuchos::RCP<Zoltan2::AlgDistance1<Adapter>::femv_t>, Kokkos::View<Zoltan2::AlgDistance1<Adapter>::lno_t *, Kokkos::Device<ExecutionSpace, MemorySpace>>, size_t, __nv_bool) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>, ExecutionSpace=Kokkos::DefaultHostExecutionSpace, MemorySpace=Kokkos::Cuda::memory_space]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(970): here
            instantiation of "void Zoltan2::AlgDistance1<Adapter>::hybridGMB(size_t, const Teuchos::ArrayView<const Zoltan2::AlgDistance1<Adapter>::lno_t> &, const Teuchos::ArrayView<const Zoltan2::AlgDistance1<Adapter>::offset_t> &, const Teuchos::RCP<Zoltan2::AlgDistance1<Adapter>::femv_t> &, const Teuchos::ArrayView<const Zoltan2::AlgDistance1<Adapter>::gno_t> &, const Teuchos::ArrayView<const int> &, const Teuchos::ArrayView<const int> &, Teuchos::RCP<const Zoltan2::AlgDistance1<Adapter>::map_t>, const std::unordered_map<Zoltan2::AlgDistance1<Adapter>::lno_t, std::vector<int, std::allocator<int>>, std::hash<Zoltan2::AlgDistance1<Adapter>::lno_t>, std::equal_to<Zoltan2::AlgDistance1<Adapter>::lno_t>, std::allocator<std::pair<const Zoltan2::AlgDistance1<Adapter>::lno_t, std::vector<int, std::allocator<int>>>>> &) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(495): here
            instantiation of "void Zoltan2::AlgDistance1<Adapter>::color(const Teuchos::RCP<Zoltan2::ColoringSolution<Adapter>> &) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(375): here
            implicit generation of "Zoltan2::AlgDistance1<Adapter>::~AlgDistance1() [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(375): here
            instantiation of class "Zoltan2::AlgDistance1<Adapter> [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp(375): here
            instantiation of "Zoltan2::AlgDistance1<Adapter>::AlgDistance1(const Teuchos::RCP<const Zoltan2::AlgDistance1<Adapter>::base_adapter_t> &, const Teuchos::RCP<Teuchos::ParameterList> &, const Teuchos::RCP<Zoltan2::Environment> &, const Teuchos::RCP<const Teuchos::Comm<int>> &) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/problems/Zoltan2_ColoringProblem.hpp(212): here
            instantiation of "void Zoltan2::ColoringProblem<Adapter>::solve(__nv_bool) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/zoltan2/core/src/problems/Zoltan2_ColoringProblem.hpp(114): here
            instantiation of "Zoltan2::ColoringProblem<Adapter>::ColoringProblem(Adapter *, Teuchos::ParameterList *, const Teuchos::RCP<const Teuchos::Comm<int>> &) [with Adapter=MueLu::MueLuGraphBaseAdapter<MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>, MueLu::GraphBase<int, longlong, Kokkos_Compat_KokkosSerialWrapperNode>>]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/muelu/src/Transfers/Classical/MueLu_ClassicalMapFactory_def.hpp(496): here
            instantiation of "void MueLu::ClassicalMapFactory<Scalar, LocalOrdinal, GlobalOrdinal, Node>::DoDistributedGraphColoring(Teuchos::RCP<const MueLu::ClassicalMapFactory<Scalar, LocalOrdinal, GlobalOrdinal, Node>::GraphBase> &, Teuchos::ArrayRCP<MueLu::ClassicalMapFactory<Scalar, LocalOrdinal, GlobalOrdinal, Node>::LO> &, MueLu::ClassicalMapFactory<Scalar, LocalOrdinal, GlobalOrdinal, Node>::LO &) const [with Scalar=double, LocalOrdinal=int, GlobalOrdinal=longlong, Node=Kokkos_Compat_KokkosSerialWrapperNode]"
/ascldap/users/ndellin/trilinos/Trilinos-pristine/packages/muelu/src/Transfers/Classical/MueLu_ClassicalMapFactory_def.hpp(216): here

I poked at the error a bit; here are a couple of comments:

The error occurs at the attempted hand-off of the subview sv (created from femv, which is of type Tpetra::FEMultiVector<femv_scalar_t, lno_t, gno_t>) to set the vertex colors in the KokkosKernels graph coloring handle: https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L132 There is a mismatch between the memory space of femv's local View and the memory space expected by the Views associated with the kernel handle (the colorInterior routine is templated on execution and memory space). The error persists when the call at https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L130 is replaced with the non-templated getLocalViewDevice (I tried this in my local build).

The underlying problem seems to be that the types for the kernel handle and View arguments to colorInterior come from the template parameters of colorInterior, whereas femv takes on defaults and these are not required to match.
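
To illustrate the mismatch described above without pulling in Kokkos or Tpetra, here is a minimal, Kokkos-free sketch. Every name in it (the `sketch` namespace, `View`, `MultiVector`, `ColoringHandle`, `same_space_v`, `color_interior_types_match`) is a hypothetical stand-in and not the real library API; it only models the idea that one type gets its memory space from the routine's template parameter while the other gets it from the vector's own default.

```cpp
#include <type_traits>

namespace sketch {

// Stand-in memory space tags (hypothetical, not the Kokkos types).
struct HostSpace {};
struct CudaUVMSpace {};

// A toy "view" that carries its memory space in its type.
template <class T, class MemorySpace>
struct View {
  using memory_space = MemorySpace;
  T* data = nullptr;
};

// In this toy model, two views are compatible only if their memory
// spaces are the same type.
template <class Dst, class Src>
constexpr bool same_space_v =
    std::is_same<typename Dst::memory_space,
                 typename Src::memory_space>::value;

// Toy multivector: its local view type is fixed by its own (defaulted)
// node memory space, independent of any caller's template parameters.
template <class NodeMemorySpace = HostSpace>
struct MultiVector {
  using local_view_type = View<int, NodeMemorySpace>;
};

// Toy kernel handle: its color view type follows the caller's MemorySpace.
template <class MemorySpace>
struct ColoringHandle {
  using color_view_type = View<int, MemorySpace>;
};

// Models the situation in the report: the handle's types come from the
// routine's template parameter, the vector's types come from its defaults.
template <class MemorySpace, class Vector>
constexpr bool color_interior_types_match() {
  return same_space_v<typename ColoringHandle<MemorySpace>::color_view_type,
                      typename Vector::local_view_type>;
}

}  // namespace sketch
```

In this model, `color_interior_types_match<CudaUVMSpace, MultiVector<>>()` is false while `color_interior_types_match<HostSpace, MultiVector<>>()` is true, which is the same shape as the failing instantiation in the log (a host-space source View handed to a CudaUVMSpace destination).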

Another experiment I tried was explicitly copying the femvColors View into a View templated on the MemorySpace parameter of colorInterior, to get a compatible View to pass to set_vertex_colors, i.e.:

      auto femvColors = femv->template getLocalView<Kokkos::Device<ExecutionSpace,MemorySpace> >(Tpetra::Access::ReadWrite);
      auto dfemvColors = Kokkos::create_mirror_view_and_copy(MemorySpace(), femvColors);
      auto sv = subview(dfemvColors, Kokkos::ALL, 0); // create sv from dfemvColors

This helped get past the initial compilation error at https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L132, but similar incompatibility error messages occur in the hybridGMB routine. I stopped hacking at it to let someone knowledgeable about the code base give guidance or take over.

Steps to Reproduce

SHA1: 55d7185956289a91f9f75d4510b1a734ed4efd22
Configure script: Weaver testbed

module purge
module load openmpi/4.0.1/gcc/7.2.0/cuda/10.1.105 netlib/3.8.0/gcc/7.2.0 cmake/3.19.3
export OMPI_CXX=$TRILINOS_DIR/packages/kokkos/bin/nvcc_wrapper

cmake \
 -D CMAKE_BUILD_TYPE:STRING=RELEASE \
\
 -D TPL_ENABLE_MPI:STRING=ON \
 -D MPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-map-by;socket:PE=4" \
\
 -D TPL_ENABLE_BLAS:STRING=ON \
 -D TPL_BLAS_LIBRARIES:FILEPATH="-L$BLAS_ROOT/lib;-lblas;-lgfortran;-lgomp;-lm" \
 -D TPL_ENABLE_LAPACK:STRING=ON \
 -D TPL_LAPACK_LIBRARIES:FILEPATH="-L$BLAS_ROOT/lib;-llapack;-lgfortran;-lgomp" \
\
 -D Trilinos_ENABLE_TESTS=OFF \
 -D Trilinos_ENABLE_EXAMPLES=OFF \
\
 -D Trilinos_ENABLE_Kokkos=ON \
 -D Kokkos_ENABLE_CUDA=ON \
 -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
 -D Kokkos_ENABLE_CUDA_UVM=ON \
 -D Kokkos_ARCH_VOLTA70=ON \
 -D Kokkos_ARCH_POWER9=ON \
\
 -D Trilinos_ENABLE_Phalanx=ON \
 -D Phalanx_ENABLE_TESTS=ON \
 -D Trilinos_ENABLE_Stokhos=ON \
 -D Stokhos_ENABLE_TESTS=ON \
 -D Trilinos_ENABLE_Zoltan2=ON \
 -D Zoltan2_ENABLE_TESTS=ON \
 -D Trilinos_ENABLE_ShyLU_NodeTacho=OFF \
\
$TRILINOS_DIR

Phalanx and Stokhos probably do not need to be enabled to reproduce this; I had them enabled while digging into something else.

kddevin commented 2 years ago

@ndellingwood I will look at this issue on Thursday. I suspect we should just use the array typedefs from FEMultiVector. What is your timeframe for needing a fix?

ndellingwood commented 2 years ago

@kddevin thanks for the response. No urgency on the fix.

kddevin commented 2 years ago

@ndellingwood I am working my way through the errors now. The fix seems to be including the device type in the template parameters of View declarations, rather than using the default device type (e.g., Kokkos::View<int*, device_type> a rather than Kokkos::View<int*> a).
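
As a rough illustration of why spelling out the device type matters, here is a Kokkos-free sketch. All names (`sketch::View`, `HostDevice`, `CudaDevice`, `DefaultDevice`, `Algorithm`, `views_agree`) are hypothetical stand-ins, and the choice of `CudaDevice` as the default is an assumption meant to mimic a CUDA-enabled configure; this is not the actual Zoltan2 change.

```cpp
#include <type_traits>

namespace sketch {

// Stand-in device tags (hypothetical, not the Kokkos types).
struct HostDevice {};
struct CudaDevice {};

// Assumption: the library-wide default device picked at configure time.
using DefaultDevice = CudaDevice;

// Toy view: the device falls back to the library default when omitted.
template <class T, class Device = DefaultDevice>
struct View {
  using device_type = Device;
};

// Toy algorithm class that is tied to one particular device.
template <class Device>
struct Algorithm {
  using device_type = Device;

  // Follows the library default, whatever the class was instantiated with.
  using implicit_view = View<int*>;

  // Follows the class's own device, so it always agrees with device_type.
  using explicit_view = View<int*, device_type>;
};

// True when the implicit and explicit declarations land on the same device.
template <class Device>
constexpr bool views_agree() {
  return std::is_same<typename Algorithm<Device>::implicit_view::device_type,
                      typename Algorithm<Device>::explicit_view::device_type>::value;
}

}  // namespace sketch
```

With these stand-ins, `views_agree<CudaDevice>()` holds but `views_agree<HostDevice>()` does not, so a declaration that relies on the default only works when the class's device happens to equal that default.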

I am curious why these errors do not show up in CUDA 10.1 builds on ascicgpu machines or in Trilinos' PR testing on weaver/vortex. Can you say what is different about your configuration or about weaver? The only thing I notice about your configuration is specifying two platforms (volta70 and power9); would that lead to these errors?

Thanks, @ndellingwood

rppawlo commented 2 years ago

Could UVM off be making the device types inconsistent in one case?

kddevin commented 2 years ago

@ndellingwood with this cmake configuration, if I issue Kokkos::parallel_for("Initialize verts_to_send",nVtx, KOKKOS_LAMBDA(const int&i) where do you expect it to run?

In this build, Tpetra is deciding it has SerialNode, so it thinks the device type = host. But it appears that the parallel_for loops as above are trying to run on GPU. That would imply that Tpetra has chosen a default node type that is incompatible with Kokkos' default execution space, or that all of Trilinos should always specify the execution space in each parallel_for. Is that right?

Output from CMake:

KokkosKernels ETI Types
   Devices:  <Cuda,CudaSpace>;<Cuda,CudaUVMSpace>;<Serial,HostSpace>
   Scalars:  double
   Ordinals: int
   Offsets:  int;size_t
...
-- TpetraCore: Processing ETI / test support
-- Enabled Scalar types:        long long|double
-- Enabled LocalOrdinal types:  int
-- Enabled GlobalOrdinal types: long long
-- Enabled Node types:          Kokkos::Compat::KokkosSerialWrapperNode

I am trying a build with Tpetra_ENABLE_CUDA=ON and TPL_ENABLE_CUDA=ON to see whether the behavior changes.

ndellingwood commented 2 years ago

@kddevin the parallel_for you posted should execute on the default execution space based on the configuration of Kokkos, so with the configure I provided that should be Kokkos::Cuda. I assumed that if Tpetra node types were not specified at configuration, they would take the defaults from Kokkos. Is my assumption incorrect? If I've misunderstood, I'll need to be more careful to explicitly specify node types.

Re: errors on weaver vs ascic - I'm not certain, sorry. I don't have access to the ascic gpu nodes, but it could be that PR testing does not enable all packages with UVM on/off, depending on the build? We can take a peek at some configure logs from past merged PRs to compare.

masterleinad commented 2 years ago

In this build, Tpetra is deciding it has SerialNode, so it thinks the device type = host. But it appears that the parallel_for loops as above are trying to run on GPU. That would imply that Tpetra has chosen a default node type that is incompatible with Kokkos' default execution space, or that all of Trilinos should always specify the execution space in each parallel_for. Is that right?

The precise semantics are https://github.com/kokkos/kokkos/blob/90fdc39dd357eb59efafc7b175e424e4d775c112/core/src/Kokkos_Parallel.hpp#L79-L85. In general, it is much better to specify the execution space in each parallel_for. When you are moving to use execution space instances, you have to do that anyway.
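
A toy model of the dispatch rule discussed here, written without Kokkos: a call that passes only an iteration count falls back to the default execution space, while a call that passes a policy runs where the policy says. The names `sketch::parallel_for`, `sketch::RangePolicy`, `SpaceId`, and `Noop` are hypothetical stand-ins that only borrow the Kokkos spelling, and `Cuda` as the default is an assumption matching the configure script in this issue. In real Kokkos code the explicit form would pass something like `Kokkos::RangePolicy<ExecutionSpace>(0, nVtx)` as the policy argument.

```cpp
#include <cstddef>

namespace sketch {

enum class SpaceId { Serial, Cuda };

// Stand-in execution spaces (hypothetical, not the Kokkos types).
struct Serial { static constexpr SpaceId id = SpaceId::Serial; };
struct Cuda   { static constexpr SpaceId id = SpaceId::Cuda; };

// Assumption: the default execution space chosen at configure time.
using DefaultExecutionSpace = Cuda;

// Toy range policy that names the execution space in its type.
template <class ExecSpace = DefaultExecutionSpace>
struct RangePolicy {
  using execution_space = ExecSpace;
  std::size_t begin;
  std::size_t end;
  constexpr RangePolicy(std::size_t b, std::size_t e) : begin(b), end(e) {}
};

// Explicit form: the policy decides where the loop "runs".
// Returns the id of the space that was used, so the choice is observable.
template <class ExecSpace, class Functor>
constexpr SpaceId parallel_for(RangePolicy<ExecSpace> policy, Functor f) {
  for (std::size_t i = policy.begin; i < policy.end; ++i) f(i);
  return ExecSpace::id;
}

// Implicit form: a bare count silently uses the default execution space.
template <class Functor>
constexpr SpaceId parallel_for(std::size_t n, Functor f) {
  return parallel_for(RangePolicy<>(0, n), f);
}

// Trivial loop body used for illustration.
struct Noop {
  constexpr void operator()(std::size_t) const {}
};

}  // namespace sketch
```

In this model, `parallel_for(4, Noop{})` reports `SpaceId::Cuda` regardless of which node type the caller has in mind, whereas `parallel_for(RangePolicy<Serial>(0, 4), Noop{})` reports `SpaceId::Serial`.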

kddevin commented 2 years ago

I agree that, if not specified, Trilinos should use Kokkos' default types. I will look for the logic that makes that decision. But the problem of running in the default Kokkos space would still exist if a user picked an incompatible Trilinos space (e.g., Tpetra_INST_Serial=ON with Kokkos_INST_Cuda=ON).

I can add the execution space to the Zoltan2 loops, but I would not be surprised if other Trilinos tests fail with this configuration due to using default execution space.

rppawlo commented 2 years ago

Re: errors on weaver vs ascic - I'm not certain, sorry. I don't have access to the ascic gpu nodes, but it could be that PR testing does not enable all packages with UVM on/off, depending on the build? We can take a peek at some configure logs from past merged PRs to compare.

@ndellingwood - the UVM=off build is currently disabled for Trilinos PR testing. It has been on in the past, but weaver hardware is in a bad state. All CUDA testing has been moved to vortex, with UVM=off temporarily disabled. Once more resources come online, we will add the UVM=off build back.

kddevin commented 2 years ago

@ndellingwood @rppawlo I was able to reproduce the errors on ascicgpu machines by removing TPL_ENABLE_CUDA=ON from my build (to better match @ndellingwood 's build). No need to check past logs. A review of #10286 would be helpful, though. Thanks.

ndellingwood commented 2 years ago

Thanks @kddevin and @rppawlo !

github-actions[bot] commented 1 year ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

github-actions[bot] commented 1 year ago

This issue was closed due to inactivity for 395 days.
