Closed ndellingwood closed 1 year ago
@ndellingwood I will look at this issue on Thursday. I suspect we should just use the array typedefs from FEMultivector. What is your timeframe for needing a fix?
@kddevin thanks for the response. No urgency on the fix
@ndellingwood I am working my way through the errors now. The fix seems to be including the device type in the template parameters of View declarations, rather than using the default device type (e.g., `Kokkos::View<int*, device_type> a` rather than `Kokkos::View<int*> a`).
I am curious why these errors do not show up in CUDA 10.1 builds on ascicgpu machines or in Trilinos' PR testing on weaver/vortex. Can you say what is different about your configuration or about weaver? The only thing I notice about your configuration is specifying two platforms (volta70 and power9); would that lead to these errors?
Thanks, @ndellingwood
Could UVM=off be making the device types inconsistent in one case?
@ndellingwood with this CMake configuration, if I issue

```cpp
Kokkos::parallel_for("Initialize verts_to_send", nVtx, KOKKOS_LAMBDA(const int& i)
```

where do you expect it to run?
In this build, Tpetra is deciding it has SerialNode, so it thinks the device type = host. But it appears that the parallel_for loops as above are trying to run on GPU. That would imply that Tpetra has chosen a default node type that is incompatible with Kokkos' default execution space, or that all of Trilinos should always specify the execution space in each parallel_for. Is that right?
Output from CMake:

```
KokkosKernels ETI Types
   Devices:  <Cuda,CudaSpace>;<Cuda,CudaUVMSpace>;<Serial,HostSpace>
   Scalars:  double
   Ordinals: int
   Offsets:  int;size_t
...
-- TpetraCore: Processing ETI / test support
-- Enabled Scalar types: long long|double
-- Enabled LocalOrdinal types: int
-- Enabled GlobalOrdinal types: long long
-- Enabled Node types: Kokkos::Compat::KokkosSerialWrapperNode
```
I am trying a build with `Tpetra_ENABLE_CUDA=ON` and `TPL_ENABLE_CUDA=ON` to see whether the behavior changes.
@kddevin the `parallel_for` you posted should execute on the default execution space based on the configuration of Kokkos, so in the configure I provided that should be `Kokkos::Cuda`.
I assumed that if Tpetra node types were not specified at configuration, they would take the defaults from Kokkos; is my assumption incorrect? If I've misunderstood, I'll need to be more careful to explicitly specify node types.
Re: errors on weaver vs ascic - I'm not certain, sorry. I don't have access on the ascic gpu nodes, but it could be that PR testing does not enable all packages with UVM on/off, depending on the build? We can take a peek at some configure logs from past merged PRs to compare.
> In this build, Tpetra is deciding it has SerialNode, so it thinks the device type = host. But it appears that the parallel_for loops as above are trying to run on GPU. That would imply that Tpetra has chosen a default node type that is incompatible with Kokkos' default execution space, or that all of Trilinos should always specify the execution space in each parallel_for. Is that right?
The precise semantics are https://github.com/kokkos/kokkos/blob/90fdc39dd357eb59efafc7b175e424e4d775c112/core/src/Kokkos_Parallel.hpp#L79-L85. In general, it is much better to specify the execution space in each parallel_for. When you are moving to use execution space instances, you have to do that anyway.
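A minimal sketch of what specifying the execution space per loop can look like (illustrative only; the `exec_space` alias and the loop body are placeholders, not the actual Zoltan2 code):

```cpp
// Hedged sketch: pin the loop to an explicitly chosen execution space via
// RangePolicy, instead of relying on Kokkos' default execution space.
using exec_space = Kokkos::Serial;  // or Kokkos::Cuda, wherever the data lives
Kokkos::parallel_for("Initialize verts_to_send",
                     Kokkos::RangePolicy<exec_space>(0, nVtx),
                     KOKKOS_LAMBDA(const int& i) { /* ... */ });
```

Once execution space instances are in use, the policy takes the instance as its first constructor argument, which is why specifying the space per loop becomes unavoidable anyway.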
I agree that, if not specified, Trilinos should use Kokkos' default types. I will look for the logic that makes that decision. But the problem of running in the default Kokkos space would still exist if a user picked an incompatible Trilinos space (e.g., `Tpetra_INST_Serial=ON` with `Kokkos_INST_Cuda=ON`).
I can add the execution space to the Zoltan2 loops, but I would not be surprised if other Trilinos tests fail with this configuration due to using default execution space.
> Re: errors on weaver vs ascic - I'm not certain, sorry. I don't have access on the ascic gpu nodes, but it could be that PR testing does not enable all packages with UVM on/off, depending on the build? We can take a peek at some configure logs from past merged PRs to compare.
@ndellingwood - the UVM=off build is currently disabled for Trilinos PR testing. It has been on in the past, but weaver hardware is in a bad state. All CUDA testing has been moved to vortex with UVM=off temporarily disabled. Once more resources come online, we will add the UVM=off build back.
@ndellingwood @rppawlo I was able to reproduce the errors on ascicgpu machines by removing TPL_ENABLE_CUDA=ON from my build (to better match @ndellingwood 's build). No need to check past logs. A review of #10286 would be helpful, though. Thanks.
Thanks @kddevin and @rppawlo !
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the `MARKED_FOR_CLOSURE` label.
If this issue should be kept open even with no activity beyond the time limits you can add the label `DO_NOT_AUTOCLOSE`.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed due to inactivity for 395 days.
Bug Report
@trilinos/zoltan2
Description
The following build error occurs in the `colorInterior` routine of `packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp` with cuda/10.1.105 builds on the Weaver testbed (Power9, Volta70) with UVM enabled.

I poked at the error a bit, here are a couple comments:
- The error occurs with the attempted hand-off of the subview `sv` (created from `femv`, which is of type `Tpetra::FEMultiVector<femv_scalar_t, lno_t, gno_t>`) to set the vertex colors in the KokkosKernels graph coloring handle: https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L132 There is a mismatch between the memory space of `femv`'s local View and the memory space expected by Views associated with the kernel handle (the `colorInterior` routine is templated on execution and memory space). The error persists if replacing the call at https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L130 with the non-templated `getLocalViewDevice` (I tried this in my local build).
- The underlying problem seems to be that the types for the kernel handle and View arguments to `colorInterior` come from the template parameters of `colorInterior`, whereas `femv` takes on defaults, and these are not required to match.
- Another experiment I tried was explicitly creating a copy of the `femvColors` View to one templated on the `MemorySpace` parameter of `colorInterior`, to get a compatible View to pass to `set_vertex_colors`. This helped get past the initial compilation error at https://github.com/trilinos/Trilinos/blob/25b225c8b495986164a29e0ab42b99cd8781eab3/packages/zoltan2/core/src/algorithms/color/Zoltan2_AlgHybridD1.hpp#L132 but similar incompatibility error messages occur in the `hybridGMB` routine. I stopped hacking at it to let someone knowledgeable with the code base give guidance or take over.

Steps to Reproduce
Phalanx and Stokhos are probably not necessary to enable to reproduce; I had them enabled while digging into something else.