Open dssgabriel opened 1 month ago
I'm not sure it's a good idea to treat host-parallel kernels specially. In particular, it's already problematic to create a managed View, since the destructor will call Kokkos::fence, which will deadlock. I can see that there will be more restrictions like this.
@masterleinad We've decided to simply disallow calling KokkosComm from within Kokkos parallel regions for now. :+1:
The project currently does not define any semantics/rules regarding calls to KokkosComm functions within Kokkos parallel regions. This issue tries to address this and start a conversation on the usage rules we might want to set.
Semantics
Host space
Calls to KokkosComm inside a host-dispatched parallel region should behave like "classic" MPI + OpenMP applications, where multiple OpenMP threads may issue MPI calls simultaneously. This is OK as long as MPI was initialized using MPI_THREAD_MULTIPLE. However, Kokkos lambdas/functors imply some limitations; in particular, it is impossible to start communications inside of the parallel region and wait for them outside of it (see code example below). This will outright fail to compile when using KOKKOS_LAMBDAs, as the list of communication requests is captured by value. Trying to circumvent this with explicit by-reference captures would mean the code isn't "Kokkos-compliant" anymore.
In host-dispatched serial regions, the same rules apply.
It should be OK to call KokkosComm inside of host-parallel regions, as long as all communications started in that region are waited for in the same region, keeping the code valid from Kokkos' point of view.
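To make the capture issue concrete without pulling in Kokkos or MPI, here is a minimal plain C++ sketch. It relies only on the fact that KOKKOS_LAMBDA captures by value and that Kokkos functors are invoked through a const operator(). Everything else is an assumption made for illustration: Request, run_region and the two helper functions are hypothetical names, not KokkosComm, Kokkos or MPI APIs, and run_region executes serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a communication request; not a KokkosComm or MPI type.
struct Request {
  int id;
  bool done;
};

// Hypothetical stand-in for a parallel dispatch. It invokes the body through a
// const reference, mirroring the requirement that Kokkos functors provide a
// const operator(). It runs serially and returns the sum of the body's results.
template <class Body>
int run_region(int n, const Body& body) {
  int total = 0;
  for (int i = 0; i < n; ++i) total += body(i);
  return total;
}

// By-value capture copies `requests` into the closure. Without `mutable`, the
// push_back below would not compile, because the closure's operator() is const.
// With `mutable` it compiles, but only the closure's private copy grows.
// Returns the size of the caller's vector after the loop.
std::size_t outer_size_after_by_value_capture() {
  std::vector<Request> requests;
  auto body = [=](int i) mutable {
    requests.push_back(Request{i, false});
  };
  for (int i = 0; i < 4; ++i) body(i);
  return requests.size();  // the outer vector was never touched
}

// Pattern that stays valid: each iteration starts its request and completes it
// before returning, so no request has to escape the region.
int completed_inside_region(int n) {
  return run_region(n, [=](int i) {
    Request r{i, false};  // "start" the communication
    r.done = true;        // "wait" for it within the same iteration
    return r.done ? 1 : 0;
  });
}

// Small self-check tying the two cases together.
void demo() {
  assert(outer_size_after_by_value_capture() == 0);
  assert(completed_inside_region(4) == 4);
}
```

Calling demo() from a main function exercises both cases. The first helper shows why requests collected inside a by-value lambda can never be waited on from outside it; the second shows the shape of code that the rule above would allow.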
Device space
Depending on the chosen communication backend, calls to KokkosComm inside a device-dispatched parallel region may behave differently:
How are we defining "device-initiated communications"? From NCCL's perspective, it is simply a communication that is entirely performed on the GPU, but it is not "started" from within a device function; the kernel call must come from the host. We assume that other device-accelerated libraries (AMD's RCCL, Intel's oneCCL, etc.) all follow the same semantics.
Code example
The above code fails to compile with the following error:
The compiler complains about us mutably using the requests outside of the Kokkos lambda, thus discarding the const qualifier added by the generated functor.
On the other hand, if the requests are waited on inside the parallel region, the following code is valid and runs correctly (tested with the Kokkos::OpenMP host execution space):
Propositions
KOKKOS_LAMBDA and do not abuse its limitations.