kokkos / kokkos-comm

Experimental MPI Wrapper for Kokkos
https://kokkos.org/kokkos-comm/

Define semantics for calling KokkosComm within parallel regions #115

Open dssgabriel opened 1 month ago

dssgabriel commented 1 month ago

The project currently does not define any semantics/rules regarding calls to KokkosComm functions within Kokkos parallel regions. This issue aims to address that gap and start a conversation about the usage rules we might want to set.

Semantics

Host space

Calls to KokkosComm inside a host-dispatched parallel region should behave like "classic" MPI + OpenMP applications, where multiple OpenMP threads may issue MPI calls simultaneously. This is OK as long as MPI was initialized with MPI_THREAD_MULTIPLE. However, Kokkos lambdas/functors impose some limitations; in particular, it is impossible to start communications inside the parallel region and wait for them outside of it (see the code example below). This outright fails to compile when using KOKKOS_LAMBDAs, as the list of communication requests is captured by value. Trying to circumvent this with explicit by-reference captures would mean the code is no longer "Kokkos-compliant".
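For reference, the required thread level is negotiated at initialization. A minimal sketch in plain MPI (KokkosComm may provide its own initialization path, so treat this as illustrative of the underlying requirement only):

#include <cstdlib>
#include <mpi.h>

int main(int argc, char *argv[]) {
  // Request full multithreading support so that multiple host threads
  // may issue MPI calls concurrently.
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    // The MPI implementation cannot guarantee thread safety: calling
    // MPI from inside a host-parallel region would be unsafe here.
    MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }
  // ... Kokkos::initialize(), communication, Kokkos::finalize() ...
  MPI_Finalize();
  return 0;
}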

In host-dispatched serial regions, the same rules apply.

It should be OK to call KokkosComm inside host-parallel regions, as long as all communications started in a region are also waited on in that same region, keeping the code valid from Kokkos' point of view (see the second code example below).

Device space

Depending on the chosen communication backend, calls to KokkosComm inside a device-dispatched parallel region may behave differently.

How are we defining "device-initiated communications"? From NCCL's perspective, it is simply a communication that is performed entirely on the GPU, but it is not "started" from within a device function; the kernel call must come from the host. We assume that other device-accelerated communication libraries (AMD's RCCL, Intel's oneCCL, etc.) follow the same semantics.
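To make that concrete, a sketch of the NCCL model (assuming an already-initialized ncclComm_t `comm`, a cudaStream_t `stream`, a device buffer `sendbuf`, and a peer rank `dst`; the names are illustrative):

// Host code: enqueue a send that executes entirely on the GPU.
// `sendbuf` is a device pointer (e.g. a Kokkos::View's .data()).
ncclSend(sendbuf, count, ncclFloat, dst, comm, stream);
// The transfer is ordered with the kernels enqueued on `stream`;
// the host synchronizes only when it needs completion.
cudaStreamSynchronize(stream);

The communication runs entirely on the GPU, but it is enqueued by the host; it is never launched from inside a device function.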

Code example

auto rank = /* get rank from CommSpace */;
auto size = /* get size from CommSpace */;

if (0 == rank) {
  std::vector<KokkosComm::Req> requests(size);
  Kokkos::parallel_for(
    Kokkos::RangePolicy(space, 1, size),
    KOKKOS_LAMBDA(int const dst) {
      // `requests` is captured by value: the assignment below does not
      // compile because the generated functor's captures are const.
      requests[dst] = KokkosComm::isend(space, view, dst, tag, comm);
    }
  );

  // Intended (but unreachable) completion of the requests outside the
  // parallel region:
  for (auto &req : requests) {
    req.wait();
  }
} else { /* ... */ }

The above code fails to compile with the following error:

error: passing ‘const KokkosComm::Req’ as ‘this’ argument discards qualifiers [-fpermissive]
   40 |           requests[i] = KokkosComm::isend(space, v, i, tag, comm);
      |                                                                 ^~
note:   in call to ‘KokkosComm::Req& KokkosComm::Req::operator=(KokkosComm::Req&&)’
   26 | class Req {
      |       ^~~

The compiler complains because the lambda captures `requests` by value: inside the generated functor the capture is const, so assigning to `requests[dst]` discards the const qualifier.
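This is not specific to KokkosComm: a KOKKOS_LAMBDA expands to a non-mutable lambda, so all of its by-value captures are const inside operator(). A minimal plain-C++ reproduction of the same class of error (no Kokkos involved):

#include <vector>

int main() {
  std::vector<int> requests(4);
  // Non-mutable lambda: `requests` is copied into the closure and the
  // copy is const inside the call operator.
  auto functor = [=](int i) {
    requests[i] = 1;  // error: assignment through a const capture
  };
  functor(0);
  return 0;
}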

On the other hand, if the requests are waited on inside the parallel region, the following code is valid and runs correctly (tested with the Kokkos::OpenMP host execution space):

Kokkos::parallel_for(
  Kokkos::RangePolicy(space, 1, size),
  KOKKOS_LAMBDA(int const dst) {
    // Start and complete the communication within the same iteration,
    // so no request needs to escape the lambda's by-value captures.
    auto req = KokkosComm::isend(space, view, dst, tag, comm);
    req.wait();
  }
);

Propositions

masterleinad commented 1 month ago

I'm not sure it's a good idea to treat host-parallel kernels as special. In particular, it's already problematic to create a managed View inside one, since the destructor will call Kokkos::fence, which will deadlock. I can see that there will be more restrictions like this.
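A sketch of the View pitfall described above (illustrative only, assuming some iteration count n; this is expected to hang, not to run to completion):

Kokkos::parallel_for(
  "host_region", Kokkos::RangePolicy<Kokkos::OpenMP>(0, n),
  KOKKOS_LAMBDA(int const) {
    // Allocating a managed View inside a parallel region is already
    // problematic, independently of any KokkosComm call.
    Kokkos::View<double *> tmp("tmp", 16);
    // `tmp` is destroyed at the end of this iteration; the
    // deallocation path can call Kokkos::fence(), which deadlocks
    // inside the enclosing parallel region.
  }
);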

dssgabriel commented 1 month ago

@masterleinad We've decided to simply disallow calling KokkosComm from within Kokkos parallel regions for now. :+1: