kokkos / kokkos-comm

Experimental MPI Wrapper for Kokkos
https://kokkos.org/kokkos-comm/

Implement MPI functions for ArborX #53

Open cwpearson opened 4 months ago

cwpearson commented 4 months ago

@aprokop

probably skip these for a bare MPI wrapper

aprokop commented 4 months ago

This sounds right. It would also be nice to have a convenient way to get the rank and size from a communicator, but that's not a big deal.
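
(A thin wrapper over MPI_Comm_rank / MPI_Comm_size would probably cover this; a sketch with hypothetical names, not an existing kokkos-comm API:)

  #include <mpi.h>

  // Hypothetical convenience helpers; the names are illustrative only.
  inline int rank(MPI_Comm comm) { int r; MPI_Comm_rank(comm, &r); return r; }
  inline int size(MPI_Comm comm) { int s; MPI_Comm_size(comm, &s); return s; }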

Another pretty annoying thing we have to do is avoid passing --help to non-zero ranks; otherwise Kokkos prints the help message on every rank. We do it like this:

  // Strip "--help" and "--kokkos-help" from the flags passed to Kokkos if we
  // are not on MPI rank 0 to prevent Kokkos from printing the help message
  // multiple times.
  if (comm_rank != 0)
  {
    auto *help_it = std::find_if(argv, argv + argc, [](std::string const &x) {
      return x == "--help" || x == "--kokkos-help";
    });
    if (help_it != argv + argc)
    {
      std::swap(*help_it, *(argv + argc - 1));
      --argc;
    }
  }
  Kokkos::initialize(argc, argv);
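
A combined initializer could hide this from users entirely. A sketch is below; the function and its placement are hypothetical, not an existing kokkos-comm API.

  #include <Kokkos_Core.hpp>
  #include <mpi.h>
  #include <algorithm>
  #include <string>

  // Hypothetical combined init: MPI first, then strip the help flags on
  // non-zero ranks, then Kokkos.
  void initialize(int &argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int comm_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
    if (comm_rank != 0) {
      // same trick as above: drop "--help"/"--kokkos-help" on non-zero ranks
      auto *help_it = std::find_if(argv, argv + argc, [](std::string const &x) {
        return x == "--help" || x == "--kokkos-help";
      });
      if (help_it != argv + argc) {
        std::swap(*help_it, *(argv + argc - 1));
        --argc;
      }
    }
    Kokkos::initialize(argc, argv);
  }
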
aprokop commented 4 months ago

Another thing that could be worth trying to address at some point is to be able to run with GPU-aware MPI implementations, rather than doing everything on the host. We have a pathway in ArborX for it (even though we never saw it running faster, but it was like 4 years ago, maybe things are better now).

cwpearson commented 4 months ago

I expect GPU-aware MPI to be pretty high on the priority list once we get some of these initial functions implemented!

janciesko commented 4 months ago

GPU-aware MPI installations should be supported out of the box. Note that the communication itself is still host-triggered.

dutkalex commented 4 months ago

GPU-aware MPI installations should be supported out of the box. Note that the communication itself is still host-triggered.

GPU-aware communications do indeed work out of the box, but a nice-to-have in my experience is a fallback mode that deep-copies onto the host, sends, and deep-copies back upon reception, for non-GPU-aware MPI installations (or supposedly GPU-aware installations that are misconfigured and just crash in practice...). It can easily be implemented with a few #ifdefs, and it lets users run on GPUs with a slight communication overhead instead of crashing with cryptic segfaults.

dssgabriel commented 4 months ago

So how do we handle this?

  1. Simply call the MPI function and pass the view's pointer to GPU memory (as we can with a GPU-aware MPI implementation)
  2. Deep copy on a host view and initiate the communication from the host
dssgabriel commented 4 months ago

@dutkalex wrote:

It can easily be implemented with a few #ifdefs, and it lets users run on GPUs with a slight communication overhead instead of crashing with cryptic segfaults.

We should implement this with if constexpr and let users define a preprocessor variable that tells us whether (a) they know their MPI implementation is GPU-aware and want the raw MPI call, or (b) they do not know whether it is GPU-aware and want the view deep-copied onto the host, with the communication initiated from the host.

Users will always know what to expect, and it will let us remove the fallback mechanisms from the critical path once the code is compiled.
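A rough sketch of what that could look like, assuming a user-defined KOKKOSCOMM_GPU_AWARE macro and a double-typed view (all names here are illustrative, not kokkos-comm API):

  #include <Kokkos_Core.hpp>
  #include <mpi.h>

  #ifdef KOKKOSCOMM_GPU_AWARE
  inline constexpr bool gpu_aware_mpi = true;   // (a) user asserts GPU-awareness
  #else
  inline constexpr bool gpu_aware_mpi = false;  // (b) stage through the host
  #endif

  template <typename View>
  void send(View v, int dest, int tag, MPI_Comm comm) {
    if constexpr (gpu_aware_mpi) {
      // pass the (possibly device) pointer straight to MPI
      MPI_Send(v.data(), static_cast<int>(v.size()), MPI_DOUBLE, dest, tag, comm);
    } else {
      // deep copy to a host mirror and send from there; the other branch is
      // discarded at compile time, so nothing remains on the critical path
      auto mirror = Kokkos::create_mirror_view(v);
      Kokkos::deep_copy(mirror, v);
      MPI_Send(mirror.data(), static_cast<int>(mirror.size()), MPI_DOUBLE, dest,
               tag, comm);
    }
  }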

dutkalex commented 4 months ago

So how do we handle this?

To me, it's as simple as:

// (mpi_send / mpi_recv below stand for the underlying call that hands the raw
// buffer to MPI; they are placeholders, not real function names)

void send(device_view, ...){
#ifdef KOKKOSCOMM_GPU_AWARE
  mpi_send(device_view, ...); // GPU-aware: pass the device buffer directly
#else
  // fallback: stage through a host mirror and send from the host
  auto mirror = Kokkos::create_mirror_view(device_view);
  Kokkos::deep_copy(mirror, device_view);
  mpi_send(mirror, ...);
#endif
}

void recv(device_view, ...){
#ifdef KOKKOSCOMM_GPU_AWARE
  mpi_recv(device_view, ...); // GPU-aware: receive straight into device memory
#else
  // fallback: receive into a host mirror, then copy back to the device
  auto mirror = Kokkos::create_mirror_view(device_view);
  mpi_recv(mirror, ...);
  Kokkos::deep_copy(device_view, mirror);
#endif
}

You can replace the #ifdefs with if constexpr branches if you prefer, but the result will be the same, @dssgabriel

aprokop commented 4 months ago

I would argue that GPU-aware call should be a runtime and not compile time decision.

dutkalex commented 4 months ago

I would argue that GPU-aware call should be a runtime and not compile time decision.

Why would you want both compiled in? When linking against an MPI implementation, GPU-aware communications either work or they don't... I guess having this as a runtime switch wouldn't incur any measurable overhead, so in that regard, why not, but I don't see a use case for it... Maybe to measure the overhead of the degraded mode? 🤔

aprokop commented 4 months ago

A user may want to dispatch the kernel on the host or the device depending on the size of the data or on the MPI call. I think that if a view is on the host, it makes sense to use host MPI routines, while if it's on the device, to use GPU-aware routines if available and to transfer to the host if not.
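
A sketch of that dispatch, assuming a double-typed view and a runtime flag whose value could come from a config option or an MPI extension query such as Open MPI's MPIX_Query_cuda_support() (names are illustrative only):

  #include <Kokkos_Core.hpp>
  #include <mpi.h>

  template <typename View>
  void send(View v, int dest, int tag, MPI_Comm comm, bool gpu_aware_mpi) {
    constexpr bool host_accessible =
        Kokkos::SpaceAccessibility<Kokkos::HostSpace,
                                   typename View::memory_space>::accessible;
    const int n = static_cast<int>(v.size());
    if constexpr (host_accessible) {
      // view already lives in host-accessible memory: plain host MPI call
      MPI_Send(v.data(), n, MPI_DOUBLE, dest, tag, comm);
    } else if (gpu_aware_mpi) {
      // device view + GPU-aware library: pass the device pointer directly
      MPI_Send(v.data(), n, MPI_DOUBLE, dest, tag, comm);
    } else {
      // device view, no GPU-aware support: stage through a host mirror
      auto mirror = Kokkos::create_mirror_view(v);
      Kokkos::deep_copy(mirror, v);
      MPI_Send(mirror.data(), n, MPI_DOUBLE, dest, tag, comm);
    }
  }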

dutkalex commented 4 months ago

I agree, but the compile-time solution already enables this: if the view is on the host, you get a regular host MPI call regardless of the branch, because the pointer passed to MPI is a host pointer and the deep copy becomes a no-op.
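
Concretely, for a view that is already host-accessible, create_mirror_view just aliases it, so the fallback path adds neither an allocation nor a copy:

  #include <Kokkos_Core.hpp>
  #include <cassert>

  int main(int argc, char *argv[]) {
    Kokkos::ScopeGuard guard(argc, argv);
    Kokkos::View<double *, Kokkos::HostSpace> h("h", 100);
    auto m = Kokkos::create_mirror_view(h);  // h is host-accessible: no new allocation
    assert(m.data() == h.data());            // the mirror aliases the original view
    Kokkos::deep_copy(m, h);                 // source == destination: effectively a no-op
  }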

cedricchevalier19 commented 4 months ago

So how do we handle this?

To me, it's as simple as:

...

It is more complicated for asynchronous communications, and I am slightly annoyed by the silent extra allocation. However, we should handle this case. This scenario is one of the motivating examples for this project.

dutkalex commented 4 months ago

It is more complicated for asynchronous communications

Agreed. Async semantics require hooking the recv deep_copy to the wait() call and making sure the host views' lifetimes are managed properly. Nothing infeasible, but it is a bit more work.

and I am slightly annoyed by the silent extra allocation.

Agreed, this is suboptimal. I don't see an obvious solution to this problem, but I bet we can come up with something better. Or at the very least more transparent to the user...
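
To make the async point concrete, here is a rough sketch of a receive request that owns the host staging buffer and finishes the device copy in wait(), assuming a double-typed view (the class and its members are illustrative, not kokkos-comm API):

  #include <Kokkos_Core.hpp>
  #include <mpi.h>

  // Illustrative only: ties the fallback deep_copy to wait() and keeps the
  // host mirror alive for the lifetime of the request.
  template <typename DeviceView>
  class FallbackRecvRequest {
    MPI_Request req_;
    DeviceView dst_;                           // final destination on the device
    typename DeviceView::HostMirror staging_;  // host buffer MPI writes into
  public:
    FallbackRecvRequest(DeviceView dst, int src, int tag, MPI_Comm comm)
        : dst_(dst), staging_(Kokkos::create_mirror_view(dst)) {
      MPI_Irecv(staging_.data(), static_cast<int>(dst_.size()), MPI_DOUBLE, src,
                tag, comm, &req_);
    }
    void wait() {
      MPI_Wait(&req_, MPI_STATUS_IGNORE);
      Kokkos::deep_copy(dst_, staging_);  // copy back only after the data arrived
    }
  };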

aprokop commented 3 months ago

I have an ArborX branch for kokkos-comm. Currently:

cwpearson commented 3 months ago

How do you want the custom datatypes to look? Is it a Kokkos::View of things that are described by custom datatypes?

aprokop commented 3 months ago

How do you want the custom datatypes to look? Is it a Kokkos::View of things that are described by custom datatypes?

It is Kokkos::View<Box*>, where Box is a struct of PODs.

cwpearson commented 3 months ago

Do you ship the whole Box or just some of the fields?

aprokop commented 3 months ago

The whole thing. Right now, we just send it as n * sizeof(Box) bytes with MPI_BYTE.
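
For reference, a sketch of that byte-level approach (Box here is a made-up aggregate of PODs; ArborX's actual type will differ):

  #include <Kokkos_Core.hpp>
  #include <mpi.h>
  #include <type_traits>

  // Made-up stand-in for ArborX's Box.
  struct Box {
    float min[3];
    float max[3];
  };
  static_assert(std::is_trivially_copyable_v<Box>);  // required for MPI_BYTE shipping

  void send_boxes(Kokkos::View<Box *, Kokkos::HostSpace> boxes, int dest, int tag,
                  MPI_Comm comm) {
    // ship the whole struct as raw bytes: n * sizeof(Box) with MPI_BYTE
    MPI_Send(boxes.data(), static_cast<int>(boxes.size() * sizeof(Box)), MPI_BYTE,
             dest, tag, comm);
  }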