I haven't looked into the code in detail, but a few notes:

* I think the `mpi` namespace belongs in `Impl`, and we shouldn't expose it to users directly (same for `nccl` and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' `parallel_*` functions aren't prefixed by an ExecSpace namespace;
* We should remove the `i` in front of `isend`/`irecv` to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! :slightly_smiling_face:
* As discussed during last week's meeting, we should drop the `Serial` mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
* A random idea I just had, but what about `CommunicationSpace` instead of `Transport`? It sounds similar to Kokkos' `ExecutionSpace` and is modeled similarly as far as I can see here.

Thank you for the work on this big PR; things are progressing nicely!
I think the `mpi` namespace belongs in `Impl`, and we shouldn't expose it to users directly (same for `nccl` and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' `parallel_*` functions aren't prefixed by an ExecSpace namespace;
I was debating this myself - I was thinking about the MPI interop angle. I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. `Impl` suggests they are just an implementation detail. I wanted to tuck them away in `mpi` though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff. Does it seem okay?
We should remove the `i` in front of `isend`/`irecv` to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂
Will do
As discussed during last week's meeting, we should drop the `Serial` mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
This doesn't actually implement anything except MPI, so no danger here 😄 . We could implement Serial (or not) in the future as needed.
A random idea I just had, but what about `CommunicationSpace` instead of `Transport`? It sounds similar to Kokkos' `ExecutionSpace` and is modeled similarly as far as I can see here.
Good idea.
@dssgabriel I'm proposing some modifications to how the comm mode is used in https://github.com/kokkos/kokkos-comm/pull/109/commits/91260927aba6785946a825143dd529062ca9f692; this could potentially be a separate PR.
I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff.
I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place. Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.
Some other questions/remarks:
The interface that each comm space needs to implement is a struct.
Do we really need structs or could we circumvent that and only provide functions? If Kokkos handles the dispatch of `parallel_*` patterns on different execution spaces without structs, could we try a similar approach? Genuine question as I don't know how it is implemented in Kokkos.
Can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?
I think this is mandatory. NCCL does not explicitly require MPI to be initialized, but it is much easier to do with it. IMHO, MPI should be our fallback, but perhaps we can make it the default when running on a host exec space, while NCCL is the default on a (CUDA) device? Whether we can choose the backend at runtime is still unclear; I think we need to write some code to explore how it should work.
Let's discuss this PR during next week's meeting; we may want to split it up even further. I'll give a better overview of what NCCL offers and how it should interact with this PR. :+1:
It's done with structs in Kokkos because you can't do partial function specialization - e.g. you CAN'T do

```c++
template <typename View, typename CommunicationSpace>
void send(View v);

// KokkosComm implementation for MPI in here
template <typename View>
void send<View, Mpi>(View v) {...};  // error: partial specialization of a function template is not allowed
```
but you CAN do

```c++
template <typename View, typename CommunicationSpace>
struct send;

// KokkosComm implementation for MPI in here
template <typename View>
struct send<View, Mpi> {...};  // OK: partial specialization of a class template
```
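To keep the struct an implementation detail, the user-facing entry point can still be a plain function that forwards to the specialization. A minimal sketch under assumed names (`Send`, `Mpi`, `send`), not the actual KokkosComm code:

```c++
// Illustrative sketch only: the free function is what users call; the struct
// that does the per-CommunicationSpace work stays hidden behind it.
namespace KokkosComm {

struct Mpi {};  // CommunicationSpace tag, analogous to Kokkos::Cuda

template <typename View, typename CommunicationSpace>
struct Send;  // primary template, specialized per communication space

template <typename View>
struct Send<View, Mpi> {
  static void execute(const View &v, int dest) {
    // ... wrap MPI_Isend on v.data() here ...
  }
};

// Transport-agnostic free function that users call.
template <typename CommunicationSpace = Mpi, typename View>
void send(const View &v, int dest) {
  Send<View, CommunicationSpace>::execute(v, dest);
}

}  // namespace KokkosComm
```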
I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place. Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.
The value of MPI wrappers for users is twofold: there is demonstrated interest in them, and they can be realized in the near term while providing value to the overall effort.
Like you (?), I think the core API should be more thoughtful and novel. However, if what we come up with is such a small step forward that people find MPI-Kokkos interop more compelling, then they weren't going to use the new thing anyway.
@dssgabriel how are we feeling about this? We have an external party asking about writing a backend to support their research, plus our own NCCL work.
I think this is on the right track!
A few points of discussion that we probably need to figure out before merging:
* NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
* NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target. Alternatively, we could keep them as part of the MPI-interop API only (the one in the `KokkosComm::mpi` namespace).
* Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
* IMO, the OSU perf test should be in a dedicated `mpi/` subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
* As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure. I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):

```c++
#include <KokkosComm/KokkosComm.hpp>
#include <KokkosComm/mpi/comm_modes.hpp>
...
```
Rearranging the layout would let us remove some of the prefixing in the filenames, since the project structure already conveys that information (this is only a proposition, it is to be discussed/reworked!):
```
.
├── src
│   ├── KokkosComm
│   │   ├── mpi
│   │   │   ├── collectives
│   │   │   │   ├── allgather.hpp
│   │   │   │   ├── reduce.hpp
│   │   │   │   └── ...
│   │   │   ├── point_to_point
│   │   │   │   ├── recv.hpp
│   │   │   │   └── send.hpp
│   │   │   ├── utilities
│   │   │   │   ├── test_all.hpp
│   │   │   │   ├── wait.hpp
│   │   │   │   └── ...
│   │   │   ├── interop
│   │   │   │   └── ...
│   │   │   ├── comm_modes.hpp
│   │   │   └── ...
│   │   ├── nccl
│   │   │   └── ... # similar structure as `mpi`
│   │   └── whatever_backend
│   │       └── ... # same
│   ├── concepts.hpp
│   ├── KokkosComm.hpp
│   └── ...
├── unit_tests
│   ├── core
│   ├── mpi
│   └── ...
└── perf_tests
    ├── core
    ├── mpi
    │   └── osu
    └── ...
```
I will review the code once again tomorrow so that we can discuss it at next Monday's meeting and quickly merge it afterwards. :+1:
* NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination. Do you know if they are non-overtaking like MPI? I think there needs to be some way to distinguish multiple in-flight messages with the same source and dest, but keeping messages ordered may be enough. I'll drop the tags for now and we can discuss.
* NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target. Alternatively, we could keep them as part of the MPI-interop API only (the one in the `KokkosComm::mpi` namespace).
I think just keep them for MPI since they are basically an MPI thing.
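If they stay MPI-only, the comm modes could live as simple tag types in the MPI-specific namespace. A hedged sketch (apart from `CommModeReady`, which appears later in this PR, these names are assumptions):

```c++
// Hypothetical sketch: communication modes kept as an MPI-only detail.
// Each tag would map onto the corresponding MPI send flavor.
namespace KokkosComm::mpi {

struct CommModeStandard {};     // MPI_Isend
struct CommModeSynchronous {};  // MPI_Issend
struct CommModeReady {};        // MPI_Irsend

}  // namespace KokkosComm::mpi
```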
* Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
I was thinking we could support Preferred/Fallback - if the Preferred CommunicationSpace implements an API, it will be used, otherwise the fallback would be used (MPI would be the fallback for the foreseeable future).
It may be okay for a preferred CS to require a specific fallback CS (or a fallback CS at all) to be enabled (maybe NCCL falls into this camp, since it needs a fallback to help initialize itself). Or maybe just have the NCCL CS actually be a hybrid NCCL/MPI CS. I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled"; that's too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.
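One way the Preferred/Fallback idea could look at compile time; a rough sketch with assumed names (`implements_send`, `SendImpl`), not code from this PR:

```c++
// Hypothetical sketch of Preferred/Fallback dispatch: a CommunicationSpace
// advertises which operations it implements; if the preferred space lacks
// one, the call falls back to MPI.
#include <type_traits>

namespace KokkosComm {

struct Mpi {};
struct Nccl {};

template <typename CommSpace>
struct implements_send : std::false_type {};
template <>
struct implements_send<Mpi> : std::true_type {};  // MPI implements everything

template <typename View, typename CommSpace>
struct SendImpl;  // specialized per CommunicationSpace

template <typename View>
struct SendImpl<View, Mpi> {
  static void execute(const View &v, int dest) { /* MPI path */ }
};

template <typename Preferred, typename View>
void send(const View &v, int dest) {
  if constexpr (implements_send<Preferred>::value) {
    SendImpl<View, Preferred>::execute(v, dest);  // preferred space
  } else {
    SendImpl<View, Mpi>::execute(v, dest);        // MPI fallback
  }
}

}  // namespace KokkosComm
```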
* IMO, the OSU perf test should be in a dedicated `mpi/` subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
I think you're right about this, let me take another stab at it.
* As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure. I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):

```c++
#include <KokkosComm/KokkosComm.hpp>
#include <KokkosComm/mpi/comm_modes.hpp>
...
```
I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄
I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination.
As far as I understand, NCCL enqueues communication operations on a CUDA stream. If two successive NCCL calls are on the same stream, they are non-overtaking. However, I think two communications enqueued on different streams may be able to overtake each other? I believe some folks at CEA know the lead developer of NCCL, so we'll try to set up a meeting with them and get answers to all our semantics-related questions.
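For reference, a minimal sketch of the two-sends-in-a-row case with NCCL (the communicator, stream, and device buffers are assumed to be set up elsewhere); whether these can be reordered across streams is exactly the open question:

```c++
#include <nccl.h>

// Sketch only: comm, stream, and device buffers are assumed to be created
// elsewhere (ncclCommInitRank / cudaStreamCreate / cudaMalloc).
void two_sends(ncclComm_t comm, cudaStream_t stream, const float *buf_a,
               const float *buf_b, size_t count, int peer) {
  // Two point-to-point sends to the same peer, enqueued on the same stream.
  // Per the discussion above, calls enqueued on one stream should not
  // overtake each other; ordering across different streams is unclear.
  ncclGroupStart();
  ncclSend(buf_a, count, ncclFloat, peer, comm, stream);
  ncclSend(buf_b, count, ncclFloat, peer, comm, stream);
  ncclGroupEnd();
}
```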
I think just keep [CommModes] for MPI since they are basically an MPI thing.
LGTM :+1:
I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled", too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.
I agree. Maybe we can automatically enable MPI if users compiled their code with `-DKokkosComm_ENABLE_NCCL=ON`?
To be fair, NCCL only needs to broadcast a unique identifier to all participating processes, so it's a rather "lightweight" setup. Broadcasting is doable using UNIX sockets but it's much easier if we have `MPI_Bcast` available.
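For context, a sketch of that bootstrap pattern (the standard approach from NCCL's documentation, not code from this PR; it assumes MPI is already initialized and one GPU is selected per rank):

```c++
#include <mpi.h>
#include <nccl.h>

// Sketch: bootstrap a NCCL communicator using MPI to distribute the unique id.
ncclComm_t init_nccl_with_mpi() {
  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Rank 0 creates the unique identifier, then it is broadcast to everyone.
  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);
  }
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // Every rank joins the NCCL communicator using the shared identifier.
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}
```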
I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄
Great! I'm looking forward to reviewing every file :grin: Thank you for all the work on this PR!
I ended up moving all the perf tests to an `mpi` subdirectory because the perf test driver uses MPI. We should probably re-write it in our core API as a follow-up.
@dssgabriel I think all future work is waiting behind this PR, how are we feeling?
Do we want to go ahead and just merge this? We could open small issues and gradually fix the various remarks we have brought up here.
Let's merge this, and then I can open up issues for the comments here. I need an approving review though 😄
First part of https://github.com/kokkos/kokkos-comm/pull/100
Introduce ~~`Transport`~~ `CommunicationSpace` concept

This is all modeled after Kokkos Core execution spaces. Each will have its own subdirectory under src (like Kokkos' backends). The interface that each comm space needs to implement is a struct, for example `Irecv`, forward-declared in `KokkosComm_fwd.hpp`.

Why a `struct`? Because in practice what this means is that the comm space will have a partial specialization (not allowed for functions), e.g. the one in `mpi/KokkosComm_mpi_irecv.hpp`, where `Mpi` is a `struct` in the `KokkosComm` namespace, analogous to e.g. `Kokkos::Cuda`. A Serial transport would have a corresponding specialization, NCCL would have its own, and so on.
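The code blocks referenced above are not reproduced here; the following is a hypothetical reconstruction of the pattern, based on the struct-specialization discussion earlier in the thread. The exact contents of `KokkosComm_fwd.hpp` and `mpi/KokkosComm_mpi_irecv.hpp` may differ:

```c++
// Hypothetical reconstruction; signatures are assumptions, not the PR's code.
namespace KokkosComm {

struct Mpi {};  // CommunicationSpace tag, analogous to Kokkos::Cuda

namespace Impl {

// Forward declaration (KokkosComm_fwd.hpp): one primary template per operation.
template <typename RecvView, typename CommunicationSpace>
struct Irecv;

// Partial specialization for MPI (mpi/KokkosComm_mpi_irecv.hpp).
template <typename RecvView>
struct Irecv<RecvView, Mpi> {
  // ... wraps MPI_Irecv on the view's data and returns a request ...
};

}  // namespace Impl
}  // namespace KokkosComm
```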
To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.
Future work:
Introduce a `KokkosComm::Handle`
This hides the CommunicationSpace details from the KokkosComm interface. It's specialized on each CommunicationSpace. For MPI, this holds a Kokkos execution space instance and an MPI communicator.
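A rough sketch of what such a handle could look like (illustrative only; the member and method names are assumptions, not the PR's actual `Handle`):

```c++
// Hypothetical sketch: a Handle specialized per CommunicationSpace. For MPI it
// bundles a Kokkos execution space instance and an MPI communicator.
#include <Kokkos_Core.hpp>
#include <mpi.h>

namespace KokkosComm {

struct Mpi {};

template <typename CommunicationSpace>
class Handle;  // specialized per communication space

template <>
class Handle<Mpi> {
 public:
  Handle(Kokkos::DefaultExecutionSpace space, MPI_Comm comm)
      : space_(space), comm_(comm) {}

  MPI_Comm mpi_comm() const { return comm_; }
  Kokkos::DefaultExecutionSpace space() const { return space_; }

 private:
  Kokkos::DefaultExecutionSpace space_;
  MPI_Comm comm_;
};

}  // namespace KokkosComm
```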
Reduce main interface to 3 functions:
Move MPI-specific wrappers to the `mpi` namespace
This is so we can still have useful MPI interop stuff, but it's not part of the "blessed" interface.
Update unit tests and perf tests for new layout
Many unit tests are now MPI-specific. All perf_tests are only built if MPI is built (this could be revisited when other Transports are done).
Move `CommMode` and related interfaces to the `mpi` namespace
This is MPI-specific stuff. I also made the CommMode stuff more canonically C++: `ReadyCommMode` -> `CommModeReady`.
Add some additional support for non-contiguous views along existing lines
A `Req<Mpi>` knows how to keep track of some lambda functions it can call after `MPI_Wait` (e.g. to unpack the result of an `MPI_Irecv`).
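A rough sketch of that idea, with assumed member names (the PR's actual `Req` implementation may differ):

```c++
// Hypothetical sketch: a request that records callbacks to run after the
// underlying MPI_Wait completes, e.g. to unpack a non-contiguous receive.
#include <mpi.h>
#include <functional>
#include <vector>

namespace KokkosComm {

struct Mpi {};

template <typename CommunicationSpace>
class Req;

template <>
class Req<Mpi> {
 public:
  // Register a callback to run once the request has completed.
  void call_after_wait(std::function<void()> cb) {
    callbacks_.push_back(std::move(cb));
  }

  void wait() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);
    for (auto &cb : callbacks_) cb();  // e.g. unpack into the user's view
    callbacks_.clear();
  }

 private:
  MPI_Request req_ = MPI_REQUEST_NULL;
  std::vector<std::function<void()>> callbacks_;
};

}  // namespace KokkosComm
```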
Grab-bag
* `req.wait` -> `KokkosComm::wait(req)`
* `wait_any` and `wait_all`