I haven't looked into the code in detail, but a few notes:

* I think the `mpi` namespace belongs in `Impl`, and we shouldn't expose it to users directly (same for `nccl` and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' `parallel_*` functions aren't prefixed by an ExecSpace namespace;
* We should remove the `i` in front of `isend`/`irecv` to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! :slightly_smiling_face:
* As discussed during last week's meeting, we should drop the `Serial` mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
* A random idea I just had, but what about `CommunicationSpace` instead of `Transport`? It sounds similar to Kokkos' `ExecutionSpace` and is modeled similarly as far as I can see here.

Thank you for the work on this big PR; things are progressing nicely!
I think the `mpi` namespace belongs in `Impl`, and we shouldn't expose it to users directly (same for `nccl` and other transports we may implement). We only provide a "basic" API: send/receive, collectives, and other needed utilities (barriers, waits, etc...), and don't expose the underlying specialization, similarly to how Kokkos' `parallel_*` functions aren't prefixed by an ExecSpace namespace;
I was debating this myself - I was thinking about the MPI interop angle. I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. `Impl` suggests they are just an implementation detail. I wanted to tuck them away in `mpi` though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff. Does it seem okay?
We should remove the `i` in front of `isend`/`irecv` to part ways with the MPI naming scheme. It will be clearer, as we won't provide blocking versions anyway, and the non-blocking behavior is clearly defined in the docs. Also matches NCCL better! 🙂
Will do
As discussed during last week's meeting, we should drop the `Serial` mode entirely. Users should be able to emulate it using only one (MPI) process. If not, their code is probably broken;
This doesn't actually implement anything except MPI, so no danger here 😄 . We could implement Serial (or not) in the future as needed.
A random idea I just had, but what about `CommunicationSpace` instead of `Transport`? It sounds similar to Kokkos' `ExecutionSpace` and is modeled similarly as far as I can see here.
Good idea.
@dssgabriel I'm proposing some modifications to how the comm mode is used in https://github.com/kokkos/kokkos-comm/pull/109/commits/91260927aba6785946a825143dd529062ca9f692; this could potentially be a separate PR.
I think it makes sense for us to actually provide stable-ish MPI wrappers since that is one of the motivations for this library and what our current users want from it. Impl suggests they are just an implementation detail. I wanted to tuck them away in mpi though so it's clear they're not part of the "blessed" "core" API where we might do some new stuff.
I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place. Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.
Some other questions/remarks:
The interface that each comm space needs to implement is a struct.
Do we really need structs or could we circumvent that and only provide functions? If Kokkos handles the dispatch of `parallel_*` patterns on different execution spaces without structs, could we try a similar approach? Genuine question as I don't know how it is implemented in Kokkos.
Can we have two enabled transports? If so, do we have a preferred and a fallback? Does choosing between them need to happen at runtime?
I think this is mandatory. NCCL does not explicitly require MPI to be initialized, but it is much easier to do with it. IMHO, MPI should be our fallback, but perhaps we can make it the default when running on a host exec space, while NCCL is the default on a (CUDA) device? Whether we can choose the backend at runtime is still unclear; I think we need to write some code to explore how it should work.
Let's discuss this PR during next week's meeting; we may want to split it up even further. I'll give a better overview of what NCCL offers and how it should interact with this PR. :+1:
It's done with structs in Kokkos because you can't do partial function specialization - e.g. you CAN'T do

```c++
template <typename View, typename CommunicationSpace>
void send(View v);

// KokkosComm implementation for MPI in here
template <typename View>
void send<View, Mpi>(View v) {...};  // error: partial specialization of a function template is not allowed
```
but you CAN do

```c++
template <typename View, typename CommunicationSpace>
struct send;

// KokkosComm implementation for MPI in here
template <typename View>
struct send<View, Mpi> {...};  // OK: partial specialization of a class template
```
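To keep the struct an implementation detail, the user-facing entry point can still be a plain function that forwards to the specialization. A minimal sketch under assumed names (`Send`, `Mpi`, `send`), not the actual KokkosComm code:

```c++
// Illustrative sketch only: the free function is what users call; the struct
// that does the per-CommunicationSpace work stays hidden behind it.
namespace KokkosComm {

struct Mpi {};  // CommunicationSpace tag, analogous to Kokkos::Cuda

template <typename View, typename CommunicationSpace>
struct Send;  // primary template, specialized per communication space

template <typename View>
struct Send<View, Mpi> {
  static void execute(const View &v, int dest) {
    // ... wrap MPI_Isend on v.data() here ...
  }
};

// Transport-agnostic free function that users call.
template <typename CommunicationSpace = Mpi, typename View>
void send(const View &v, int dest) {
  Send<View, CommunicationSpace>::execute(v, dest);
}

}  // namespace KokkosComm
```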
I fear that if we provide direct MPI wrappers people will just start using those instead of our "core" API, as it may be easier for them to adopt. This won't add any value to their code, and they may as well keep the MPI interop they already have in place. Given that we are trying to design an API that is "transport-agnostic", I think we should push users towards it as much as possible.
The value of MPI wrappers for users is twofold: there is demonstrated interest in them, and they can be realized in the near term while providing value to the overall effort.
Like you (?), I think the core API should be more thoughtful and novel. However, if what we come up with is such a small step forward that people find MPI-Kokkos interop more compelling, then they weren't going to use the new thing anyway.
@dssgabriel how are we feeling about this? We have an external party asking about writing a backend to support their research, plus our own NCCL work.
I think this is on the right track!
A few points of discussion that we probably need to figure out before merging:
* NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
* NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target. Alternatively, we could keep them as part of the MPI-interop API only (the one in the `KokkosComm::mpi` namespace).
* Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
* IMO, the OSU perf test should be in a dedicated `mpi/` subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
* As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure. I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):

```c++
#include <KokkosComm/KokkosComm.hpp>
#include <KokkosComm/mpi/comm_modes.hpp>
...
```
Rearranging the layout would let us remove some of the prefixing in the filenames, since the project structure already conveys that information (this is only a proposition, it is to be discussed/reworked!):
```
.
├── src
│   ├── KokkosComm
│   │   ├── mpi
│   │   │   ├── collectives
│   │   │   │   ├── allgather.hpp
│   │   │   │   ├── reduce.hpp
│   │   │   │   └── ...
│   │   │   ├── point_to_point
│   │   │   │   ├── recv.hpp
│   │   │   │   └── send.hpp
│   │   │   ├── utilities
│   │   │   │   ├── test_all.hpp
│   │   │   │   ├── wait.hpp
│   │   │   │   └── ...
│   │   │   ├── interop
│   │   │   │   └── ...
│   │   │   ├── comm_modes.hpp
│   │   │   └── ...
│   │   ├── nccl
│   │   │   └── ... # similar structure as `mpi`
│   │   └── whatever_backend
│   │       └── ... # same
│   ├── concepts.hpp
│   ├── KokkosComm.hpp
│   └── ...
├── unit_tests
│   ├── core
│   ├── mpi
│   └── ...
└── perf_tests
    ├── core
    ├── mpi
    │   └── osu
    └── ...
```
I will review the code once again tomorrow so that we can discuss it at next Monday's meeting and quickly merge it afterwards. :+1:
* NCCL does not have the concept of tags: do we consider removing them from our "core/blessed" API? Maybe we can provide an overload that has the tag (so we can just ignore it in the NCCL specialization).
I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination. Do you know if they are non-overtaking like MPI? I think there needs to be some way to distinguish multiple in-flight messages with the same source and dest, but keeping messages ordered may be enough. I'll drop the tags for now and we can discuss.
* NCCL does not have the concept of communication modes: do we drop them entirely? I am not sure they have actual use cases in the applications we target. Alternatively, we could keep them as part of the MPI-interop API only (the one in the `KokkosComm::mpi` namespace).
I think just keep them for MPI since they are basically an MPI thing.
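If they stay MPI-only, the comm modes could live as simple tag types in the MPI-specific namespace. A hedged sketch (apart from `CommModeReady`, which appears later in this PR, these names are assumptions):

```c++
// Hypothetical sketch: communication modes kept as an MPI-only detail.
// Each tag would map onto the corresponding MPI send flavor.
namespace KokkosComm::mpi {

struct CommModeStandard {};     // MPI_Isend
struct CommModeSynchronous {};  // MPI_Issend
struct CommModeReady {};        // MPI_Irsend

}  // namespace KokkosComm::mpi
```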
* Does the PR in its current form let us have multiple CommunicationSpaces enabled at the same time? I feel like this will be a requirement (e.g., NCCL is far easier to initialize if MPI is available, I don't know about other transport layers).
I was thinking we could support Preferred/Fallback - if the Preferred CommunicationSpace implements an API, it will be used, otherwise the fallback would be used (MPI would be the fallback for the foreseeable future).
It may be okay for a preferred CS to require a specific fallback CS (or a fallback CS at all) to be enabled (maybe NCCL falls into this camp, since it needs a fallback to help initialize itself). Or maybe just have the NCCL CS actually be a hybrid NCCL/MPI CS. I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled"; that's too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.
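One way the Preferred/Fallback idea could look at compile time; a rough sketch with assumed names (`implements_send`, `SendImpl`), not code from this PR:

```c++
// Hypothetical sketch of Preferred/Fallback dispatch: a CommunicationSpace
// advertises which operations it implements; if the preferred space lacks
// one, the call falls back to MPI.
#include <type_traits>

namespace KokkosComm {

struct Mpi {};
struct Nccl {};

template <typename CommSpace>
struct implements_send : std::false_type {};
template <>
struct implements_send<Mpi> : std::true_type {};  // MPI implements everything

template <typename View, typename CommSpace>
struct SendImpl;  // specialized per CommunicationSpace

template <typename View>
struct SendImpl<View, Mpi> {
  static void execute(const View &v, int dest) { /* MPI path */ }
};

template <typename Preferred, typename View>
void send(const View &v, int dest) {
  if constexpr (implements_send<Preferred>::value) {
    SendImpl<View, Preferred>::execute(v, dest);  // preferred space
  } else {
    SendImpl<View, Mpi>::execute(v, dest);        // MPI fallback
  }
}

}  // namespace KokkosComm
```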
* IMO, the OSU perf test should be in a dedicated `mpi/` subdirectory (similar to some of the unit tests), so as not to mix tests that use the "core/blessed" API and the MPI-specific stuff.
I think you're right about this, let me take another stab at it.
* As this is a big PR that moves/renames a lot of files, we might as well really think about our project layout/structure. I am quite a fan of system includes that are prefixed with the project name (e.g. boost, fmt, etc...):

```c++
#include <KokkosComm/KokkosComm.hpp>
#include <KokkosComm/mpi/comm_modes.hpp>
...
```
I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄
I tried reading the NCCL docs but I couldn't figure out what the semantics are if I do two ncclSend/ncclRecv in a row to the same destination.
As far as I understand, NCCL enqueues communication operations on a CUDA stream. If two successive NCCL calls are on the same stream, they are non-overtaking. However, I think two communications enqueued on different streams may be able to overtake each other? I believe some folks at CEA know the lead developer of NCCL, so we'll try to set up a meeting with them and get answers to all our semantics-related questions.
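For reference, a minimal sketch of the two-sends-in-a-row case with NCCL (the communicator, stream, and device buffers are assumed to be set up elsewhere); whether these can be reordered across streams is exactly the open question:

```c++
#include <nccl.h>

// Sketch only: comm, stream, and device buffers are assumed to be created
// elsewhere (ncclCommInitRank / cudaStreamCreate / cudaMalloc).
void two_sends(ncclComm_t comm, cudaStream_t stream, const float *buf_a,
               const float *buf_b, size_t count, int peer) {
  // Two point-to-point sends to the same peer, enqueued on the same stream.
  // Per the discussion above, calls enqueued on one stream should not
  // overtake each other; ordering across different streams is unclear.
  ncclGroupStart();
  ncclSend(buf_a, count, ncclFloat, peer, comm, stream);
  ncclSend(buf_b, count, ncclFloat, peer, comm, stream);
  ncclGroupEnd();
}
```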
I think just keep [CommModes] for MPI since they are basically an MPI thing.
LGTM :+1:
I don't think we want to allow people to say "oh, my backend requires X/Y/Z CS also be enabled", too complicated. Something that complicated should just be its own CS that manages/uses X/Y/Z internally.
I agree. Maybe we can automatically enable MPI if users compiled their code with `-DKokkosComm_ENABLE_NCCL=ON`?
To be fair, NCCL only needs to broadcast a unique identifier to all participating processes, so it's a rather "lightweight" setup. Broadcasting is doable using UNIX sockets but it's much easier if we have `MPI_Bcast` available.
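For context, a sketch of that bootstrap pattern (the standard approach from NCCL's documentation, not code from this PR; it assumes MPI is already initialized and one GPU is selected per rank):

```c++
#include <mpi.h>
#include <nccl.h>

// Sketch: bootstrap a NCCL communicator using MPI to distribute the unique id.
ncclComm_t init_nccl_with_mpi() {
  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Rank 0 creates the unique identifier, then it is broadcast to everyone.
  ncclUniqueId id;
  if (rank == 0) {
    ncclGetUniqueId(&id);
  }
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // Every rank joins the NCCL communicator using the shared identifier.
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);
  return comm;
}
```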
I prefer this too, I was just following Kokkos Core and Kernels. Let me see if I can make this change. Then we'll really be touching every file 😄
Great! I'm looking forward to reviewing every file :grin: Thank you for all the work on this PR!
I ended up moving all the perf tests to an `mpi` subdirectory because the perf test driver uses MPI. We should probably re-write it in our core API as a follow-up.
@dssgabriel I think all future work is waiting behind this PR, how are we feeling?
Do we want to go ahead and just merge this? We could open small issues and gradually fix the various remarks we have brought up here.
Let's merge this, and then I can open up issues for the comments here. I need an approving review though 😄
First part of https://github.com/kokkos/kokkos-comm/pull/100
Introduce ~~`Transport`~~ `CommunicationSpace` concept

This is all modeled after Kokkos Core execution spaces. Each will have its own subdirectory under src (like Kokkos' backends). The interface that each comm space needs to implement is a struct, for example `Irecv`, forward-declared in `KokkosComm_fwd.hpp`.

Why a `struct`? Because in practice what this means is that the comm space will have a partial specialization (not allowed for functions), e.g. the one in `mpi/KokkosComm_mpi_irecv.hpp`, where `Mpi` is a `struct` in the `KokkosComm` namespace, analogous to e.g. `Kokkos::Cuda`. A Serial transport would have a corresponding specialization, NCCL would have its own, and so on.
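The code blocks referenced above are not reproduced here; the following is a hypothetical reconstruction of the pattern, based on the struct-specialization discussion earlier in the thread. The exact contents of `KokkosComm_fwd.hpp` and `mpi/KokkosComm_mpi_irecv.hpp` may differ:

```c++
// Hypothetical reconstruction; signatures are assumptions, not the PR's code.
namespace KokkosComm {

struct Mpi {};  // CommunicationSpace tag, analogous to Kokkos::Cuda

namespace Impl {

// Forward declaration (KokkosComm_fwd.hpp): one primary template per operation.
template <typename RecvView, typename CommunicationSpace>
struct Irecv;

// Partial specialization for MPI (mpi/KokkosComm_mpi_irecv.hpp).
template <typename RecvView>
struct Irecv<RecvView, Mpi> {
  // ... wraps MPI_Irecv on the view's data and returns a request ...
};

}  // namespace Impl
}  // namespace KokkosComm
```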
To support this, a lot of the include structure needs to be refined and adjusted to more closely match how Kokkos Core does it.
Future work:
Introduce a `KokkosComm::Handle`
This hides the CommunicationSpace details from the KokkosComm interface. It's specialized on each CommunicationSpace. For MPI, this holds a Kokkos execution space instance and an MPI communicator.
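A rough sketch of what such a handle could look like (illustrative only; the member and method names are assumptions, not the PR's actual `Handle`):

```c++
// Hypothetical sketch: a Handle specialized per CommunicationSpace. For MPI it
// bundles a Kokkos execution space instance and an MPI communicator.
#include <Kokkos_Core.hpp>
#include <mpi.h>

namespace KokkosComm {

struct Mpi {};

template <typename CommunicationSpace>
class Handle;  // specialized per communication space

template <>
class Handle<Mpi> {
 public:
  Handle(Kokkos::DefaultExecutionSpace space, MPI_Comm comm)
      : space_(space), comm_(comm) {}

  MPI_Comm mpi_comm() const { return comm_; }
  Kokkos::DefaultExecutionSpace space() const { return space_; }

 private:
  Kokkos::DefaultExecutionSpace space_;
  MPI_Comm comm_;
};

}  // namespace KokkosComm
```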
Reduce main interface to 3 functions:
Move MPI-specific wrappers to the `mpi` namespace
This is so we can still have useful MPI interop stuff, but it's not part of the "blessed" interface.
Update unit tests and perf tests for new layout
Many unit tests are now MPI-specific. All perf_tests are only built if MPI is built (this could be revisited when other Transports are done).
Move `CommMode` and related interfaces to the `mpi` namespace
This is MPI-specific stuff. I also made the CommMode stuff more canonically C++: `ReadyCommMode` -> `CommModeReady`.
Add some additional support for non-contiguous views along existing lines
A `Req<Mpi>` knows how to keep track of some lambda functions it can call after `MPI_Wait` (e.g. to unpack the result of an `MPI_Irecv`).
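A rough sketch of that idea, with assumed member names (the PR's actual `Req` implementation may differ):

```c++
// Hypothetical sketch: a request that records callbacks to run after the
// underlying MPI_Wait completes, e.g. to unpack a non-contiguous receive.
#include <mpi.h>
#include <functional>
#include <vector>

namespace KokkosComm {

struct Mpi {};

template <typename CommunicationSpace>
class Req;

template <>
class Req<Mpi> {
 public:
  // Register a callback to run once the request has completed.
  void call_after_wait(std::function<void()> cb) {
    callbacks_.push_back(std::move(cb));
  }

  void wait() {
    MPI_Wait(&req_, MPI_STATUS_IGNORE);
    for (auto &cb : callbacks_) cb();  // e.g. unpack into the user's view
    callbacks_.clear();
  }

 private:
  MPI_Request req_ = MPI_REQUEST_NULL;
  std::vector<std::function<void()>> callbacks_;
};

}  // namespace KokkosComm
```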
Grab-bag
* `req.wait` -> `KokkosComm::wait(req)`
* `wait_any` and `wait_all`