zero overhead low level api

dutkalex commented 6 months ago

This PR proposes a design which we've had some success with in our in-house solution. This does not have to be integrated as is, the goal is rather to initiate a discussion on the design aspects. I believe this design enables to support the current type-erased solution implemented in the Req class, but is also compatible with other semantics/tradeoffs

dutkalex commented 6 months ago

Different problems requires different communication strategies. Therefore, using KokkosComm should not force the user to use a suboptimal communication mechanism with regards to the problem at hand. This PR proposes a first draft of a flexible enough design (IMO at least, please let me know if I'm missing something), which decouples the wrapping of the raw MPI functionalities from the higher-level communication scheme. I have laid out the outline of 2 very different schemes both relying on the same low level wrappers to demonstrate my point: the current type-erased solution, and a more static solution (roughly mirroring what I have developed for our in-house CFD solver) which is better suited for reccurent communication patterns which are known ahead of time

devreal commented 6 months ago

In general: what is the benefit of having two ways of doing the same thing? There are free-standing functions and this introduces member functions that (ideally) do the same. Can you explain a little more what the upside of member functions is?

dutkalex commented 6 months ago

@devreal What I advocate for is not having two ways of doing the same thing. Once again this is not meant to be merged as is, and the only reason why I did not use the already implemented internals is because of their tight coupling with the type erased Req class.

The first idea I advocate for is to decouple the wrapping of the raw MPI primitives from the communication scheme considerations. The current implementation forces the user to use the type-erased solution, which comes with some overhead, and which is not the best solution for all problems. I would even argue that this would be a strong argument against using KokkosComm if having memory allocations all over the place is not an option. Therefore, I think it makes sense to first introduce basic idiomatic C++ wrappers to ease manipulation of the raw MPI primitives, and then built on top of that to propose different communication strategies with different tradeoffs, or even let the user implement a custom one using the lower-level wrappings.

The second idea I advocate for is that a member function API would be better than the current free function implementation for the following reasons:

No weakly-type MPI_Comm object leaks into the client code with this approach
All MPI calls require a MPI_Comm object, so it makes sense to implement these as member functions IMO
With this approach, no KokkosComm:: prefix is needed which makes the syntax simpler
The choice of a communication mode is often not specific to a single MPI call. Therefore, although it is not demonstrated in the PR code, the Communicator abstraction can be used to store this kind of general parameters. This also contributes to having a simpler syntax for most MPI calls: auto req = comm.irecv( view, from_rank, tag ); is more elegant and user-friendly than the more verbose auto req = KokkosComm::irecv< RecvMode >( view, from_rank, tag, MPI_COMM_WORLD );, especially when doing many communication calls
Building on top of the previous argument, it is also easy to decouple construction from behavior with this approach. For example, I can create a communicator object with some choices of communication modes, and then hand this to my solver which does not have to care about these considerations which are encapsulated in the communicator object. With this approach it becomes really easy to enforce a coherent application-wide configuration

dutkalex commented 6 months ago

The choice of a communication mode is often not specific to a single MPI call. Therefore, although it is not demonstrated in the PR code, the Communicator abstraction can be used to store this kind of general parameters.

I have made the necessary changes to illustrate the potential of the approach. Also tagging @cwpearson for reference

devreal commented 6 months ago

This PR proposes a first draft of a flexible enough design (IMO at least, please let me know if I'm missing something), which decouples the wrapping of the raw MPI functionalities from the higher-level communication scheme

Do you mean "decoupling communication and resource management"? We could have a way to allow users to query the requirements for a given view to manage resources explicitly, both for datatypes and temporary buffers:

// query whether we need a temporary buffer or a derived datatype
std::variant<size_t, KokkosComm::Type> opt = KokkosComm::query(view);
if (std::holds_alternative<size_t>(opt)) {
  // provide a temporary buffer for packing, takes ownership until completion
  ensure_size(std::get<size_t>(opt), my_properly_sized_buf);
  KokkosComm::isend(view, comm, ..., my_properly_sized_buf);
} else {
  // provides a datatype we have cached (could be built-in for contiguous views)
  KokkosComm::isend(view, comm, ..., std::get<KokkosComm::Type>(opt));
}

// no cached information for another_view, let kokkos figure it out
KokkosComm::isend(another_view, comm, ...,);

dutkalex commented 6 months ago

Do you mean "decoupling communication and resource management"?

Yes @devreal ! We should have a zero-overhead abstraction layer on top of MPI which simply does the encapsulation with the same semantics, but exposing a nicer and more kokkos-friendly API, and then build on top of that all the additional fancy functionalities, rather than have do-it-all wrappers as is currently the case...

We could have a way to allow users to query the requirements for a given view to manage resources explicitly, both for datatypes and temporary buffers

This seems reasonable to me, although I dont quite get where you get your my_properly_sized_buf object from in your example...

dutkalex commented 6 months ago

@devreal considering your example, an interesting approach could be to have the communicator.member_fcn() API have the same semantics as the raw MPI call and be just syntactic sugar basically, and have the free functions implement the more elaborate semantics (as in the last line of your example copied below) using the member functions API. What do you think of it?

// no cached information for another_view, let kokkos figure it out
KokkosComm::isend(another_view, comm, ...,);

dssgabriel commented 6 months ago

No weakly-type MPI_Comm object leaks into the client code with this approach

I like this approach. We had already considered adding a Communicator type that would wrap the underlying MPI_Comm and the Kokkos execution space for performance-related reasons. It would allow us to bypass some MPI assertions (e.g., checking pointer provenance in a GPU-only context is expensive and useless as Kokkos can already statically prove the view is in GPU memory).

With this approach, no KokkosComm:: prefix is needed which makes the syntax simpler

All MPI calls require a MPI_Comm object, so it makes sense to implement these as member functions IMO

I think this is indeed an elegant solution that reduces the verbosity of the code, and it's also the approach taken in other languages implementing MPI bindings (e.g., see rsmpi).

This effort needs a few more iterations, but we should consider it.

cwpearson commented 6 months ago

some unit tests please!

dutkalex commented 6 months ago

some unit tests please!

Do you prefer to test by source file or by feature/overload set? (I would go for the latter)

dutkalex commented 6 months ago

And at least a skeleton of documentation, and if the documentation is incomplete, open one or more issues against yourself 😄

@cwpearson Noted. Maybe I'll wait to be sure we are all on the same page before writing the doc though 😉

kokkos / kokkos-comm

zero overhead low level api #50