mpi-forum / mpi-issues

Tickets for the MPI Forum

http://www.mpi-forum.org/

67 stars 8 forks source link

What features do users need from an MPI C++ interface? #288

Open jeffhammond opened 4 years ago

jeffhammond commented 4 years ago

This is a meta-issue, which I am creating to capture user feedback on MPI C++ bindings.

I am moving this over from https://scicomp.stackexchange.com/questions/7978/what-features-do-users-need-from-an-mpi-c-interface, which was extremely well-received despite not complying with the rules of StackExchange.

Original Prompt

The 3.0 version of the MPI standard formally deleted the C++ interface (it was previously deprecated). While implementations may still support it, features that are new in MPI-3 do not have a C++ interface defined in the MPI standard. See http://blogs.cisco.com/performance/the-mpi-c-bindings-what-happened-and-why/ for more information.

The motivation for removing the C++ interface from MPI was that it had no significant value over the C interface. There were very few differences other than "s/_/::/g" and many features that C++ users are accustomed to were not employed (e.g. automatic type determination via templates).

As someone who participates in the MPI Forum and works with a number of C++ projects that have implemented their own C++ interface to the MPI C functions, I would like to know what are the desirable features of a C++ interface to MPI. While I commit to nothing, I would be interested in seeing the implementation of a standalone MPI C++ interface that meets the needs of many users.

And yes, I am familiar with Boost::MPI but it only supports MPI-1 features and the serialization model would be extremely difficult to support for RMA.

One C++ interface to MPI that I like is that of Elemental's mpi wrapper so perhaps people can provide some pro and con w.r.t. that approach. In particular, I think MpiMap solves an essential problem.

jeffhammond commented 4 years ago

Wolfgang Bangerth provided the following response (https://scicomp.stackexchange.com/a/7991/150):

Let me first answer why I think C++ interfaces to MPI have generally not been overly successful, having thought about the issue for a good long time when trying to decide whether we should just use the standard C bindings of MPI or building on something at higher level:

When you look at real-world MPI codes (say, PETSc, or in my case deal.II), one finds that maybe surprisingly, the number of MPI calls isn't actually very large. For example, in the 500k lines of deal.II, there are only ~100 MPI calls. A consequence of this is that the pain involved in using lower-level interfaces such as the MPI C bindings, is not too large. Conversely, one would not gain all that much by using higher level interfaces.

My second observation is that many systems have multiple MPI libraries installed (different MPI implementations, or different versions). This poses a significant difficulty if you wanted to use, say, boost::mpi that don't just consist of header files: either there needs to be multiple installations of this package as well, or one needs to build it as part of the project that uses boost::mpi (but that's a problem in itself again, given that boost uses its own build system, which is unlike anything else).

So I think all of this has conspired against the current crop of C++ interfaces to MPI: The old MPI C++ bindings didn't offer any advantage, and external packages had difficulties with the real world.

This all said, here's what I think would be the killer features I would like to have from a higher-level interface:

It should be generic. Having to specify the data type of a variable is decidedly not C++-like. Of course, it also leads to errors. Elemental's MpiMap class would already be a nice first step (though I can't figure out why the heck the MpiMap::type variable isn't static const, so that it can be accessed without creating an object).
It should have facilities for streaming arbitrary data types.
Operations that require an MPI_Op argument (e.g., reductions) should integrate nicely with C++'s std::function interface, so that it's easy to just pass a function pointer (or a lambda!) rather than having to clumsily register something.

boost::mpi actually satisfies all of these. I think if it were a header-only library, it'd be a lot more popular in practice. It would also help if it supported post-MPI 1.0 functions, but let's be honest: this covers most of what we need most of the time.

jeffhammond commented 4 years ago

@gnzlbg provided the following response (https://scicomp.stackexchange.com/a/14640/150):

My list in no particular order of preference. The interface should:

be header only, without any dependencies but <mpi.h>, and the standard library,
be generic and extensible,
be non-blocking only (if you want to block, then block explicitly, not by default),
allow continuation-based chaining of non-blocking operations,
support extensible and efficient serialization (Boost.Fusion like, such that it works with RMA),
have zero abstraction penalty (i.e. be at least as fast as the C interface),
be safe (the destructor of a non-ready future is called? -> std::terminate!),
have a strong DEBUG mode with tons of assertions,
extremely type-safe (no more ints/void* for everything, heck I want tags to be types!),
it should work with lambdas (e.g. all reduce + lambda),
use exceptions consistently as error-reporting and error-handling mechanism (no more error codes! no more function output arguments!),
MPI-IO should offer a non-blocking I/O interface in the style of Boost.AFIO,
and just follow good modern C++ interface design practices (define regular types, non-member non-friend functions, play well with move semantics, support range operations, ...)

Extras:

allow me to chose the executor of the MPI environment, that is, which thread pool it uses. Right now you can have applications with a mix of OpenMP, MPI, CUDA, and TBB... all at the same time, where each run-time thinks it owns the environment and thus ask the operating system for threads every time they feel like it. Seriously?
use the STL (and Boost) naming convention. Why? Every C++ programmer knows it.

I want to write code like this:

    auto buffer = some_t{no_ranks};
    auto future = gather(comm, root(comm), my_offsets, buffer)
                  .then([&](){
                    /* when the gather is finished, this lambda will 
                       execute at the root node, and perform an expensive operation
                       there asynchronously (compute data required for load 
                       redistribution) whose result is broadcasted to the rest 
                       of the communicator */
                    return broadcast(comm, root(comm), buffer);
                  }).then([&]() {
                    /* when broadcast is finished, this lambda executes 
                       on all processes in the communicator, performing an expensive
                       operation asynchronously (redistribute the load, 
                       maybe using non-blocking point-to-point communication) */
                     return do_something_with(buffer);
                  }).then([&](auto result) {
                     /* finally perform a reduction on the result to check
                        everything went fine */
                     return all_reduce(comm, root(comm), result, 
                                      [](auto acc, auto v) { return acc && v; }); 
                  }).then([&](auto result) {
                      /* check the result at every process */
                      if (result) { return; /* we are done */ }
                      else {
                        root_only([](){ write_some_error_log(); });
                        throw some_exception;
                      }
                  });

    /* Here nothing has happened yet! */

    /* ... lots and lots of unrelated code that can execute concurrently 
       and overlaps with communication ... */

    /* When we now call future.get() we will block 
       on the whole chain (which might have finished by then!).
    */

    future.get();

Think how one could chain all this operations using MPI_C's requests. You would have to test at multiple (or every single) intermediate step through a whole lot of unrelated code to see if you can advance your chain without blocking.

jeffhammond commented 4 years ago

GradGuy provided the following response (https://scicomp.stackexchange.com/a/8009/150):

Personally, I don't really mind calling long C-style functions for the exact reason Wolfgang mentioned; there are really few places you need to call them and even then, they almost always get wrapped around by some higher-level code.

The only things that really bother me with C-style MPI are custom datatypes and, to a lesser degree, custom operations (because I use them less often). As for custom datatypes, I'd say that a good C++ interface should be able to support generic and efficient way of handling this, most probably through serialization. This is of course the route that boost.mpi has taken, which if you are careful, is a big time saver.

As for boost.mpi having extra dependencies (particularly boost.serialization which itself is not header-only), I've recently came across a header-only C++ serialization library called cereal which seems promising; granted it requires a C++11 compliant compiler. It might worth looking into and using it as a based for something similar to boost.mpi.

jeffhammond commented 4 years ago

Utkarsh Bhardwaj provided the following response (https://scicomp.stackexchange.com/a/25094/150):

The github project easyLambda provides a high level interface to MPI with C++14.

I think the project has similar goals and it will give some idea on things that can be and are being done in this area by using modern C++. Guiding other efforts as well as easyLambda itself.

The initial benchmarks on performance and lines of code have shown promising results.

Following is a short description of features and interface it provides.

The interface is based on data flow programming and functional list operations that provide inherent parallelism. The parallelism is expressed as property of a task. The process allocation and data distribution for the task can be requested with a .prll() property. There are good number of examples in the webpage and code-repository that include LAMMPS molecular dynamics post processing, explicit finite difference solution to heat equation, logistic regression etc. As an example the heat diffusion problem discussed in the article HPC is dying... can be expressed in ~20 lines of code.

I hope it is fine to give links rather than adding more details and example codes here.

Disclamer: I am the author of the library. I believe I am not doing any harm in hoping to get a constructive feedback on the current interface of easyLambda that might be advantageous to easyLambda and any other project that pursues similar goals.

mhoemmen commented 4 years ago

Given how fast the C++ Standard is moving with respect to thread and task parallelism, coroutines, networking, and reflection, it seems premature to standardize a C++ MPI interface now. Why not let all these great libraries first build experience presenting a modern C++ interface to the latest MPI features? Why repeat the mistake of the '90s and rush to standardize? I would love for someone to modernize Boost.MPI, for example; I would be happy to help with that (at least to test changes).

If we want gather(...).then(...).then(...)...., then why not build on the C++ networking TS? If we worry about thread interactions, then why not wait on (or participate in) an executors-networking merger? I can guess some reasons why, but I would expect an MPI proposal to answer questions like that.

Regarding a header-only library: this sounds good if you're starting a new project, but some existing C++ projects that use MPI care a lot about build sizes and times. If we want to put something in the MPI Standard, I'd like to see some build experiments in real applications.

mhoemmen commented 4 years ago

Wolfgang Bangerth wrote:

My second observation is that many systems have multiple MPI libraries installed (different MPI implementations, or different versions). This poses a significant difficulty if you wanted to use, say, boost::mpi that don't just consist of header files: either there needs to be multiple installations of this package as well, or one needs to build it as part of the project that uses boost::mpi (but that's a problem in itself again, given that boost uses its own build system, which is unlike anything else).

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings. Our library takes effort to maintain and incurs function call overhead, but it's been useful enough that we're thinking about open-sourcing it. If you're interested, please let me know.

omor1 commented 4 years ago

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings.

Unrelated to the discussion at hand, but I'm curious as to how do you deal with the opaque handles (e.g. MPI_Comm, MPI_Request) that are exposed via mpi.h? These are highly implementation-dependent features whose sizes do depend on the underlying ABI. There was discussion of exactly this issue in #159. As a concrete example: in Open MPI, handles are pointers, while in MPICH-derivatives, they are int.

omor1 commented 4 years ago

Regarding a header-only library: this sounds good if you're starting a new project, but some existing C++ projects that use MPI care a lot about build sizes and times. If we want to put something in the MPI Standard, I'd like to see some build experiments in real applications.

There are both benefits and detriments to defining the MPI C++ interface so that it can be implemented as a header-only library. An obvious benefit is that a single generic implementation may be sufficient for all underlying MPI libraries, which can ease adoption and maintenance burden. The flip side is that then there are severe restrictions on the e.g. datatypes interface, as they would be required to use the MPI C interface rather than whatever low-level representation is used by the implementation.

mhoemmen commented 4 years ago

@acdemiralp wrote:

Why not co-develop it along with the C++ standard?

Yes -- let's write a library first, then standardize it. Maybe that means becoming a Boost.MPI developer or taking over Boost.MPI development, or maybe it means starting a new library (if one can make a strong technical argument that Boost.MPI has a fundamentally flawed design).

sg0 commented 4 years ago

Thanks for initiating the discussion, Jeff.

I am unsure if a number of ubiquitous C++ idioms can be supported by an MPI C++ binding (for e.g., RAII, because a C++ destructor can be called after MPI_Finalize). As such, perhaps we can identify the C++ idioms that can be supported in a conformant way in such a binding, since in C++ there are potentially different ways to implement/design an interface.

In terms of ownership, since MPI does not own the data and request buffers (users responsibility), the C++ interface must follow suit. However, from the example mentioned by Mark H., it seems the return object of the MPI function invocation is a future. From my discussion with a few other forum members, it seems future objects can represent MPI request objects; that means the MPI C++ interface have to maintain the intermediate futures. Futures may require ownership transfer in certain cases, which involve extra copies. It seems for a C++ user, allowing an interface that accepts C++20 ranges[] could be quite useful (not using ranges from std:: but implementing it keeping the interface). But, this would require 'hiding' (hence maintaining) derived datatypes, so again I don't know if passing this responsibility to the C++ API is appropriate performance-wise (may require extra copies during scope transitions). [] https://en.cppreference.com/w/cpp/ranges https://en.cppreference.com/w/cpp/ranges

A templated C++ free-function based approach will perhaps be the easiest to implement and lead to the least overhead. But, that means we won't be making use of the modern C++ functionalities.

On Mon, Apr 27, 2020 at 7:44 PM Mark Hoemmen notifications@github.com wrote:

@acdemiralp https://github.com/acdemiralp wrote:

Why not co-develop it along with the C++ standard?

Yes -- let's write a library first, then standardize it. Maybe that means becoming a Boost.MPI developer or taking over Boost.MPI development, or maybe it means starting a new library (if one can make a strong technical argument that Boost.MPI has a fundamentally flawed design).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mpi-forum/mpi-issues/issues/288#issuecomment-620344353, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCNC6IRKDBSJIIE5643S43ROY7KBANCNFSM4MQE2IGA .

-- Sayan | https://sg0.github.io/

mhoemmen commented 4 years ago

@sg0 wrote:

However, from the example mentioned by Mark H., it seems the return object of the MPI function invocation is a future.

It would be a sender, in P0443R13 terms, not a future. Senders and receivers avoid some of the shared state issues that futures have.

In any case, I'm not necessarily advocating this design. I'm just saying that if people want that kind of design, then it should fit with how modern C++ is doing it. I'd like to see the people doing that design engage with C++ networking and executors experts.

StellarTodd commented 4 years ago

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings.

Unrelated to the discussion at hand, but I'm curious as to how do you deal with the opaque handles (e.g. MPI_Comm, MPI_Request) that are exposed via mpi.h? These are highly implementation-dependent features whose sizes do depend on the underlying ABI. There was discussion of exactly this issue in #159. As a concrete example: in Open MPI, handles are pointers, while in MPICH-derivatives, they are int.

We defined a Handle class that contains a union, and conversions methods for converting back and forth between native handles and our handles. The conversions are done in the plugin portion of the library that is compiled against a specific MPI implementation.

Since this is off topic, I don't want to get into any more details here. Feel free to contact Mark or me for further details.

jeffhammond commented 4 years ago

Given how fast the C++ Standard is moving with respect to thread and task parallelism, coroutines, networking, and reflection, it seems premature to standardize a C++ MPI interface now. Why not let all these great libraries first build experience presenting a modern C++ interface to the latest MPI features? Why repeat the mistake of the '90s and rush to standardize? I would love for someone to modernize Boost.MPI, for example; I would be happy to help with that (at least to test changes).

As a person who also considered updating Boost.MPI, and then walked through a 900+ page standard to see if it is feasible to write a full C++17 wrapper around it from scratch, and then gave up on all these due to the solo amount involved and used barebones C for MPI in an otherwise fully modern C++17 application: Why not co-develop it along with the C++ standard? The majority of the features you mention are already concrete, and even provide experimental/predecessor implementations.

You don’t have to implement everything to make an impact on the MPI Forum. If you look at the BigMPI stuff I did, I hit most or all of the relevant functions but didn’t support datatypes. People understood how to generalize.

As for Boost.MPI enhancements, adding support for nonblocking collectives, Mprobe/Mrecv, and neighborhood collectives is both important and straightforward. RMA will be hard but just leave that for now. It doesn’t make it or break it for either goal.

In any case, if you are serious about Boost.MPI3, setup a repo for it, add the classes of functionality you want to support, and start with easy stuff like Mrecv. Tag me in any issues where you need help understanding the document. It has been a while but I have read it cover to cover at least once, and the meaty stuff many times.

You might also look at code generation methods like mpiwrap from LLNL to understand how to automate away some of the tedium. It’s not designed for this purpose, but it might be useful anyways.

raffenet commented 4 years ago

FYI https://gitlab.com/correaa/boost-mpi3. I don't know any of the details of the implementation, just that it exists and some projects have investigated using it.

mhoemmen commented 4 years ago

@acdemiralp wrote:

Can https://www.mpich.org/static/docs/latest/www3/MPI_Type_create_struct.html forward the difficulties of serialization to MPI, and potentially even allow removing the dependency to Boost.Serialization?

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.

omor1 commented 4 years ago

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.

This would probably work for most POD / Trivial / StandardLayout types, but isn't portable to types that don't need all members serialized. I think most high-level C++-based APIs (thinking Charm++ and STAPL here, for instance) use user-provided pack/unpack routines to do serialization. If we can find a mechanism that allows users to easily select which fields of a class must be serialized, that would probably be the way to go.

omor1 commented 4 years ago

I believe the best practice solution to such a problem lies on the user's part: Create a smaller struct of things which will actually be serialized, and put it in a struct which also contains other stuff. If you need sequentiality, use pointer to the serialized struct in the larger struct and store them sequentially separately. Decent, intuitive solution in C++ terms.

I agree that this is indeed a nifty solution. Actually, it should be possible to make a template type with a parameter pack that serializes the types in the order given, something similar to std::tuple. That would allow use in current C++.

mhoemmen commented 4 years ago

Automagical serialization could be a footgun. I'm already uncomfortable with Boost automatically "taking care of" types that have run-time length, like std::string. It's useful for my current project, but I don't like that there could be multiple messages happening when I only typed one (what does that mean for progress of nonblocking messages, for instance?).

rabauke commented 4 years ago

@acdemiralp wrote:

Can https://www.mpich.org/static/docs/latest/www3/MPI_Type_create_struct.html forward the difficulties of serialization to MPI, and potentially even allow removing the dependency to Boost.Serialization?

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.

Actually, one can do kind of reflection for some generic types as std::tuple, std::array etc. to build MPI datatypes at run time fully automatically and not visible to the user. This was the route that I took in MPL. MPL is a C++11 header-only message passing library build around the MPI standard.

omor1 commented 4 years ago

The problem with using std::tuple and std::pair directly is that as far as I know they aren't guaranteed to be standard layout types and don't provide direct access to the underlying storage.

rabauke commented 4 years ago

@omor1 Not being standard layout types is the reason, why reflection via template magic is performed and an MPI datatype is constructed via MPI_Type_create_struct for each std::tuple type. Access to underlying member storage is gained via std::get and &. To my understanding, a restriction to standard layout types would be only required if one would send data in a memcpy-like fashion in MPI calls, e.g., by sending blocks of raw memory and using MPI_BYTE.

omor1 commented 4 years ago

Oh, I think I understand—you can get the offset from the base of the tuple and thus construct an MPI type for the tuple itself. Very clever! I'd been playing around for a bit with something similar, but I was recursively constructing structures to ensure they would be standard layout and thus be able to use offsetof, since C++ has no way to expand a parameter pack into a set of variables of those types.

VictorEijkhout commented 4 years ago

Well, this discussion went a long time before anyone mentioned MPL. I've been very impressed with MPL, which like mpi4py makes life a lot easier. For instance, data knows which type it is so for the 99.99 percent of the cases where you don't care you don't have to spell it out.

I've started incorporating MPL in my MPI book, hoping that it will find wider adoption. https://web.corral.tacc.utexas.edu/CompEdu/pdf/pcse/EijkhoutParComp.pdf

jeffhammond commented 4 years ago

@mhoemmen

why not build on the C++ networking TS?

I tried a few years ago to get the C++ networking people to support semantics other than HTTP and they were rather hostile. I proposed a fabric TS that behaved like OFI/libfabric was told I just didn't understand what the word "networking" meant.

You may have better luck, but I don't have time to teach SG14 people that Internet Protocol is not the only way to move bytes between computers.

mhoemmen commented 4 years ago

@jeffhammond Ugh, sorry to hear that. I wish I had more time to work on this.

hzhangxyz commented 3 years ago

With c++ coroutine maybe we can write something like this?

auto value = MPI::Async::Receive(xxxxxx);
something_else();
use_value(co_await value);

mhoemmen commented 3 years ago

@hzhangxyz Senders and receivers (as in P0443) might be a more natural formulation. (The latest version of P0443 synchronizes senders and receivers with coroutines.)

As @jeffhammond points out, MPI folks have had a hard time engaging with C++ networking folks. However, that should not prevent a C++ MPI interface from using things like senders and receivers.

hzhangxyz commented 3 years ago

So, mpi should implement executors/sender/receriver interface and async/await interface be wrapped from this by c++ standard? async/await interface is necessary because this improves readability.

mhoemmen commented 3 years ago

@hzhangxyz I haven't studied this problem as deeply as Jeff has. A lot of people would be very happy just being able to send and receive their custom types, without needing to write custom MPI_Datatype or pack and unpack functions. I've spent far too much of my career on MPI_Datatype, pack, and unpack. MPI_Request has caused me relatively much less pain. Thus, if I were being paid to write a C++ MPI interface, I would wait until reflection reached the C++ Standard. I'm not currently being paid to write a C++ MPI interface (would be fun though!).

hzhangxyz commented 3 years ago

@mhoemmen Hmmm, what I means is co_await interface only, which maybe used in receive, barrier or something else, not how to pack/unpack a customed type. I just feel bothered with async operators of mpi. I don't know about implement of mpi, but it seems such an interface will not force to use some specific way to send/receive data in back layer?

mhoemmen commented 3 years ago

@hzhangxyz The most important thing to remember about asynchronous MPI operations is that "nonblocking" need not mean "makes progress in the background." "Nonblocking" means "returns before it's safe to reuse the buffer." Suppose that you have two MPI processes, with Process 0 issuing an MPI_Isend, and Process 1 issuing an MPI_Irecv. It's legal for an MPI implementation to do the following:

On Process 0, MPI does nothing until MPI_Wait, then copies the send buffer into some internal local storage. At that point, MPI_Wait may return on Process 0.
On Process 1, MPI does nothing until MPI_Wait, at which point it blocks until it gets the message from Process 0.
At some unspecified point in the future (that respects message ordering rules), Process 0 actually sends data from its internal local storage to Process 1.
Process 1 continues to block until it gets all the data, at which point MPI_Wait returns on Process 1.

MPI doesn't necessarily do this, but the point is that you can't assume that MPI makes progress in the background. This matters a lot for things like MPI_Iallreduce, which involve several rounds of sending and receiving messages. You can ask MPI to do this automatically, but sometimes it's faster to drive MPI progress manually. (Paul Eller wrote a UIUC PhD dissertation on this recently.) All this suggests that an interface that hides the details of MPI asynchrony might not be a zero-overhead abstraction.

acdemiralp commented 3 years ago

Hello, I implemented a 4.0 wrapper using C++20 (to my best knowledge) here: https://github.com/acdemiralp/mpi

It covers almost all of MPI (see the coverage in readme) but is not thoroughly tested yet. Aside from testing and making sure it is convenient to use, I plan the following improvements: https://github.com/acdemiralp/mpi/projects/1

I'm fully open to feedback.

correaa commented 2 years ago

Hi all,

Sorry I am late to the conversation. Thank you @raffenet for the mention. I am the author of https://gitlab.com/correaa/boost-mpi3.

Please let me know if you have any feedback or feature request on the library. If you are not using the library I would still like to know what feature(s) would you like the library to have in order for you to use it.

The library is also mirrored here https://github.com/LLNL/B-MPI3; there is some recent support from LLNL to improve the library and promote it. For example, to extend the documentation for any particular topic.

I have been thinking about all the points raised in this thread so far, I think they are very good points. I would be happy to discuss them one by one here or elsewhere. In this new phase I will concentrate in thread-compatibility and non-blocking operations, which have partial support at this point.

I am receiving feature requests here: https://gitlab.com/correaa/boost-mpi3/-/issues https://github.com/LLNL/b-mpi3/issues

I am open to merge requests (thanks to all that forked so far) as well.

I think part of the goals of the library is to incorporate and facilitate existing/proven usage patterns, beyond the trivial ones in the basic literature. So if you have neat examples I would be happy to rewrite them with B-MPI3.

sg0 commented 2 years ago

We recently wrote a paper (in ExaMPI 2021 workshop) about our experiences with MPL and discussed prospects of modern C++ abstractions in the context of MPI-4/5. We extracted a subset from MPL and translated a few benchmarks from OSU and the LULESH miniapp. https://github.com/mpi-advance/mpl-subset

I would like to compare (not necessarily performance) with the implementations of @acdemiralp and @correaa - thanks for the introduction.

The Languages WG convenes every two weeks and I recommend folks to join.

bangerth commented 2 years ago

My StackExchange answer from 2013 is posted at the very top of this thread, but it's been 9 years and so let me add a couple more thoughts I've had since then. While I would love to have things such as chainable .then([&](){...}.then([&](){...}, I recognize that these are a substantial deviation from what MPI has so far provided.

Here are a couple smaller things that shouldn't actually be that hard to do:

MPI defines a lot of constants (e.g., MPI_INT, MPI_COMM_WORLD) that in practice are implemented via some magic #define something. It would already be nice if these could be made C++ constexpr things. (See also https://github.com/open-mpi/ompi/issues/10017.)
At the very top of every C++ programmer's wishlist should be interfaces to MPI that actually return whatever they are producing by-value, rather than through arguments; as a corollary, they would have to indicate errors through exceptions (which would also have the benefit that one could indicate errors using a reasonable mechanism, rather than mapping them onto nonzero integers). The current interface is really very clunky, for two reasons: (i) I'd venture the guess that 50% of all MPI calls "forget" to do error checking because it requires the definition of an otherwise unused ierr variable and writing a second line that checks whether it is nonzero, and (ii) a lot of things can't be expressed in a single statement: for example, you can't write if (MPI_Comm_rank(my_communicator) == 0) { /* I'm root */ }. Right now you need three: The declaration of an otherwise unused variable, calling MPI_Comm_rank, and then the if statement. In reality, you need four because you also want to error check the call to MPI_Comm_rank. So 4x the amount of code one really needs. MPI programs would be a lot more expressive and a lot easier to read if we could get away from the "functions return errors via integer return codes" model. Imagine how easy a code would be to read if, for example, immediate functions were to return an MPI_Request objects :-)

Here is a medium-sized thing or two:

If we already accept that MPI functions should return objects by-value and express errors via exceptions, then it is a relatively small step to say that the immediate functions (like Isend, Irecv) shouldn't just return an MPI_Request object, but instead something like a std::future<void> that one can .wait() for. This isn't actually difficult to implement on top of the current interfaces: One creates an MPI_Request, calls the existing Isend/recv functions, and then puts the MPI_Wait operation on the request into a lambda function that is given to std::async whose std::future return object is then returned to the user.
C++ programs being what they are, they often want to send around unstructured data of variable size. As I learned recently in deal.II, unstructured some-to-some operations are not all that easy to implement. In essence, I'd like to do something along the lines of sending a std::map</*target_rank=*/int,/*target_message=*/T> around and receive a std::map</*sender_rank=*/int,/*sender_message=*/T> back. I would already be happy if that could be done with a C interface where T is one of the usual MPI-supported data types of arrays thereof. So this is not really a C++ question as much as a request for a C interface where sender/receiver ranks could be given as arrays and the messages through an array of pointers (or maybe offsets into a long array of type T[]). But it surely would be nice to get std::map objects.

Other things that are high on my list, but harder to achieve:

We've recently really run into a lot of bugs where something exceeds 2^31 bytes. The current interfaces just silently fail. At the very least, make the count arguments to MPI functions 64-bit. I recognize that that breaks the ABI, so understand that that is difficult.
In deal.II, we have ended up writing a lot of interfaces around MPI functions that can send and receive arbitrary data types. In deal.II, we do this by utilizing the BOOST serialization framework (any other serialization library would do as well). It enables a very nice way to send around any data type, not just the handful of supported types MPI currently knows of. Of course, it may also be much less efficient to do it that way. It also adds a dependency on exactly how objects are serialized. As a consequence, I'm not sure this is useful to do in MPI.

correaa commented 2 years ago

Hi,

When I got my hands on the article "MPI Language Bindings are Holding MPI Back", I wrote a couple of notes (for myself) as friendly critique to the paper. I don't disagree with the paper, I just think that there are harder problems than the ones mentioned in the paper:

I will leave here the link to these notes: https://gitlab.com/correaa/boost-mpi3/-/wikis/A-critique-on-%22MPI-Language-Bindings-are-Holding-MPI-Back%22 .

In addition to that, to add to what @bangerth just wrote,

0) constexpr. MPI is a runtime system and even if somethings could be defined constexpr I don't think the system can do much with them in terms of composing more compile time operations. constexpr Datatypes seems like something useful although the MPI system has be able "compile" or "bless" them at compile time for them to be useful. Also, I would say that is a very particular subset of all useful Datatypes (e.g. arrays of dynamic size).

1) Continuations: The problem of "continuations" is also very important and I myself have a pressing need for this in the C++ interface I propose, because except for trivial datatypes and trivial datastructures (arrays) I almost always need to attach encoding or decoding tasks to the communication task. Sometimes what I need can be regarded as a continuation (like decoding a serialized packet) but sometimes is something that needs to be executed before the communication task, like packing data asynchronically. So generically what I need is to be able to reuse the available MPI threads to piggy back, at the least, some O(N) data manipulation. I started doing things in this direction but I left it for lack of time.

A related problem with asynchronous operations is to see if there is any idiom available to C++ that can allow "marking" data or values as being "locked" into a request, perhaps by some combination of smart pointer (for ranges) or move sematics (for values). Or, for that matter, anything where a static analyzer, or the compiler, can help. (e.g. something similar to "use variable after move", or in this case "use variable after asynchronous request has started but not finished")

2) Value semantics: I couldn't agree more. In my library, I experimented with two types of interfaces. One that takes iterators generically, and the other deals with values and incidentally defines the concept of a "process". As can be seen in the examples. A collection can be send an received in the canonical form:

std::vector<double> v = ...;
std::vector<double> w = ...;
comm.send(v.begin(), v.end(), 1 );
comm.receive(w.begin(), w.end(), 0);  // see elsewhere the discussions on a less redundant interface comm.receive(w.begin(), 0)

or a value based interface:

comm[1] << v;
comm[0] >> w;

(see details here: https://gitlab.com/correaa/boost-mpi3/-/blob/master/test/process.cpp#L51-58)

Note that I am all for dealing with values, but not necessarily "return" them from functions. Returning values is not natural for IO in my opinion, and always tends to generate more allocations than needed. (Think of the case when w doesn't need to be resized above)

Interestingly, move semantics can implicitly hint the library to use asynchronous operations, which would simplify the interface tremendously. For example: (This is not implemented yet).

auto unique_req = (comm[1] << std::move(v));
comm[0] >> w;
... // v cannot be (mostly) use yet, and it is clear to the user (and to a static analyzer
v = unique_req.get();

This is not perfect still, because std::move still allows operations with no preconditions to be performed on the variable. In Rust one can "steal" the variable completely but I am not aware how to do it in C++, except for the partial solution above. So, for the idiom to really work and be fool proof one needs to really move v into unque_req above.

3) I also agree that error handling should be done via exceptions. The hard part is to write exception safe code around it, including MPI (or MPI interface) code. I also have the view that exceptions should not be the defined behavior of logical errors. (They can be defacto implementation of undefined behavior but one shouldn't be forced to handle them.) The point is that when I see the error codes reported by MPI functions, 90% of them are logical errors. (For example, invalid communicator.) Some basic functions do not report any non-logical errors anyway, even if we all know that they can happen they are not reported error codes, which begs the question, what can we do from the C++ perspective really? One would expect to get runtime errors when the network is down or things like that but they are not reported AFAIK. Perhaphs I don't know enough to have an opinion about this.

4) At the time, when asked by LLNL I contributed my two cents about big count. The main idea I transmitted was that without big count it was impossible to send data structures such as std::deque and datatypes wouldn't help, because it is not a matter of the number of elements, but the size of gaps between elements comming from independent allocations. I started implementing a fallback mechanism for when big "pointer differences" or big "number of elements" are implicitly used but is was a lot of work.

5) I have strong opinions about serialization, I think it is fundamental. Serialization is an integral part of value semantics and regular types. Datatypes is at best an optimization over serialization and it doesn't cover all cases. Boost.Serialization (what I use) has lots of issues, specially not being header only and being old but it is a good canonical model. What I am working on is into having the option to use different serialization backend, such as Cereal.

VictorEijkhout commented 2 years ago

On , 2022Feb16, at 17:26, Wolfgang Bangerth @.**@.>> wrote:

If we already accept that MPI functions should return objects by-value and express errors via exceptions, then it is a relatively small step to say that the immediate functions (like Isend, Irecv) shouldn't just return an MPI_Request object, but instead something like a std::future that one can .wait() for.

“Std::future” is a loaded term that comes with a lot of baggage. (Am I the only one to think that C++ threading is a mess?)

The MPL interface to MPI has:

auto request = comm.isend( stuff ); request.wait();

What are you wanting beyond that? A lot of the “std::future” functionality would require wrapping MPI_Test/Probe to realize, and that would take it far from the C/F interface to MPI.

We've recently really run into a lot of bugs where something exceeds 2^31 bytes.

MPI 4 at your service.

Victor.

bangerth commented 2 years ago

On 2/17/22 11:17, Victor Eijkhout wrote:

“Std::future” is a loaded term that comes with a lot of baggage. (Am I the only one to think that C++ threading is a mess?)

The MPL interface to MPI has:

auto request = comm.isend( stuff ); request.wait();

What are you wanting beyond that? A lot of the “std::future” functionality would require wrapping MPI_Test/Probe to realize, and that would take it far from the C/F interface to MPI.

In the end, std::future isn't so bad. How you internally implement making the future "ready" is something independent of the interface chosen. std::future has the advantage that everyone is familiar with it, and that it allows storing an exception in it if the communication ends up failing; it can also be shared. Inventing a different solution has its costs as well.

But these are all ancillary considerations. The purpose of this 'issue' is to collect ideas.

jacobmerson commented 2 years ago

As @VictorEijkhout says in C++ futures are a bit of a loaded term and use of std::future cause all sorts of lifetime/state issues and is not particularly performant due to this need of shared state. I think any forward looking C++ MPI API should consider the async utilities that are coming into the language via coroutines and std::execution/p2300.

bangerth commented 2 years ago

On 2/17/22 12:09, Jacob Merson wrote:

As @VictorEijkhout https://github.com/VictorEijkhout says in C++ futures are a bit of a loaded term and use of |std::future| cause all sorts of lifetime/state issues and is not particularly performant due to this need of shared state. I think any forward looking C++ MPI API should consider the async utilities that are coming into the language via coroutines and |std::execution|/p2300 http://wg21.link/p2300.

I'm all for this kind of stuff. But do you want to standardize on things that are only available in C++23 or C++26? It's going to be many many years before a lot of project will be able to use this -- most large high performance projects lag about five years behind C++ standards because that's how long it takes for everyone to have compilers that support a standard. So if std::execution is part of C++26, most projects might be willing to use interfaces built on it in ~2031. Or you could standardize on C++11 or C++14 features and projects can start using these interfaces now.

Of course this all assumes the MPI forum has any inclination to provide C++ interfaces to begin with, and do within the next few years.

sg0 commented 2 years ago

Technical reasons aside, there has to be some dedicated funding for getting this work done, since this is not just forum participation and developing myriad modern C++ language bindings. I contributed to 3 LDRD open calls and one DOE proposal solicitation (jointly with more established/senior scientists in this area) in the last 3 years in trying to get some funding for this work - all of them failed (I am still trying, but mostly pessimistic). I think there is perhaps limited incentive structure for this work in the minds of the senior people, at least in US DOE.

bkmgit commented 2 years ago

US DOE has traditionally been important, but MPI is used in a wide range of codes. An important additional consideration is use in industry. Examining software such as OpenFOAM may be helpful to get some idea of used features. Some C++ applications may also choose to directly build on top of UCX.

correaa commented 2 years ago

@bangerth,

The good thing about the word "future" (and continuations) is that many people knows what it means and it is a good initial sketch in principle.

Having said that, it is important to recognize the std::future in its current status might be too general and too heavy weight for some family of basic task. Coincidentally on this family there are things that are very related to message passing.

First, std::future are not ideal because they do type erasure on the task (sort of like std::function), they are quite flexible but the best option in all cases. Second std::future contemplates the possibility of tasks failing (throwing) and that has a cost. It also typically needs to allocate the return object, which in turn can be a failure point.

What I found in my experiments is that from the outset, before and after sending a message there is the typical need for encoding and decoding messages (for example [de]serialization). These are the specific tasks we should consider before going to the more general case of an arbitrary continuation. In fact, while decoding can be seen as continuation, encoding is not, it is more like a prolog.

Also, it is interesting to consider that encoding and decoding tasks can be made/programmed in such a way that they cannot fail (and not throw). Therefore in principle it is possible to disregard exceptions in this context.

Additionally, as I mentioned in other posts, I don't think that returning objects or values are a good idea, and this extends to asynchronous messaging too. There are several reason for that and even a specific reason in this context. If these future-like request return iterators-like objects instead of new value then we don't need to even worry about exceptions thrown during construction.

In summary, for request or future-likes that do not return values and are that restricted to only do encoding and decoding (or more generally epilogs or prologs that cannot fail and be noexcept) the implementation doesn't need to be as complicated or as heavy as what std::future offers right now.

Feedback on these ideas will be appreciated too.

correaa commented 2 years ago

You can use a std::expected instead of throwing. Even nicer is to allow both via macros.

any problem can be solved adding a level of indirection, except too many levels of indirection. (std::expected is the indirection here)

More seriously, i think returning values (or expected) do not reflect what MPI communication ultimately is, IO. In the IO picture, object exists (maybe in unspecified but valid state) before communication.

returning values forces allocation even in cases where it is obvious it is not needed. (think of the case of receiving into a vector that already has enough capacity to receive the number of elements sent)

I do not understand why you are occupied with the idea of byte-level serialization, which to my knowledge is last resort practice.

i don't know in general, but in my case it is not byte-level serialization. the fundamental block of serialization are typed packages of basic types. i call it encoding for the lack of a better word. what i refer to is a standard transformation of a data structure into packed format that both ends of a message have to agree upon. also, byte-level serialization would break endianness compatibility, which, i won't defend, but it is a nice to have.

If you have proper reflection, or even precise flat reflection like MPL's or Boost.PFR, you often do not need byte-level serialization.

(static) reflection can get you so far. it doesn't solve all the problems. reflection is ok for generating custom data types which can be known at compilation but not much more. it doesn't help with dynamic data structures (e.g. a multi block data structure, like std::queue or a CSR matrix) or MPI data types that in practice would take about the same memory as the size of the message itself (e.g. std::list).

I also do not understand what problem you have between std::future and serialization.

No problem, i am just pointing out that std::future are made to handle almost any kind of tasks.

And serialization, that is an important example for the need “ continuation", is not a general task, but a simpler one.

If you want one or more intermediate (de)serialization steps that are not async, then make them async compatible via https://en.cppreference.com/w/cpp/experimental/make_ready_future instead of opening callback points for them or using asymmetrical packing and unpacking to confuse the user.

i have to think about that. yes, the idea is that generic asynchronous messaging (like in BMPI3) needs preprocessing or postpropocessing.

i would like to make this processing 1) asynchronous also, 2) optimally use the resources (threads, buffers) already given to MPI. i don’t know how to do exactly yet. this part is also work in progress.

Which iterators? Iterators of contiguous sequential containers (span, string, valarray, vector<!bool>)? Or iterators of non-contiguous sequential containers (deque, forward list, list, vector<bool>)? Or iterators of associative containers (map, unordered map, set, unordered set)?

All of the above, depends on the case. It can even be pure input and output iterators. (not that i recommend using them).

The BMPI3 "basic" interface is iterator-based, as you indicate. it also returns other (new) iterators in the cases where the internal computations are hard or impossible to replicate outside the message call.

(STL is designed with the same philosophy, although not always got it right).

the asynchronous versions are not different in principle, in the sense that the request could return (like via future::get) iterators. This is work in progress.

The latter two do not ensure contiguity, whereas MPI often prerequisites contiguity.

sure, low level interfaces require contiguity. (think of memcpy)

high level interfaces try to take advantage of them through direct or indirect means, even when data is not contiguous. they do whatever possible for them with whatever resources it has available, heuristics, buffer, pinned memory, data types, packed-level serialization, byte-level serialization, etc. and yes, under sufficiently complex situations they can fail to do their job efficiently (while still doing the job correctly).

MPI forces a C mentality, we think how to use them through contiguos arrays, and it is fine. BMPI3 has a C++ mentality, (or STL). It will try to do the best job possible and the idea is to have a decent base level of quality of implementation which will be work in progress for a while, and any help will be appreciated.

This is also confusing to me in your library Boost.MPI3. What happens when I pass a std::unordered_map::begin() and std::unordered_map::end() to your functions that accept iterators? Does my map get copied to contiguous memory e.g. a std::vector<std::pair> and then transmitted?

very good question. (the answer has many corner cases because you didn't say what are the element types, but i am going to ignore this and assume the best possible scenario, that the datatype is a builtin).

but, yes, broadly speaking, what you describe is a good starting point solution. (i will add some levels of details as we go.) after all what is the alternative otherwise? partition the message in N smaller messages with one element (or pair) each)? that is, as you know unacceptable.

the solution you propose works and one has to accept that the user had a very good reason to use a unordered_map to begin with. the user has to know the cost of transversal in general and communication in particular of such specialized data structure.

an important point before continuing is that if you pass a pair of iterators the library lost already the information that the container is associative.

the only information that it has is that the range is defined by a pair of iterators that are bidirectional iterators and that the elements are decomposable as pairs.

Where does that std::vector<std::pair> live if the call is immediate?

ok, yes, assuming we are going this route then the vector lives in some sort of free store. a possible candidate is the default heap (std::allocator) and that would work.

but we can do better, we have access to the MPI system as well, and to the communicator, with all its hypothetical buffers. we also know we are copying to the vector for the sake of communicating, nothing else.

Therefore what the library should do is to put the vector in MPI pinned memory, which if it is available, can make the communication faster).

what if there is no enough pinned memory?, well, then a series of few smaller intermediate vectors can built and sent, one at a time.

if many vectors are necessary to be constructed and destructed maybe it also a good idea not to allocate each one and use a single one or use a specialized arena allocator.

so as you see, it can get intricate internally. there are levels of optimizations one can take advantage from.

is this the only way to do this? no, i can also take advantage that the elements are pairs so can construct two vectors one for each type. i am not doing this, maybe if it is proven to work across multiple systems, one can write (inside the library) special code for this. what i am trying to illustrate is that one can optimize up to different levels.

What about std::vector::begin() and std::vector::end()? Do you still make a copy like you would in the std::map case or do you somehow detect it and avoid the copy?

no, I don't, first of all, at this point I have a temporary vector and I can send it directly. i know it is a vector.

but anyway, if you were to pass a vector::begin() and vector::end() the library (not necessarily with your help) detects that these are random-access and contiguous iterators so it knows how to handle this case, without intermediate copies.

i will stop the details of what i am doing internally here. i hope the idea is clear even if you disagree with it in general or in the details. the important point is that this is all internal to the library.

You see? Iterators are confusing in this context.

sorry, no, i don't see. what is confusing about this? this is work that the library does for you. if the implementation i described confuses you that’s fine: it is just that, an implementation; it is enough for you to know that an unordered_map has costly transversal and it is not contiguous. and if your dataset is small enough you can even get away with not knowing that.

when you use iterators… do you worry if they use memcpy at some point below? maybe, maybe not. if you don't have many elements you might not care. of course if you want performance you need to know your data structures: do not expect that unsorted_map would be able to take much advantage of hardware or low level MPI primitives.

to finish, the two types of iterators that you mentioned belong to two different iterator categories, and they naturally have different performance guarantees.

In summary, for request or future-likes that do not return values and are that restricted to only do encoding and decoding (or more generally epilogs or prologs that cannot fail and be noexcept) the implementation doesn't need to be as complicated or as heavy as what std::future offers right now.

Yes as you can see in the 89 liner above.

Yes to what exactly? (what is the “89 liner”?)

yes to that prologues and epilogues do need to be handled by things as heavy as futures?

maybe, i didn't write all the possible epilogues and prologues that could be necessary so, yes, this is, until proven correct a guess. the fundamental difference is that prologues and preambles do not need to return values, like future are designed to do. my prologues work with elements that are already there in some sense, the do not need to return anything "new".

Thank you for your questions. -- A

bangerth commented 2 years ago

On 2/19/22 12:55, Alfredo Correa wrote:

More seriously, i think returning values (or expected) do not reflect what MPI communication ultimately is, IO. In the IO picture, object exists (maybe in unspecified but valid state) before communication.

Just to be clear, this is not what I wanted to advocate for. The actual send and receive buffers should be allocated by the user. It is things such as the output integer arguments of MPI_Comm_rank and MPI_Comm_size that would be nice to return, as well as MPI_Request objects by immediate functions.

correaa commented 2 years ago

On 2/19/22 12:55, Alfredo Correa wrote:

More seriously, i think returning values (or expected) do not reflect what MPI communication ultimately is, IO. In the IO picture, object exists (maybe in unspecified but valid state) before communication.

Just to be clear, this is not what I wanted to advocate for. The actual send and receive buffers should be allocated by the user. It is things such as the output integer arguments of MPI_Comm_rank and MPI_Comm_size that would be nice to return, as well as MPI_Request objects by immediate functions.

thank you for the very important clarification.

if you are referring to your quote "return whatever they are producing by-value, rather than through arguments; ...", and by values you didn't mean the values of the communicated data, then, yes, i am in the same page.

maybe @acdemiralp was referring to the same thing as well and i also misinterpreted.

mhoemmen commented 2 years ago

I'm all for this kind of stuff. But do you want to standardize on things that are only available in C++23 or C++26?

P2300 won't make C++23, though it has a good chance at C++26.
I've seen plenty of MPI 1.x code in the wild. This suggests that people shouldn't worry about requiring newer versions of a programming language in newer versions of MPI, because users will always be able to fall back to implementations of older MPI versions.
That being said, a standard should standardize existing practice. Thus, I'd rather see one or more examples of a senders/receivers-based C++ MPI interface first, before considering its standardization. P2300 is a library solution with existing implementations, so interested parties should feel welcome to try this. P2300's authors are open to considering more use cases, so now would be a good time to explore using senders/receivers.
I think MPI (2-sided or 1-sided) is a poor match for senders/receivers, but am open to discussion.

VictorEijkhout commented 2 years ago

On , 2022Feb21, at 10:21, Mark Hoemmen @.**@.>> wrote:

I'm all for this kind of stuff. But do you want to standardize on things that are only available in C++23 or C++26?

I’m all for letting the C++ interface be “syntactic sugar” around MPI:

tag=0 by default
Wolfgang’s function result for Comm_rank & Isend
no receive buffer for non-root reduction

Considering what a terrible mess threading is in C++ (every next standard seems to say “Oh no, we should have done it this way”) I think it’s a bad idea to adopt that terminology for sends/requests/whatever.

I think MPL is striking a good balance: C++17 where it simplifies expression, but no introduction of syntax with loaded meaning.

Victor.

mhoemmen commented 2 years ago

@VictorEijkhout wrote:

Considering what a terrible mess threading is in C++ (every next standard seems to say “Oh no, we should have done it this way”)....

I'll fight you on that one, my friend Victor : - ) .

std::thread is a perfectly fine wrapper for an operating system thread. It never aimed to be anything more.
Regarding "every next standard seems to say...," the only way in which the Standard has actually changed was in discouraging use of release-consume memory ordering. That came out of some recent academic work. I've never seen code in the wild that uses this ordering.
I've written and used thread-parallel C++ code for over a decade. It works fine and it runs at scale.

You don't have to like C++, but phrases like "terrible mess" just aren't accurate. I would say MPI is a bigger mess; consider, for example, how long it's taking the community of MPI experts to decide what MPI_THREAD_MULTIPLE means.

ibaned commented 2 years ago

Reading through some of this discussion, it strikes me that the primary pitfall is the sheer size and complexity of ISO C++ and the temptation to ask ourselves how an MPI interface might be compatible with every single feature of C++.

Thinking of how an MPI interface could interact with ranges, reflection, threading, executors, etc. is an exciting exercise but seems to lead to an MPI interface that is as large as the ISO C++ standard itself.

My thought is that the C++ interface to MPI should look more like the MPI standard than the ISO C++ standard. By this I mean that it should mainly consist of applying tried-and-true (albeit less exciting) C++ features consistently over the whole interface. I'm convinced enough of this principle of simplicity that I made a C++ interface to MPI that I am using in large projects:

https://github.com/sandialabs/mpicpp

Here are the tried-and-true, non-controversial and non-daunting features of C++ that it applies to MPI so far:

RAII for requests, communicators, etc. with unique ownership and move semantics. This also encompasses non-blocking semantics by having the destructor of a request wait on the request. Ignoring a returned request is equivalent to calling a blocking function.
Exception-based error handling. Throws exceptions everywhere that the C MPI interface returns an error code.
Deduction of MPI_Datatype for C++ types but only for pre-defined MPI_Datatypes

Personally, I don't currently have code that sends user-defined structs or maps of lists that is begging for reflection, nor code that calls MPI from multiple threads that would really benefit from concurrency compatibility.

I think a minimal system like this would be a good starting point, and over time it can add compatibility with more and more C++ features. Adding compatibility with a new feature should consider carefully the maintenance cost of this part of the MPI C++ interface (both standardization and implementation), the stability and user experience of the C++ feature itself, and the clear benefit to existing users of MPI.