mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

Add truly asynchronous MPI operations #585

Open qkoziol opened 1 year ago

qkoziol commented 1 year ago

Problem

Many modern HPC systems have hardware offload capabilities for MPI operations that could execute communication and I/O asynchronously, concurrently with application computation. However, MPI does not currently provide operations that take advantage of these capabilities in a coherent and user-friendly manner.

Proposal

Improve the performance of MPI applications in several ways:

- reduce synchronization penalties for collective operations,
- fully overlap communication, I/O, and computation,
- allow sequences of MPI operations to be optimized together, and
- enable more MPI operations to execute concurrently with application computation.
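
For context, full overlap today generally requires the application to drive progress itself by polling between slices of computation. A minimal sketch of that pattern using today's API (compute_chunk is a hypothetical application routine):

```cpp
#include <mpi.h>

void compute_chunk();  // hypothetical slice of application work

// Today's manual-overlap pattern: poll MPI_Test between compute slices so
// the library has a chance to progress the transfer. The proposal aims to
// make this polling unnecessary.
void overlapped_send(const double* buf, int count, int peer, MPI_Comm comm) {
    MPI_Request req;
    MPI_Isend(buf, count, MPI_DOUBLE, peer, 0, comm, &req);

    int done = 0;
    while (!done) {
        compute_chunk();
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // also drives progress
    }
}
```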

Changes to the Text

This will probably require an additional chapter, and possibly more than one.

Impact on Implementations

Implementations will require new capabilities, mainly focused on asynchronous execution and on dependencies between operations.

Impact on Users

Building a robust set of extensions that add “true” asynchronous operations to MPI can achieve the goal of improving application performance by reducing the application-visible cost of communication and I/O to nearly zero.

Additional benefits of achieving the primary performance goals in an elegant and well-designed way are:

- an improved ‘user experience’ for developers using asynchronous (currently ‘nonblocking’) operations,
- hiding the latency of operations,
- exposing more opportunities for optimizing the performance of MPI operations,
- enabling offload of more operations to networking hardware, and
- enabling applications to build powerful data-movement orchestration meta-operations.

References and Pull Requests

None yet.

hppritcha commented 1 year ago

I think it would be interesting to see whether these kinds of concepts might fit better in more modern languages than C and Fortran. How would this proposal look if one were to use futures/promises in C++ or Rust, for example?
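
For illustration, here is a minimal sketch of how a future-based shim could be layered over today's C API (future_send is a name invented for this sketch, not part of any MPI binding):

```cpp
#include <mpi.h>
#include <future>

// A future-returning send built on today's nonblocking API. The limitation
// is visible in the lambda: completion still happens inside MPI_Wait, so
// this emulation does not by itself provide asynchronous progress.
std::future<void> future_send(const void* buf, int count, MPI_Datatype type,
                              int dest, int tag, MPI_Comm comm) {
    MPI_Request req;
    MPI_Isend(buf, count, type, dest, tag, comm, &req);
    return std::async(std::launch::deferred, [req]() mutable {
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // progress is driven here
    });
}
```

A native futures/promises design would instead complete the future from the runtime or the NIC itself, which is precisely what requires guaranteed asynchronous progress.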

icompres commented 1 year ago

@tonyskjellum is this related to ExaMPI?

My understanding is that this is an implementation aspect. Or is there an inherent limitation in the MPI spec itself?

jprotze commented 1 year ago

From my perspective, the continuations proposal (https://github.com/mpiwg-hybrid/hybrid-issues/issues/6) tries to provide one aspect of truly asynchronous MPI operations: getting notified when the operation is done. The proposal also has thoughts on how to drive progress (allowing implementation on hardware that does not support asynchronous progress). EuroMPI'20 had two papers on this topic.

Asynchronous I/O in libc provides a similar notification mechanism using signals/signal handlers: https://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html
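
For reference, a minimal sketch of that POSIX/glibc mechanism: aio_read returns immediately and completion is delivered by raising a signal (link with -lrt on Linux; the file name is hypothetical):

```cpp
#include <aio.h>
#include <csignal>
#include <fcntl.h>
#include <unistd.h>

// Completion handler, invoked when the read finishes. Only
// async-signal-safe work is allowed in a real handler.
static void io_done(int, siginfo_t* info, void*) {
    struct aiocb* cb = static_cast<struct aiocb*>(info->si_value.sival_ptr);
    (void)cb;  // e.g., set a flag that the main loop checks
}

int main() {
    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = io_done;
    sigaction(SIGUSR1, &sa, nullptr);

    static char buf[4096];
    struct aiocb cb = {};
    cb.aio_fildes = open("data.bin", O_RDONLY);   // hypothetical input file
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;  // notify via a signal...
    cb.aio_sigevent.sigev_signo = SIGUSR1;        // ...namely SIGUSR1
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;

    aio_read(&cb);   // returns immediately; completion arrives as SIGUSR1
    pause();         // application computation would go here instead
    return 0;
}
```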

wesbland commented 1 year ago

@qkoziol For now, I’m going to drop the WG tags. Most likely, this will involve a new WG that would work with the existing WGs or chapter committees. Probably not much point in putting it in every WG’s queue in the meantime.

jeffhammond commented 1 year ago

Step 1 in making MPI asynchronous is to require progress in asynchronous operations. The MPI Forum has for decades lacked the courage to require any form of asynchronous progress whatsoever, and that attitude will stand in the way of anything else we try to do on this front.
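
Concretely, the gap is that nothing obliges an implementation to move data between MPI calls. In the common pattern below, a weak-progress implementation may not even start the transfer until the final wait (compute_for_a_long_time is a hypothetical application routine):

```cpp
#include <mpi.h>

void compute_for_a_long_time();  // hypothetical application routine

void weak_progress(const double* buf, int n, int peer, MPI_Comm comm) {
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, peer, 0, comm, &req);

    compute_for_a_long_time();   // no MPI calls: the message may sit idle

    MPI_Wait(&req, MPI_STATUS_IGNORE);  // data may move only here
}
```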

jeffhammond commented 1 year ago

It used to be reasonable to view compute resources as valuable and scarce when we had fewer than 20 cores per node, but now every server platform has at least 100 hardware threads per node, and very few applications use them all. The additional threads per core are almost always unused, because shared-memory MPI and thread scaling suck: doubling N in synchronization operations while getting no additional throughput is bad. Furthermore, the performance of server platforms with bandwidth-limited workloads often degrades as applications utilize all the cores.

In short, we have more than enough cores to dedicate some of them to MPI progress. The problem that remains is coarse-grained locking inside MPI, due to message-queue designs from the past, although I know some implementers have been trying to fix this since the Blue Gene era.

We also need the implementation community to take a serious interest in the "my application is not a clown car" info assertions and allow well-behaved use cases to take advantage of hardware tag matching, etc.
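
Those are the MPI 4.0 communicator info assertions. A sketch of how an application declares itself well behaved, which is what permits hardware tag matching and similar optimizations:

```cpp
#include <mpi.h>

// MPI 4.0 communicator assertions: promise that this communicator never
// uses MPI_ANY_SOURCE or MPI_ANY_TAG and does not rely on message ordering,
// so the implementation may offload matching to the NIC.
MPI_Comm make_asserted_comm(MPI_Comm comm) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");
    MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");

    MPI_Comm asserted;
    MPI_Comm_dup_with_info(comm, info, &asserted);
    MPI_Info_free(&info);
    return asserted;
}
```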

We already know how to implement asynchronous progress for communication that uses MPI buffers (i.e., ones allocated in interprocess shared memory) or when one can enable OS-level support for it.

My guess is that we'll have smart NICs that can support MPI asynchrony almost everywhere before the MPI Forum will vote to require asynchronous progress, so I'm not sure whether we should spin up a new WG until it is proven that semantics are actually the bottleneck.

wgropp commented 1 year ago

I agree with Jeff. I can still see room for optimizing certain applications (particularly BSP-style ones) with a weaker progress implementation, but that should be just that: a special optimization justified by application need. The default should be good progress.

tonyskjellum commented 1 year ago

I am obviously in concurrence with Bill's (@wgropp) latest comment :-)