devreal commented 2 years ago

Problem

The standard requires that updates from single-element RMA accumulate functions (MPI_Fetch_and_op) and bulk accumulate functions (MPI_Accumulate) are atomic with respect to each other. Since the number of elements passed to MPI_Accumulate is not known a priori, implementations typically fall back to a scheme that provides high throughput for large element counts at the cost of higher latency for small (single-element) accumulate operations and, in some cases, a progress dependency at the target. This makes RMA accumulate operations less than ideal for applications that want to use low-latency network atomic operations.

Proposal

Add an info key that allows the application to specify a preference for either latency or throughput of accumulate operations.

Changes to the Text

Add a new info key in the RMA chapter.

Impact on Implementations

In general: none, since it is only an info key, which implementations are free to ignore. Implementations that want to honor it have to add support for the info key and provide two pathways for implementing RMA accumulate operations.
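
As a rough sketch of what the two pathways could mean inside an implementation, the following C fragment picks one accumulate mechanism per window. All type, function, and value names are hypothetical and do not come from any existing implementation. The sketch assumes the choice is made once at window creation, so that MPI_Fetch_and_op and MPI_Accumulate on the same window use the same mechanism and stay atomic with respect to each other.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. */
typedef enum { ACC_PREFER_THROUGHPUT, ACC_PREFER_LATENCY } acc_preference;
typedef enum { ACC_PATH_SOFTWARE, ACC_PATH_NETWORK_ATOMIC } acc_path;

/* Maps the info value to a preference. A missing or unrecognized value
 * keeps the throughput-oriented default. */
acc_preference parse_acc_preference(const char *value)
{
    if (value != NULL && strcmp(value, "latency") == 0)
        return ACC_PREFER_LATENCY;
    return ACC_PREFER_THROUGHPUT;
}

/* Chooses the mechanism used for every accumulate operation on a window.
 * Network atomics are used only if the application asked for low latency
 * and the network offers suitable atomic operations. */
acc_path select_acc_path(acc_preference pref, int nic_has_atomics)
{
    if (pref == ACC_PREFER_LATENCY && nic_has_atomics)
        return ACC_PATH_NETWORK_ATOMIC;
    return ACC_PATH_SOFTWARE;
}
```

With the network-atomic path selected, a bulk MPI_Accumulate would presumably have to be issued element by element, which is the throughput cost the application accepts when it asks for low latency.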

Impact on Users

Making use of atomic memory operations in the NIC is useful for some applications. With this info key, users no longer have to rely on implementations to make the right choice for them, which implementations currently do not do.

References and Pull Requests

https://github.com/mpi-forum/mpi-standard/pull/749