mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

Deprecate (and remove) MPI_Accumulate #24

Open devreal opened 1 year ago

devreal commented 1 year ago

This was mentioned during the WG call today by @jeffhammond and after some thought I really like the idea, so I put down my thoughts on it.

MPI_Accumulate is the root of all evil when it comes to atomic operation performance in MPI. It allows users to mutate an unbounded number of elements with element-wise atomicity guarantees, which span all accumulation functions (incl. the single-element MPI_Fetch_and_op). No hardware in existence today (and likely none in the future) will provide efficient atomic accumulation of more than a few elements at a time, forcing implementations to fall back to software emulation to guarantee atomicity between MPI_Accumulate and MPI_Fetch_and_op. This prevents MPI_Fetch_and_op from making proper use of network hardware and has been a source of great frustration.
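
To make the conflict concrete, here is a hypothetical pattern (buffer names and window setup are made up for illustration) in which a large MPI_Accumulate and a single-element MPI_Fetch_and_op target overlapping locations in the same window; the standard requires element-wise atomicity between the two, which is what forces the software fallback:

```c
/* Hypothetical usage pattern (window setup omitted): 'win' is assumed to
 * expose an array of MPI_INT64_T on rank 0. */

/* Process A: bulk element-wise atomic update of one million elements. */
MPI_Accumulate(local_buf, 1000000, MPI_INT64_T,
               /* target_rank = */ 0, /* target_disp = */ 0,
               1000000, MPI_INT64_T, MPI_SUM, win);

/* Process B: single-element atomic fetch-and-add on element 42 of the
 * same region -- required to be atomic with respect to every element
 * touched by A's accumulate, so it cannot simply map to a network AMO. */
int64_t one = 1, old;
MPI_Fetch_and_op(&one, &old, MPI_INT64_T,
                 /* target_rank = */ 0, /* target_disp = */ 42,
                 MPI_SUM, win);
```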

In essence, the MPI standard contains a function that prevents us from using low-level hardware features. It has spurred a line of proposals to mitigate its impact (https://github.com/mpiwg-rma/rma-issues/issues/8, https://github.com/mpi-forum/mpi-standard/pull/93) that went nowhere and are merely band-aids. It's also one of the main drivers for the new allocation function (https://github.com/mpiwg-rma/rma-issues/issues/22). Instead of spending another decade trying to overcome these shortcomings, we should remove multi-element accumulate.

But I want to accumulate megabytes of data?!

Sure, MPI RMA provides you with all the functions needed to implement get-reduce-put with support from the hardware for data movement. We also provide mutual exclusion. With continuations (https://github.com/mpiwg-hybrid/mpi-standard/pull/1), you could even do that without blocking on the get or put. Or you can implement something akin to AMs using send/recv, if that fits your needs. A function that cannot make (and inhibits) proper use of hardware capabilities has no place in an API that aims at exposing low-level hardware features. You wouldn't accept a language that cannot make use of CPU AMOs for any reasonable system-level coding, either.
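
For illustration, a minimal sketch of such a get-reduce-put under an exclusive lock, assuming a window over an array of doubles (the function name and parameters are made up for the example):

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch: emulate a bulk "accumulate" as get -> local reduce -> put,
 * serialized with an exclusive lock on the target. Assumes 'win' exposes
 * an array of doubles on 'target' and 'contrib' holds 'count' values. */
void get_reduce_put(const double *contrib, int count, int target,
                    MPI_Aint disp, MPI_Win win)
{
    double *tmp = malloc(count * sizeof(double));

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Get(tmp, count, MPI_DOUBLE, target, disp, count, MPI_DOUBLE, win);
    MPI_Win_flush(target, win);            /* ensure the get has completed */
    for (int i = 0; i < count; i++)        /* reduce locally */
        tmp[i] += contrib[i];
    MPI_Put(tmp, count, MPI_DOUBLE, target, disp, count, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);           /* completes the put */

    free(tmp);
}
```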

To summarize:
1. Deprecate MPI_Accumulate, MPI_Raccumulate, MPI_Get_accumulate, and MPI_Rget_accumulate.
2. Introduce a request-based fetch-op (https://github.com/mpi-forum/mpi-standard/pull/107) to provide an alternative to MPI_Rget_accumulate for single elements.
3. To bridge the time until removal, add an info assertion that you won't use MPI_Accumulate anymore so that we can ignore it (sketched below).
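
The assertion in item 3 could look something like this at window creation; the info key name here is purely hypothetical, nothing the standard defines today:

```c
/* Hypothetical: promise that MPI_Accumulate (and friends) will never be
 * called on this window, so single-element atomics can map directly to
 * network AMOs. The key name is made up for illustration. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "mpi_assert_no_multi_element_accumulate", "true");

MPI_Aint count = 1 << 20;   /* number of int64_t elements (example) */
void    *baseptr;
MPI_Win  win;
MPI_Win_allocate(count * sizeof(int64_t), sizeof(int64_t), info,
                 MPI_COMM_WORLD, &baseptr, &win);
MPI_Info_free(&info);
```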

jeffhammond commented 1 year ago

If we remove accumulate, we will have to make a lot of edits wherever we say "accumulate" but mean "atomics", in order for the text to still make sense.

devreal commented 1 year ago

You're right, we use "accumulate" throughout the chapter to say "atomic memory operations", which needs some adjustment.

I've been thinking about whether we should provide a replacement with a separate atomicity domain, along the lines of MPI_Get_op (replacing MPI_Get_accumulate) and MPI_Put_op (replacing MPI_Accumulate). One reason we might want to keep multi-element atomic functions is to leverage specialized hardware (SmartNICs?) for the accumulations, which is arguably harder to do (portably) from the application. The difference from MPI_Accumulate is that the new functions would not provide atomicity guarantees with the single-element functions MPI_Fetch_and_op and MPI_Compare_and_swap.
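
As a strawman, the signatures could simply mirror the existing accumulate calls; these prototypes are only a sketch of the idea, not proposed standard text:

```c
/* Strawman prototypes (not part of any standard): element-wise atomic
 * multi-element updates in their own atomicity domain, i.e. NOT atomic
 * with respect to MPI_Fetch_and_op / MPI_Compare_and_swap. */
int MPI_Put_op(const void *origin_addr, int origin_count,
               MPI_Datatype origin_datatype, int target_rank,
               MPI_Aint target_disp, int target_count,
               MPI_Datatype target_datatype, MPI_Op op, MPI_Win win);

int MPI_Get_op(const void *origin_addr, int origin_count,
               MPI_Datatype origin_datatype, void *result_addr,
               int result_count, MPI_Datatype result_datatype,
               int target_rank, MPI_Aint target_disp, int target_count,
               MPI_Datatype target_datatype, MPI_Op op, MPI_Win win);
```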

Alternatively, if we want to salvage MPI_Accumulate, we could provide a way for users to assert that they don't need atomicity between multi-element and single-element atomic operations. This would be an opt-out that enables us to use network AMOs for single-element operations. It seems reasonable to expect that users use either single-element or multi-element operations, but not mix them. And if they do, it's easy to replace one with the other. And we wouldn't need to go through the pains of expressing the anticipated usage as in #22.

jdinan commented 1 year ago

Accumulate makes more sense in a window synchronization model that doesn't require the network to provide atomicity. For example, active target synchronization, where accumulates can be performed by the target process in software, or exclusive lock/unlock, where the window synchronization itself provides atomicity. If we remove these synchronization models (which I think we should), then I think accumulate is on shaky ground.

It's worth mentioning that atomic read-modify-write over PCIe (i.e. RMW performed by the NIC) is very much rate limited by PCIe and does not saturate the network. This is where @jeffhammond chimes in to point out that the origin process can perform get-modify-put to avoid RMW over PCIe, to which I say, let the application do this and optimize it for their window synchronization model rather than building it into MPI.

jeffhammond commented 1 year ago

Accumulate has never required networks to provide atomicity, and I don't know a single implementation in the history of RMA that has used network atomicity effectively. It has always been an active message or - as I learned recently - lock-get-reduce-put-unlock (not win lock, but a memory lock inside the implementation).

I don't really see why applications should need to implement RMA themselves because the MPI implementation community refuses to avail itself of techniques used in ARMCI 20 years ago. MPICH does a fine job with active messages and locking except that (1) asynchronous progress in active messages is rarely available and (2) MPICH RMA locks the entire process instead of doing memory address range locks, which is what ARMCI has done for decades.

I suppose it's my fault for using 1 MPI window per data structure, and I should instead allocate nbytes/4K windows and do my own locking, but then MPI implementers are going to say I'm a perverse user for allocating 87,000 windows and tell me that's the reason I don't deserve performance.

At this point, I think it's fine to deprecate MPI accumulate, but I insist that the MPI Forum write that it is because the implementers have failed throughout history, and not because it's impossible, and then cite ARMCI and Casper papers showing what could have been done if only implementers had actually cared.

jdinan commented 1 year ago

@jeffhammond Apologies, should have stated my assumption -- My concern is that accumulate is atomic with respect to scalar atomics like MPI_Fetch_and_op and MPI_Compare_and_swap. Or am I misremembering the atomic semantics?

jeffhammond commented 1 year ago

Well, it's atomic if and only if it's the same type and same_op_no_op applies. MPI_Compare_and_swap cannot be the same op as accumulate, ever.

wgropp commented 1 year ago

I agree with Jeff here on the implementations, including the memory address range locks.

Maybe it's time to back up and say: since no one will make a serious effort to implement what MPI defines for RMA, maybe we should ask "what has been implemented (e.g., in ARMCI)?" and only require that. At least we would get something useful, if possibly not as general.

devreal commented 1 year ago

As I mentioned, there is a benefit to having bulk-accumulate functions in that the MPI implementation can potentially offload computation to hardware that is either hard(er) for the applications to access or not portable (DPUs, for example). Plus, these updates would still be progressed by MPI instead of users implementing some blocking schemes (like what I suggested initially). It's important though that these operations have a separate atomicity domain and don't interfere with the single-element atomic updates.

jeffhammond commented 1 year ago

Based on everything I know now, I'd use Send+Probe à la active messages to do GA/ARMCI accumulate, because the eager protocol takes care of latency, and implementations are so bad at progress that it just makes sense to make the remote agency explicit to keep everybody honest.

I'm going to work on a prototype of this, to measure the cost of actually removing Accumulate, if we decide to go down that path.
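
A rough sketch of what the target side of such a prototype might look like, using matched probe/receive and assuming a made-up payload layout of one int64_t displacement followed by doubles to be summed into a local array (the tag, shutdown handling, and buffer management are placeholders):

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ACC_TAG 77  /* arbitrary tag for "accumulate" messages (example) */

/* Target-side loop of an AM-style accumulate: origins send
 * [int64_t disp][double values...]; the target probes, receives, and
 * applies the reduction locally, so the remote agency is explicit. */
void am_accumulate_server(double *local_array, MPI_Comm comm)
{
    for (;;) {
        MPI_Message msg;
        MPI_Status  status;
        MPI_Mprobe(MPI_ANY_SOURCE, ACC_TAG, comm, &msg, &status);

        int nbytes;
        MPI_Get_count(&status, MPI_BYTE, &nbytes);
        char *buf = malloc(nbytes);
        MPI_Mrecv(buf, nbytes, MPI_BYTE, &msg, MPI_STATUS_IGNORE);

        int64_t disp;
        memcpy(&disp, buf, sizeof(disp));
        const double *vals  = (const double *)(buf + sizeof(disp));
        int           count = (nbytes - (int)sizeof(disp)) / (int)sizeof(double);

        for (int i = 0; i < count; i++)   /* the actual "accumulate" */
            local_array[disp + i] += vals[i];

        free(buf);
    }
}
```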