mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

Proposed RMA Overhaul #25

Open jdinan opened 1 year ago

jdinan commented 1 year ago

Problem Statement

RMA is too complicated. As a result, few high-quality implementations exist, user adoption suffers, and the chapter lags behind the rest of the MPI specification in adopting new terms and other specification-wide changes.

Proposed Changes

Deprecate/remove:

  1. Active target synchronization
  2. Lock/unlock synchronization
  3. Separate memory model
  4. Displacement units
  5. Accumulate operations

Introduce

A new window flavor that:

  1. Creates the window already in the lock-all state
  2. Extends window destruction to perform the semantic equivalent of unlock-all
  3. Allows the window handle to be duplicated so that threads can each have independent views of window memory
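In terms of today's MPI-3 API, the proposed flavor essentially folds the lock-all/unlock-all pair into window creation and destruction. A minimal sketch of the boilerplate it would eliminate, written with existing calls:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double *base;
    MPI_Win win;
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    /* Today: every process must enter the lock-all state explicitly.
     * Proposed: the window is created already in this state. */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* ... MPI_Put / MPI_Get / MPI_Win_flush as needed ... */

    /* Today: the passive-target epoch must be ended before freeing.
     * Proposed: MPI_Win_free performs the semantic equivalent of unlock-all. */
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```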

A new window synchronization operation:

  1. A collective "barrier" (or re-definition of MPI_Win_fence) that flushes pending communication and synchronizes processes (similar to a SHMEM barrier)
  2. We can add an info assertion that all synchronization is done with such a function to enable active target synchronization optimizations, if desired
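Under the unified memory model, the proposed collective "barrier" can be approximated today by a flush-all followed by a barrier; the new operation would make this single call (a sketch, not a proposed binding):

```c
#include <mpi.h>

/* Sketch: the proposed collective "barrier" expressed with existing calls.
 * Assumes the window is in a passive-target (lock-all) epoch and the
 * unified memory model. */
void rma_barrier(MPI_Win win, MPI_Comm comm) {
    MPI_Win_flush_all(win); /* complete all RMA issued by this process */
    MPI_Barrier(comm);      /* synchronize processes, as in shmem_barrier_all */
}
```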

Atomic APIs

  1. Element-wise atomics to replace operations lost when we remove MPI_Accumulate and MPI_Get_accumulate (e.g. non-fetching atomics)
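For context, a non-fetching atomic update is currently spelled as a one-element accumulate; an element-wise atomic would replace the datatype/count machinery with a single typed call. The `MPIX_` name below is invented purely for illustration:

```c
#include <mpi.h>

/* Today: a non-fetching atomic add is a one-element MPI_Accumulate. */
void atomic_add_long(long value, int target_rank, MPI_Aint disp, MPI_Win win) {
    MPI_Accumulate(&value, 1, MPI_LONG, target_rank, disp, 1, MPI_LONG,
                   MPI_SUM, win);
}

/* A hypothetical element-wise replacement (name invented here) might drop
 * the datatype and count arguments entirely:
 *
 *   MPIX_Atomic_add_long(value, target_rank, disp, win);
 */
```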

Fix

  1. Same-op-no-op accumulate semantics
devreal commented 1 year ago

For 7), I propose MPI_Win_hedge (a natural fence). Just to confuse everyone with a new term and because all other terms are already taken in MPI...

Let me add my wish list here:

8) P2P window functionality that allows a process to allocate memory, expose it, and send a handle to another process, which then creates a window for RMA communication. Like Send/Recv, but without the tag-matching and rendezvous-protocol overheads, and with random-access capability within the exposed memory. Yes, you can allocate all memory in a window and use a custom allocator, but that is cumbersome and collides with any parts of the application that don't use this allocator.
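One way this P2P window idea might look in code; all `MPIX_*` names and the portable-handle type are invented here to illustrate the shape of the API, not proposal text:

```c
#include <mpi.h>

/* All MPIX_* names below are hypothetical, invented for illustration. */

/* Producer: expose a buffer and send a portable registration handle,
 * which carries enough information (address, rkey, ...) for direct RDMA. */
void producer(void *buf, MPI_Aint size, int consumer, MPI_Comm comm) {
    MPIX_Mem_handle h;                    /* hypothetical opaque, sendable handle */
    MPIX_Mem_expose(buf, size, comm, &h);
    MPI_Send(&h, sizeof(h), MPI_BYTE, consumer, 0, comm);
}

/* Consumer: turn the received handle into a window for one-sided access,
 * with no collective window construction or tag matching involved. */
void consumer(int producer_rank, MPI_Comm comm, MPI_Win *win) {
    MPIX_Mem_handle h;
    MPI_Recv(&h, sizeof(h), MPI_BYTE, producer_rank, 0, comm,
             MPI_STATUS_IGNORE);
    MPIX_Win_from_handle(h, comm, win);
}
```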

jdinan commented 1 year ago

@devreal This is essentially what dynamic windows are supposed to provide. The producer process attaches memory to the window and the MPI library has some protocol for exchanging memory registration handles with all peers in the background. I guess the difference you are proposing is that MPI_Win_attach should return an opaque handle that the application distributes and is used by consumers to attach that memory to the window on their side? Or was your proposal more to do with the data transfer operations?
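For reference, the dynamic-window pattern being discussed looks roughly like this today. Note that only a raw address travels between peers; the RDMA registration handle stays inside MPI, which is why an implementation may need extra round-trips to the target:

```c
#include <mpi.h>
#include <stdlib.h>

/* Producer: attach memory to a dynamic window and publish its address. */
void produce(MPI_Win dynwin, int consumer, MPI_Comm comm) {
    double *buf = malloc(1024 * sizeof(double));
    MPI_Win_attach(dynwin, buf, 1024 * sizeof(double));

    MPI_Aint addr;
    MPI_Get_address(buf, &addr);
    /* Only the raw address is exchanged -- not a registration handle. */
    MPI_Send(&addr, 1, MPI_AINT, consumer, 0, comm);
}

/* Consumer: use the received address as the target displacement. */
void consume(MPI_Win dynwin, int producer_rank, MPI_Comm comm) {
    MPI_Aint addr;
    MPI_Recv(&addr, 1, MPI_AINT, producer_rank, 0, comm, MPI_STATUS_IGNORE);

    double val;
    MPI_Win_lock(MPI_LOCK_SHARED, producer_rank, 0, dynwin);
    MPI_Get(&val, 1, MPI_DOUBLE, producer_rank, addr, 1, MPI_DOUBLE, dynwin);
    MPI_Win_unlock(producer_rank, dynwin);
}
```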

devreal commented 1 year ago

Yes, that's what dynamic windows were meant to provide but in practice their performance is sub-par, mostly because a pointer doesn't provide enough information to utilize RDMA without querying the target for the registration handle. So you end up with extra round-trips on each access. I'd rather hoist moving the registration handle out of MPI and have the application manage that, instead of relying on MPI to eventually figure it out (which it hasn't, after over a decade).

devreal commented 1 year ago

Note from the January 19 meeting:

devreal commented 1 year ago

Another idea I had:

Tying datatypes to windows (or access objects) instead of operations (datatypes in MPI are expensive to parse and don't transfer well to devices for example). Restricting the set of datatypes used at window creation time (https://github.com/mpiwg-rma/rma-issues/issues/22) might not be enough since we would still need to pick a type in the operation. Windows views with a single type might be an option. The operation then just specifies an offset and count but not a type.
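A sketch of what such a window view might look like; the `MPIX_*` names are invented here for illustration. The type is bound once, so the implementation can parse it ahead of time (and, e.g., stage it on a device), and operations carry only an offset and a count:

```c
#include <mpi.h>

/* Hypothetical sketch -- MPIX_* names are invented for illustration. */
void view_example(MPI_Win win, const double *origin_buf) {
    MPIX_Win_view view;
    MPIX_Win_create_view(win, MPI_DOUBLE, &view); /* type fixed once */

    /* Operations then specify only offset + count, no datatype argument. */
    MPIX_Put_view(origin_buf, /* count  */ 128,
                  /* target rank        */ 1,
                  /* offset in elements */ 4096, view);
}
```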

jeffhammond commented 1 year ago

RMA operations that only take a count + predefined datatypes would be useful.

tschuett commented 1 year ago

For scientific articles, surprise is great. You managed to do that? Wow. For performance, it is the opposite. You don't want to surprise MPI. MPI: You want me to do what? This is the story of persistent collectives. An MPI_Start does not surprise MPI anymore. MPI: I knew that was coming. Maybe you can achieve the same for RMA: register ops at window creation. Then you can spend the rest of the day with MPI_RMA_Start. MPI: I knew that this was coming.

devreal commented 1 year ago

I have been thinking about persistent RMA, but what stops me every time is that it wouldn't be much different from partitioned P2P. I'd like to retain some level of flexibility (random access, arguably one of the strengths of RMA) while cutting out the parts that are hard to optimize. But I am open to being convinced that persistent RMA is actually useful :)

tschuett commented 1 year ago

I see. This sounds like partial persistence. The expensive part, the surprise, is persistent, while the cheap part, the random offsets, is not.

jeffhammond commented 1 year ago

RMA is already persistent in most ways. All of the memory registration is persistent (except perhaps in dynamic windows, which we should fix). I suppose there are cases where networks need to register peers and that is not guaranteed to happen during window construction, but there is nothing preventing an implementation from doing so. If they don't, it's likely because peer registration caching is a memory hog. We could solve that by adding collective lock all (which was proposed during MPI-3 by Charles Archer, IIRC) that would make it possible to make the communication peers persistent for an epoch.

jeffhammond commented 1 year ago

A large number of RMA use cases, including NWChem and many traditional PGAS workloads where SHMEM and UPC are used, have random access patterns. It is contrary to these use cases to attempt to make offsets persistent. I'd argue that those use cases are better matched to persistent send-recv anyways.

tschuett commented 1 year ago

If I have 1,000,000 ranks, it is cheaper to promise MPI: I will only talk to these x ranks, where x is small.

jeffhammond commented 1 year ago

I want to see the details of that argument. It's not valid for Blue Gene or Cray Aries networks, for example. If it's true for Slingshot or Mellanox, please walk me through why.

devreal commented 1 year ago

All of the memory registration is persistent

That is true for the target memory but not for the origin, which can be any buffer. My understanding is that some networks require registration of the origin buffer, too. Would it help (especially on accelerators) to get guarantees from the user on the memory range used?

tschuett commented 1 year ago

I believe that you are mixing two concepts. Slingshot and Mellanox are RoCE and InfiniBand, respectively; thus, you need connections for communication. The other challenge is memory registration and the exchange of keys. There is a notable difference between exchanging keys with 1,000,000 ranks and with a small number x of ranks. Libfabric pretends that Aries does not need connections, but I believe underneath it does use connections.

devreal commented 1 year ago

We're kinda digressing in this issue. I'd prefer to keep this rather clean and focus on features. Otherwise it will get too crowded, which makes it hard to find the important bits later.

tschuett commented 1 year ago

If partial persistence is a thing, then MPI supports the full range from no persistence to full persistence. The user gives as much information as possible at window creation: the offsets will be random; I will only talk to these ranks; I don't yet know which ranks I will talk to; the datatype is fixed at window creation.