mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

Add MPI_WIN_ORDER to explicitly order RMA operation #135

Open devreal opened 5 years ago

devreal commented 5 years ago

Problem

Currently, the only way to enforce ordering of arbitrary MPI RMA operations is to synchronize and/or wait for (remote) completion of all previously issued operations before issuing new ones. This completion may require a full round-trip in the network, which induces significant overhead, especially for short communication intervals, e.g., single-value operations. While MPI provides ordering guarantees for atomic RMA operations on overlapping memory regions, there are no ordering guarantees for non-overlapping memory regions or non-atomic operations. However, some algorithms require a specific ordering of arbitrary non-overlapping operations, e.g., to ensure the completion of a data transfer before signalling the availability of an update to the target.
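
For illustration, a minimal sketch of the status quo under passive-target synchronization (window layout, displacements, and buffers are made up for this example): the only portable way to make the data visible before the flag is to wait for remote completion of the data transfer.

```c
#include <mpi.h>

/* Sketch: make a data update visible at the target before publishing a flag.
 * Today the data put must be remotely completed (MPI_Win_flush) before the
 * flag put may be issued. */
void notify_update(MPI_Win win, int target, const double *buf, int count,
                   MPI_Aint data_disp, MPI_Aint flag_disp)
{
    static const int one = 1;
    MPI_Put(buf, count, MPI_DOUBLE, target, data_disp, count, MPI_DOUBLE, win);
    /* Full network round-trip: wait until the data is visible at the target... */
    MPI_Win_flush(target, win);
    /* ...only then publish the flag. */
    MPI_Put(&one, 1, MPI_INT, target, flag_disp, 1, MPI_INT, win);
}
```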

At the same time, modern high-performance networks (e.g., InfiniBand, Cray Aries), communication libraries (UCX), and other PGAS abstractions such as OpenShmem provide ways for inserting ordering requests into a stream of RMA operations, commonly called fences. These capabilities are currently not directly exposed through the MPI RMA interface.

Proposal

This proposal adds a new function MPI_WIN_ORDER to partially order RMA operations issued before and after the call: all operations initiated by the calling process before the call shall complete at the target before any operations the calling process initiates after the call to the same target on the same window. In contrast to a flush or RMA synchronization, a call to MPI_WIN_ORDER does not guarantee remote completion of previously initiated operations upon return. It merely allows users to express their intent to order operations, which may notably reduce latencies and improve communication efficiency when remote completion can be deferred to a later point in time.

The user may specify a combination of the types of operations to include in the ordering: read operations, write operations, and/or data dependencies (read/write operations to overlapping memory regions), which may allow implementations to avoid memory fences in shared memory communication. As an example, x86 processors guarantee sequential consistency of read/write operations to overlapping memory regions but may reorder independent loads and stores, requiring sfence or lfence instructions to force an ordering among them.
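
A sketch of how the proposed function might be used; the MPI_Win_order signature and the value of the MPI_WIN_ORDER_WRITE constant are assumptions based on this proposal and are not part of the MPI standard:

```c
#include <mpi.h>

/* Proposed API -- signature and constant are assumptions for this sketch,
 * not part of the MPI standard. */
int MPI_Win_order(int order_type, MPI_Win win);
#define MPI_WIN_ORDER_WRITE 2   /* placeholder value */

/* Same notification pattern as in the Problem section, but without waiting
 * for remote completion between the data put and the flag put. */
void notify_update_ordered(MPI_Win win, int target, const double *buf, int count,
                           MPI_Aint data_disp, MPI_Aint flag_disp)
{
    static const int one = 1;
    MPI_Put(buf, count, MPI_DOUBLE, target, data_disp, count, MPI_DOUBLE, win);
    /* Order preceding writes before subsequent writes to the same target,
     * without requiring their remote completion here. */
    MPI_Win_order(MPI_WIN_ORDER_WRITE, win);
    MPI_Put(&one, 1, MPI_INT, target, flag_disp, 1, MPI_INT, win);
    /* Remote completion can be deferred, e.g., to a later flush or sync. */
}
```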

Changes to the Text

The proposal adds a new section 11.5.5 Order describing the function MPI_WIN_ORDER and the types of operations that can be included in the ordering. In Section 11.7.2, the proposal distinguishes between implicit ordering guarantees for atomic operations performed on overlapping memory regions and explicit ordering introduced using MPI_WIN_ORDER. Two examples of the use of MPI_WIN_ORDER are included in Section 11.7.

PR: https://github.com/mpi-forum/mpi-standard/pull/114

~Annotated PDF (changes highlighted using ticket135): mpi-report_issue-135_2019-05-13_18.04.pdf~

Annotated PDF (changes highlighted using ticket135): mpi-report_issue-135_2019-06-25_14.54.pdf

Changes on pages: 476-477, 488-490, 701, and 836

Impact on Implementations

Implementations will have to provide the new function MPI_WIN_ORDER. As stated in the Advice to Implementors (AtoI), a conforming implementation may wait for remote completion of previously initiated RMA operations inside the call to MPI_WIN_ORDER. However, hardware fences should be used where possible to reduce the latency of operations between two flushes or synchronization points involving the same target.
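
As a hedged illustration of that AtoI, a trivial (and slow) conforming fallback could simply wait for remote completion, assuming the signature sketched in the Proposal section:

```c
#include <mpi.h>

/* Hedged fallback sketch (assumed signature, not part of the standard):
 * an implementation without hardware fence support may conform by waiting
 * for remote completion of all pending operations on the window, since
 * remote completion trivially implies ordering. */
int MPI_Win_order(int order_type, MPI_Win win)
{
    (void)order_type;              /* ordering type is irrelevant in this fallback */
    return MPI_Win_flush_all(win);
}
```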

Impact on Users

The proposal does not change the semantics currently present in the MPI standard and does not create backwards compatibility issues. The added function MPI_WIN_ORDER will help users express their intent to order RMA operations without the need to wait for immediate remote completion, potentially providing better communication efficiency on some architectures. Users will also be able to use MPI_WIN_ORDER to issue a memory fence to order local or shared memory accesses.

References

Discussion in the RMA-WG: https://github.com/mpiwg-rma/rma-issues/issues/10

The discussion included an example of how fences in OpenShmem may improve the latency of two RMA operations ordered using either quiet (flush) or fence: https://github.com/mpiwg-rma/rma-issues/issues/10#issuecomment-487629268

Documentation of UCX ucp_worker_fence

Description of GNI_RDMAMODE_FENCE in the uGNI reference

Intel Intrinsics Guide, documentation of _mm_[l|s|m]fence

omor1 commented 5 years ago

This capability would definitely be useful—I've fairly recently run into a situation with a lockless distributed work-stealing algorithm where I needed to use MPI_WIN_FLUSH to ensure ordering between atomic operations to different memory locations. An order/fence operation would have been exactly what I needed.
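
Roughly, the pattern looked like this (window layout and types are illustrative, not the actual code):

```c
#include <mpi.h>

/* Illustrative sketch of the pattern: write a task slot, then publish the new
 * tail index. The two atomics target different locations, so an MPI_Win_flush
 * is currently needed between them to guarantee their order at the target. */
void push_task(MPI_Win win, int target, MPI_Aint slot_disp, MPI_Aint tail_disp,
               long task, long new_tail)
{
    MPI_Accumulate(&task, 1, MPI_LONG, target, slot_disp, 1, MPI_LONG,
                   MPI_REPLACE, win);
    MPI_Win_flush(target, win);            /* order: slot write before tail update */
    MPI_Accumulate(&new_tail, 1, MPI_LONG, target, tail_disp, 1, MPI_LONG,
                   MPI_REPLACE, win);
    MPI_Win_flush_local(target, win);      /* safe to reuse the origin buffers */
}
```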

Do the underlying communication fabrics expose more fine-grained ordering semantics? In some algorithms a full barrier may not be necessary. I'm thinking something equivalent to C11/C++11 relaxed/consume/acquire/release/sequential consistency when using atomic types.

Is this intended to be usable with both active and passive target synchronization? Fine-grained atomic operations that might need ordering are more likely to be used with passive target mode, I think.

omor1 commented 5 years ago

Looking at the actual proposal document I see it differentiates between read/write/data orderings. I'm not entirely convinced of the value of MPI_WIN_ORDER_DATA—shouldn't one just use atomics?

devreal commented 5 years ago

@omor1

Do the underlying communication fabrics expose more fine-grained ordering semantics?

From what I have found so far, no: uGNI, ibverbs, and UCX all provide a single type of fence. I'm not an expert on networks though. The proposal does differentiate between data dependencies and independent reads and writes.

I'm thinking something equivalent to C11/C++11 relaxed/consume/acquire/release/sequential consistency when using atomic types.

I was thinking about acquire/release semantics as well but there was a desire in the RMA-WG to allow the ordering of reads and writes to be decoupled, which is not the case for acquire/release.

Is this intended to be useable with both active and passive target synchronization?

It is, yes.

Looking at the actual proposal document I see it differentiates between read/write/data orderings. I'm not entirely convinced of the value of MPI_WIN_ORDER_DATA—shouldn't one just use atomics?

That has been a point of contention in the WG as well. IMO, I should not have to use atomics if I don't need atomic semantics, e.g., if I know there won't be conflicting updates. It has also been pointed out that atomic reads/writes may be more costly on some networks. When using MPI RMA as a communication back-end for higher-level abstractions we have no control over what the user is doing so having a way to order conflicting non-atomic reads and writes comes in handy.

omor1 commented 5 years ago

I was thinking about acquire/release semantics as well but there was a desire in the RMA-WG to allow the ordering of reads and writes to be decoupled, which is not the case for acquire/release.

I think the semantics of read/write ordering are actually similar to those of atomic_thread_fence with memory_order_acquire or memory_order_release, respectively, in C11/C++11. I agree it's not quite the same as an acquire or release load or store, which in my experience I've seen used more often. atomic_thread_fence has stricter synchronization requirements.
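
For reference, the C11 shared-memory analogue would be something along these lines (illustrative only):

```c
#include <stdatomic.h>

/* C11 analogue: a release fence orders all earlier writes before the flag
 * store; an acquire fence on the reader side orders the flag load before
 * later reads. */
extern int data;
extern atomic_int flag;

void publish(int value)
{
    data = value;                                 /* plain write */
    atomic_thread_fence(memory_order_release);    /* "write ordering" */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int consume(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                         /* spin until published */
    atomic_thread_fence(memory_order_acquire);    /* "read ordering" */
    return data;
}
```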

That has been a point of contention in the WG as well. IMO, I should not have to use atomics if I don't need atomic semantics, e.g., if I know there won't be conflicting updates. It has also been pointed out that atomic reads/writes may be more costly on some networks. When using MPI RMA as a communication back-end for higher-level abstractions we have no control over what the user is doing so having a way to order conflicting non-atomic reads and writes comes in handy.

Doesn't MPI_WIN_ORDER_DATA only apply to conflicting updates? I suppose if these are done in phases or batches then they don't individually need to be atomic; I think the RMA-WG issues page had such an example.

devreal commented 5 years ago

I think the semantics of read/write ordering are actually similar to those of atomic_thread_fence with memory_order_acquire or memory_order_release, respectively, in C11/C++11. I agree it's not quite the same as an acquire or release load or store, which in my experience I've seen used more often. atomic_thread_fence has stricter synchronization requirements.

It would of course be possible to add MPI_WIN_ORDER_ACQUIRE and MPI_WIN_ORDER_RELEASE for the sake of semantic compatibility with high-level abstractions.

Doesn't MPI_WIN_ORDER_DATA only apply to conflicting updates?

That is all it does, yes. The reasoning behind MPI_WIN_ORDER_DATA is that in shared memory the implementation likely won't have to issue a memory fence to fulfill this ordering because the CPU takes care of it (well, unless operations to the same shared memory target are in flight in the NIC, in which case those have to complete before returning from MPI_WIN_ORDER). For inter-node communication the implementation will have to take care of the ordering, e.g., by adding a fence flag to the next communication operation. Without MPI_WIN_ORDER_DATA, any call to MPI_WIN_ORDER would require a memory fence in shared memory though.

I suppose if these are done in phases or batches then they don't need individually to be atomic; I think the RMA-WG issues page had such an example.

I'm not sure what you mean by "phases or batches". From the view of a PGAS runtime that uses MPI RMA, there is no control whatsoever over the RMA communication pattern the application exhibits. The behavior has to be assumed to be random reads and writes, which basically requires a flush after each operation to ensure sequential consistency. The use of MPI_WIN_ORDER_DATA would boil this down to one flush per blocking get (or before any non-RMA synchronization).

To lend credence to the claim that atomic operations may be slower than put/get: I'm observing significant differences for larger transfers when comparing accumulate+replace against a put (each followed by a flush): with 2M 64-bit integers (16 MiB), the difference in throughput is >2x on Aries and 15% on a ConnectX-3 cluster in favor of put (measured with Open MPI 3.1.2). So using atomic operations just to get the ordering comes at a cost, and MPI_WIN_ORDER_DATA provides a higher degree of control and flexibility.
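
A rough sketch of the kind of comparison I ran (window setup, datatypes, and timing details are simplified and illustrative, not the actual benchmark):

```c
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NELEM (2 * 1024 * 1024)   /* 2M 64-bit integers = 16 MiB */

/* Simplified comparison of a large transfer via MPI_Put versus
 * MPI_Accumulate with MPI_REPLACE, each remotely completed by a flush. */
void compare_put_vs_accumulate(MPI_Win win, int target)
{
    int64_t *buf = calloc(NELEM, sizeof(*buf));

    double t0 = MPI_Wtime();
    MPI_Put(buf, NELEM, MPI_INT64_T, target, 0, NELEM, MPI_INT64_T, win);
    MPI_Win_flush(target, win);
    double t_put = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Accumulate(buf, NELEM, MPI_INT64_T, target, 0, NELEM, MPI_INT64_T,
                   MPI_REPLACE, win);
    MPI_Win_flush(target, win);
    double t_acc = MPI_Wtime() - t0;

    printf("put: %.3f ms, accumulate(REPLACE): %.3f ms\n",
           t_put * 1e3, t_acc * 1e3);
    free(buf);
}
```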

omor1 commented 5 years ago

It would of course be possible to add MPI_WIN_ORDER_ACQUIRE and MPI_WIN_ORDER_RELEASE for the sake of semantic compatibility with high-level abstractions.

I don't think we need to bikeshed the naming (and I'm actually happier with MPI_WIN_ORDER_READ and MPI_WIN_ORDER_WRITE, it's more obvious what's going on).

I'm not sure what you mean by "phases or batches".

I meant that one does e.g. a bunch of puts to disparate non-overlapping memory locations, some synchronization operation, followed by a bunch of puts to memory locations overlapping the first batch. Something like @pavanbalaji's example.

If using atomic operations just for ordering (e.g. MPI_ACCUMULATE with MPI_REPLACE or MPI_GET_ACCUMULATE with MPI_NO_OP) then using MPI_PUT/MPI_GET with MPI_WIN_ORDER is absolutely a performance win. Whether or not it's as useful when using atomic operations for their atomicity (i.e. some other operation, e.g. MPI_SUM) is less straightforward, though it could potentially still be a win when the accumulate_ordering is set to something weaker than the default.
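
For reference, the accumulate_ordering relaxation is done via the standard info key at window creation; a minimal sketch:

```c
#include <mpi.h>

/* Relax the default accumulate ordering via the standard "accumulate_ordering"
 * info key (default "rar,raw,war,waw"; "none" disables ordering). */
MPI_Win create_relaxed_window(void *base, MPI_Aint size, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Win  win;

    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ordering", "none");
    MPI_Win_create(base, size, /* disp_unit = */ 1, info, comm, &win);
    MPI_Info_free(&info);
    return win;
}
```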

devreal commented 5 years ago

Updated the proposal after taking a critical look at it again. It mostly rephrases some ambiguous aspects and (most notably) removes the possibility to combine MPI_WIN_ORDER_* options, treating them instead as discrete values to avoid some semantic issues.

The PR has been updated accordingly. Updated annotated PDF for the reading at the virtual meeting on July 3: mpi-report_issue-135_2019-06-25_14.54.pdf

artpol84 commented 5 years ago

As we discussed with @devreal offline, I would like to bring a threading aspect into the discussion. It would be nice to define the scope of Win_order: process-wide or thread-wide. As currently written, I believe it has process-level scope. This will lead to a very heavyweight operation for implementations that use per-thread HW resources to perform RMA operations.

For example, in the Open MPI development branch we have two UCX-based implementations that use multiple "worker" instances to improve the operation rate. Because of the Win_flush operation, both have to lock those workers to preserve consistency. The same would have to be done for Win_order if it is process-wide.

There is a threading discussion in https://github.com/mpi-forum/mpi-issues/issues/117 where the semantics of "local concurrency" are considered. Alternative interpretations exist (https://github.com/mpi-forum/mpi-issues/issues/117#issuecomment-445895201).

One of the options considers operations happening in concurrent threads to be "logically concurrent" even if they are actually ordered by inter-thread synchronization. If the Forum goes this route, it will/should affect this proposal as well, because it will complicate the semantics of ordering between threads.

Just a random thought on this:

omor1 commented 5 years ago

@artpol84 you bring up a good point. Perhaps that can be specified with an assert on the operation or with an info key on the window? There are potential use-cases for process-wide ordering that we may want to support.

devreal commented 5 years ago

I think the problem @artpol84 is referring to is that even the potential of requesting cross-thread RMA operation ordering may have performance implications. It could 1) prevent threads from using their own queue; or 2) require MPI_Win_order to perform a flush on all thread-local queues for process-wide operation ordering, rendering it rather useless in that case.

In contrast to a flush, the main use-case of MPI_Win_order is probably thread-local operation ordering. However, the MPI standard currently does not have a notion of thread-local afaics, as the process is considered the communicating entity, not the thread. MPI operations issued by multiple threads are just said to occur in some order within the process.

The same question of thread-local vs process-wide should be discussed for flush and I think @hjelmn has been working in that direction.

hjelmn commented 5 years ago

Indeed. I intend to start discussing adding a thread-local flush at the December meeting. I won't be in Zurich or I would present it there.

artpol84 commented 5 years ago

Do you want to work together on it? I’ll be able to present on your behalf.

hjelmn commented 5 years ago

@artpol84 Sure, that would be great. I have a working prototype implemented in btl/uct, btl/ugni, and osc/rdma. The thread-local completion gives a decent boost (10-20%; it has been a while since I ran it) for the rma-mt small message rate on Cray Aries @ 32 threads and 1 device context per thread. I have time to spend on the proposal starting next week.

artpol84 commented 5 years ago

Great! If you can share the prototype, we can probably run it on our hardware to demonstrate the impact on multiple platforms.

artpol84 commented 5 years ago

BTW, what was the benchmark where you got this speedup? I'd expect a higher improvement based on my observations with our MT.ComB benchmark (https://github.com/Mellanox/MT.ComB). It has a mode where you run directly on top of UCX/ucp and each thread does a separate flush. I think the performance in this case was of the same order of magnitude as the NIC's hardware capabilities. Do you still have some synchronization requirements in your POC?

artpol84 commented 5 years ago

@devreal, I had an internal discussion with my colleagues and here are some more comments.

According to the motivation statement:

At the same time, modern high-performance networks (e.g., InfiniBand, Cray Aries), communication libraries (UCX), and other PGAS abstractions such as OpenShmem provide ways for inserting ordering requests into a stream of RMA operations, commonly called fences.

But this holds only for common configurations that use a single channel (i.e., a single QP). In the general case it is not true. For example:

  • Existing software optimization techniques that operate on multiple channels will require SW-level ordering (most likely a flush):
    • multi-rail
    • per-thread channels (as per the previous discussion)
    • use of separate channels for atomics and puts.
  • Hardware-level features:
    • Adaptive routing
    • PCI atomics are not ordered with respect to regular writes.

artpol84 commented 5 years ago

Regarding the example here, it looks like a good use-case for OSHMEM put-with-signal (see the proposal).

If multiple puts are required, another OSHMEM variation, put-with-count, can be used to make sure that all puts are completed.

Those features can be efficiently implemented in HW and should help with the problem.

devreal commented 5 years ago

@artpol84 Thanks a lot for the valuable feedback, there seem to be some issues that need to be addressed.

  • Existing software optimization techniques that operate on multiple channels will require SW-level ordering (most likely a flush):
    • multi-rail
    • per-thread channels (as per previous discussion)
    • use of separate channels for atomics and puts.

The per-thread channels are an issue that we can probably address in the specification of MPI_WIN_ORDER, i.e., by specifying that it only applies to a stream of operations issued by the same thread. That, of course, depends on the outcome of the discussion on thread-local RMA synchronization.

For the other two optimizations there seems to be a trade-off in two dimensions: throughput versus the ability to use hw- rather than sw-level ordering. This could be addressed using an info key on the window that lets the user prioritize (something like opt_priority set to either throughput or synchronization; I'm sure there are use-cases for both options). Would that help address these concerns? Please let me know if my understanding of this is still too simple :) IMHO, the potential for optimizations on some platforms should not rule out this feature entirely, but these issues need to be addressed in some way. I will work on the proposal text later this week and incorporate something along these lines, e.g., the sketch below.
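
As a sketch, such a (purely hypothetical, not standardized) info key might look like this:

```c
#include <mpi.h>

/* Hypothetical "opt_priority" info key from the suggestion above -- not part
 * of the MPI standard or the current proposal text. */
void create_win_prefer_ordering(void *base, MPI_Aint size, int disp_unit,
                                MPI_Comm comm, MPI_Win *win)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "opt_priority", "synchronization"); /* or "throughput" */
    MPI_Win_create(base, size, disp_unit, info, comm, win);
    MPI_Info_free(&info);
}
```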

  • Hardware-level features:
    • Adaptive routing
    • PCI atomics are not ordered with respect to regular writes.

I am not sure why these are issues specific for MPI. Shouldn't these be issues for UCX and other underlying network abstractions providing the fence semantic in the first place?

Regarding the example here it looks like a good use-case for OSHMEM put-with-signal

I was not aware of the put-with-signal proposal in OSHMEM, and it indeed covers some of the use-cases for ordered RMA operations. I understand that it is easier to implement and less heavyweight than a fence, but it does not cover all the use-cases of a fence. It seems too specific a case to justify introducing new MPI functionality for it (although I am not saying that having such operations wouldn't be valuable).

artpol84 commented 5 years ago

@devreal

  • Hardware-level features:
    • Adaptive routing
    • PCI atomics are not ordered with respect to regular writes.

I am not sure why these are issues specific for MPI. Shouldn't these be issues for UCX and other underlying network abstractions providing the fence semantic in the first place?

The point there was that the claim in the initial statement

At the same time, modern high-performance networks (e.g., InfiniBand, Cray Aries), communication libraries (UCX), and other PGAS abstractions such as OpenShmem provide ways for inserting ordering requests into a stream of RMA operations, commonly called fences.

does not hold in the general case, and there are many scenarios where MPI_Win_order will just be a duplicate of MPI_Win_flush. And you cannot assume that even on InfiniBand the fence will be a no-op (e.g., in the case of adaptive routing and/or PCI atomics). I was also told that for Cray Aries the fence is not lightweight, though we need to confirm this with Cray.

Having said that: can we do something lighter than that at the API level?

devreal commented 5 years ago

@artpol84 I agree that it does not make sense to introduce functionality that cannot work the way it's intended to.

Can we do something lighter than that at the API level?

I'm sure the answer is yes :) However, finding a better abstraction requires details on the hardware parameters that set the limits on operation ordering: what mix of operations can be efficiently ordered, and under which circumstances? What level of control does the software layer have over the features that limit ordering? What low-level operations does the hardware offer? If all we can do is have a transfer-signalling mechanism such as OSHMEM's put-with-signal, then that is still better than nothing, but maybe a more general abstraction is possible? Can you point me to some resources that lay out these parameters on the Mellanox side? At least in the PR for put-with-signal there is little information afaics.

jeffhammond commented 1 year ago

Currently, the only way to enforce ordering of arbitrary MPI RMA operations is to synchronize and/or wait for (remote) completion of all previously issued operations before issuing new operations.

Ordered Put = MPI_Accumulate(MPI_REPLACE)
Ordered Get = MPI_Get_accumulate(MPI_NO_OP)

No synchronization is required to realize, e.g. the ARMCI "location consistency" ordering semantic.

Ordered accumulate is the default, although it can be relaxed (via the accumulate_ordering info key).
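
A minimal sketch of that approach (window and displacements are illustrative; completion still requires the usual flush or synchronization):

```c
#include <mpi.h>
#include <stddef.h>

/* Ordered put/get expressed with accumulates, relying on MPI's default
 * same-location accumulate ordering. */
void ordered_put(MPI_Win win, int target, MPI_Aint disp, const double *src, int n)
{
    MPI_Accumulate(src, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE,
                   MPI_REPLACE, win);
}

void ordered_get(MPI_Win win, int target, MPI_Aint disp, double *dst, int n)
{
    /* Origin buffer is not used with MPI_NO_OP. */
    MPI_Get_accumulate(NULL, 0, MPI_DOUBLE, dst, n, MPI_DOUBLE,
                       target, disp, n, MPI_DOUBLE, MPI_NO_OP, win);
}
```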

devreal commented 1 year ago

The problem with accumulate ordering is that it applies only to operations on the same variable. The goal of this proposal was to provide ordering between operations to different memory locations without blocking on a flush. I don't think this proposal will go anywhere, but it's in the back of my mind for our future work.