mpiwg-rma / rma-issues

Repository to discuss internal RMA working group issues

Expose hardware fence capabilities #10

Open devreal opened 5 years ago

devreal commented 5 years ago

Currently the MPI standard does not provide any ordering guarantees for puts and gets unless remote completion is performed in between two operations. For atomic operations, only accesses to same/overlapping memory regions are guaranteed to be ordered (if origin and target are the same).

As far as I can see, many (most?) high-performance networks provide the notion of a fence to inject an ordering request into a stream of operations, i.e., to ensure that one operation is completed before the next is performed without having to wait for remote completion at the origin.

Example:

a = 0;
PUT(1->win, 1);
FENCE();
GET(1->win, &a);
FLUSH();
assert(a == 1);

This may be more efficient than having two flushes because there is only one round-trip instead of two, which is especially useful for latency-bound use cases (single-value atomics/reads/writes) and for ensuring memory consistency in frameworks built on top of MPI RMA.

Both OpenSHMEM and UCX offer this functionality.

Hence my question: has this been discussed (and dismissed?) in the past? Is there a reason not to add something like MPI_Win_order that triggers a hardware fence if available (since the name MPI_Win_fence is already taken)? On networks that don't support fences in hardware, implementations can always fall back to a full flush, which is what the user would otherwise have to do at the moment.
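
To make the comparison concrete, here is a minimal sketch of what the example above requires with today's MPI, assuming a passive-target epoch (e.g., via MPI_Win_lock_all) is already open on win; the comment marks where the hypothetical ordering call would go:

    #include <mpi.h>
    #include <assert.h>

    /* Window "win" exposes at least one MPI_INT per process, initialized to 0. */
    void put_then_get(MPI_Win win, int target)
    {
        int one = 1, a = 0;

        MPI_Put(&one, 1, MPI_INT, target, 0 /* disp */, 1, MPI_INT, win);
        /* Today a flush (remote completion, one round-trip) is needed here;
           a hypothetical MPI_Win_order(target, win) would merely request ordering. */
        MPI_Win_flush(target, win);
        MPI_Get(&a, 1, MPI_INT, target, 0 /* disp */, 1, MPI_INT, win);
        MPI_Win_flush(target, win);

        assert(a == 1);
    }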

jeffhammond commented 5 years ago

I would just write this:

ACCUMULATE(op=REPLACE)
GET_ACCUMULATE(op=NO_OP)

This is ordered, and the implementation uses the best method it can for ordering. If there's a hardware fence operation (there almost certainly isn't), then it can inject that after the ACCUMULATE or before the GET_ACCUMULATE.
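
A sketch of that pattern with the existing calls (assuming an open passive-target epoch and an MPI_INT at displacement 0 on the target): MPI_REPLACE acts as an atomic put, MPI_NO_OP as an atomic get, and the two are ordered by default.

    #include <mpi.h>
    #include <assert.h>

    void ordered_write_then_read(MPI_Win win, int target)
    {
        int one = 1, a = 0, dummy = 0;

        /* Atomic write: replace the value at the target. */
        MPI_Accumulate(&one, 1, MPI_INT, target, 0, 1, MPI_INT, MPI_REPLACE, win);
        /* Atomic read: fetch the value without modifying it; ordered after the
           replace under the default accumulate_ordering. */
        MPI_Get_accumulate(&dummy, 1, MPI_INT, &a, 1, MPI_INT,
                           target, 0, 1, MPI_INT, MPI_NO_OP, win);
        MPI_Win_flush(target, win);

        assert(a == 1);
    }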

pavanbalaji commented 5 years ago

No, the proposed approach is more flexible. For example,

for (1000 loops)
   MPI_Put
MPI_Win_order
for (1000 loops)
   MPI_Put

... can be more efficient than simply using accumulates, which has to order all operations.

hjelmn commented 5 years ago

I agree with this proposal. This brings me back to one of my pet peeves about MPI_Win_fence(). It is a fence + barrier operation. Wish it had been deprecated in MPI-3.0.

The question is what do we call this new operation given the obvious name is taken?

jeffhammond commented 5 years ago

No, the proposed approach is more flexible. For example,

for (1000 loops)
   MPI_Put
MPI_Win_order
for (1000 loops)
   MPI_Put

Can the difference between order and flush be observed reliably relative to 2000 RMA operations?

I’m not sure you’ll even be able to see a difference in shared memory with an appropriate weakly ordered CPU like POWER. How many cycles difference between msync and mfence (or whatever the instructions are called)? How many cycles for loads from memory (not cache)?

To be clear, we should not add features to the MPI standard to optimize SHMEM microbenchmarks.

... can be more efficient than simply using accumulates, which has to order all operations.

It has to order the operations that are used on the window in question. Add info key to subset operations per window if it will help.

pavanbalaji commented 5 years ago

I think for 1000 operations, you'll surely see some difference, at least with MPICH. If not, you can replace that with 100 or 500 or whatever number makes sense for your network. Moreover, for networks that natively provide ordered communication, the difference is even higher. The point is very simple: accumulate gives ordering for all operations; this proposal gives more fine-grained control over that ordering.

One aspect that the proposal does not explicitly mention (but I think is implied) is that of atomicity. Presumably, the operations after the MPI_Win_order operation are targeting the same memory locations as those before the MPI_Win_order (otherwise you wouldn't need ordering anyway). So this proposal allows the user to tell MPI that everything before MPI_Win_order is nonatomic.

devreal commented 5 years ago

There is another use-case for fences that may be more intuitive and cannot be realized with the ordering guarantees of RMA atomics per se: any algorithmic requirement on the order in which different variables should be modified. An example I'm currently working on is a signalling scheme where a process writes to one or more variables on another process and eventually sets a flag to signal completion of the modifications:

Process 0:

PUT(1->win, val_offset, val);
FENCE(1->win);
PUT(1->win, flag_offset, 1);
FLUSH(1->win);

Process 1 (waiting for Process 0 to complete, flushes omitted):

while (GET(1->win, flag_offset) == 0) { }
GET(1->win, val_offset);

This can easily be extended to more complex scenarios (where you couldn't just replace the sync through flag_offset with a send/recv or barrier). If something like the above code is done in a critical part of the application repeatedly, the latency of a second flush (instead of the fence) on Process 1 really adds up.
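
For reference, the origin side of this pattern as it has to be written today (the fence replaced by a full flush); val_disp and flag_disp are assumed displacements of two distinct MPI_INT locations in the target's window, and a passive-target epoch is assumed to be open:

    #include <mpi.h>

    /* Process 0: write the value, then raise the flag, with remote completion
       of the value enforced in between (where the proposed fence would go). */
    void signal_update(MPI_Win win, int target, MPI_Aint val_disp,
                       MPI_Aint flag_disp, int val)
    {
        int one = 1;

        MPI_Put(&val, 1, MPI_INT, target, val_disp, 1, MPI_INT, win);
        MPI_Win_flush(target, win);   /* today: full remote completion */
        MPI_Put(&one, 1, MPI_INT, target, flag_disp, 1, MPI_INT, win);
        MPI_Win_flush(target, win);
    }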

AFAICS, a flush with remote completion should really only be required in two scenarios:

1) If there are data dependencies between two operations at the origin, i.e., operation Op2 has the result of a previous Op1 as input (e.g., fetching an offset in Op1 to write to in Op2); or

2) If completion at the target has to be ensured before some out-of-band (read: non-RMA) signalling is done, e.g., before entering a barrier or sending a message.

I will try to come up with a benchmark to see whether there is a significant difference (I'd be surprised/disappointed if there wasn't, what's a fence for then otherwise? ^^).

jeffhammond commented 5 years ago

@devreal You're reimplementing the SHMEM send-recv pattern, which is best implemented in MPI using MPI_Send and MPI_Recv. Put-with-notification paired with spinning on a read is overrated as a message-passing implementation.

I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently. When I write such code, I always use MPI_Fetch_and_op(NO_OP).

jeffhammond commented 5 years ago

@pavanbalaji I meant whether you'd notice fence vs flush in the middle of 1000 RMA ops. There are two scenarios to optimize for: latency and throughput. Accumulate ops probably minimize latency whereas non-atomic RMA plus flush optimizes for throughput. Sure, they aren't trivially interchangeable inside of OSHMPI, but a proper MPI application can use both depending on need.

devreal commented 5 years ago

@jeffhammond As I said, that is a trivial example. Consider all processes doing similar data exchanges with random targets. Blocking send/recv won't get you very far because you easily end up with a pair of processes stuck in send. Once you use non-blocking sends and recvs, you start spending quite some time testing for completed requests. I'm working on a paper that explores the merits of using RMA for this type of communication. Latency is not one of them (at least in its current state); throughput actually is on some systems.

Apart from that, in the PGAS framework I'm working on (DASH; based on MPI RMA as a communication backend) we are forced to finish pretty much every put with a remote flush (except if the users know what they are doing and tell us so) because the next put or get might point to the same remote memory location and we have to ensure correct ordering. Tracking accesses to individual global memory locations is not an option for us. Having support for ordered RMA without forcing remote completion would be tremendously helpful. Sure, serializing RMA accesses that way might incur some performance penalty compared to non-ordered puts (additional function call, serialized processing, ordering overhead), but I expect it to be significantly faster than what we have to do right now. (Of course we can discuss whether single-element global memory access is worth it after all, but that's beside the point here.)

Plus, many systems offer this feature so why not provide access to it?

I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently.

The result is undefined in that you can end up reading partial updates. That shouldn't be a problem if you flip a value from zero to one. And the actual implementation indeed uses atomics because there may be more than one writer. The ordering issue I tried to show remains, though.

jdinan commented 5 years ago

@devreal Having spent a fair amount of time working on both MPI RMA and OpenSHMEM, I can tell you that (1) fence is a concept worth having in MPI RMA and (2) the performance benefit is network dependent. On ordered networks like InfiniBand, the fence is effectively a no-op (yay). On most unordered networks, the fence is effectively the same thing as a flush (oh...). If you happen to have triggered ops, you can do something more creative, but there are few such networks.

I'm all for adding this to MPI RMA, but bear in mind the performance caveats -- if you have a network that is good at PGAS (i.e. reliable, unordered datagrams) there's a good chance it will be bad at fence.

FWIW, you can use MPI accumulates if you want ordering. This works fine for scalar operations, but most networks have lower throughput for vector atomics versus put/get.

jeffhammond commented 5 years ago

We let RMA users turn off ordering in accumulate. Why not let them turn off atomicity as well? We'd have to change a nontrivial amount of text but the end result would be what appears to be desired here, and perhaps a more efficient implementation of it, since it would not require an additional function call to prescribe ordered put and get.

Feature       ordered           unordered
atomic        accumulate ops    accumulate ops + info
not atomic    put/get + flush   put/get

Performance   ordered   unordered
atomic        🙂        🤔
not atomic    😕        🙂
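
For the atomic-but-unordered cell, the existing mechanism is the accumulate_ordering info key at window creation; a minimal sketch:

    #include <mpi.h>

    /* Allocate a window whose accumulate operations carry no ordering
       guarantees, via the standard "accumulate_ordering" info key. */
    MPI_Win create_unordered_window(MPI_Comm comm, MPI_Aint size, void **baseptr)
    {
        MPI_Info info;
        MPI_Win  win;

        MPI_Info_create(&info);
        MPI_Info_set(info, "accumulate_ordering", "none"); /* default: "rar,raw,war,waw" */
        MPI_Win_allocate(size, 1 /* disp_unit */, info, comm, baseptr, &win);
        MPI_Info_free(&info);
        return win;
    }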
pavanbalaji commented 5 years ago

What does ordering mean without atomicity for a single memory location?

I think @devreal's use case is for multiple separate memory locations. So, in some sense, it's orthogonal to PUT/GET vs. ACC/GACC.

pavanbalaji commented 5 years ago

I think the point of this proposal is that FLUSH does two things: (1) ensures operations after the FLUSH are ordered with operations before FLUSH and (2) makes sure the data is visible to other processes. This proposal is to decouple them, so the user can get just (1) without (2), if needed. This is independent of ACC/GACC vs. PUT/GET.

jdinan commented 5 years ago

Thanks, I had missed that. This is much stronger than what shmem_fence guarantees. shmem_fence only guarantees that the targeted process will see the operations in order. That is, point-to-point, not global ordering.

pavanbalaji commented 5 years ago

I believe this proposal is point-to-point ordering too, i.e., a subset of FLUSH, not FLUSH_ALL.

jeffhammond commented 5 years ago

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

This behavior is equivalent to non-atomic store instructions that execute in-order.

I think @devreal's use case is for multiple separate memory locations.

That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.

pavanbalaji commented 5 years ago

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?

This behavior is equivalent to non-atomic store instructions that execute in-order.

FWIW, store instructions are ordered only on some architectures, such as x86. But that's also an orthogonal discussion.

I think @devreal's use case is for multiple separate memory locations.

That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.

I was just clarifying what I believe to be the intent of the proposal based on the discussion. You are right that the proposal is not defined well enough to actually confirm that.

devreal commented 5 years ago

@jeffhammond @pavanbalaji You're right, I didn't mean to put up a full proposal but rather meant to ask whether this topic is of interest (it seems at least controversial ^^).

I put together a proposal as an extension to the MPI standard document (page 476, Ctrl+F for "ticket10").

Let me clarify what semantics I want to achieve. From the proposal:

MPI_WIN_ORDER ensures that all previously issued RMA operations initiated by the calling process to the target rank on the specified window complete at the target before operations issued following this call.

I want to cover both cases that were discussed here:

1) Ordering writes and reads to the same memory location at the same target (my original example). Yes, this can be done using atomics. However, my intent is not to have atomic semantics, i.e., I know there won't be concurrent writes.

2) Ordering writes to non-overlapping memory locations at the same target (my second example: https://github.com/mpiwg-rma/rma-issues/issues/10#issuecomment-485698612). This cannot be done using atomic operations but would be possible using fence semantics.

Both cases can be solved using flushes waiting for remote completion. However, there are valid use-cases where only partial ordering is required but a flush with remote completion can be deferred until a later point when all operations have been posted. In essence, I can communicate my intent to the MPI implementation and let it decide what the best strategy to achieve what I want is on the current platform. If the best strategy happens to be a flush, so be it (see advice to implementors in the annotated standard).

Also to clarify: there is no notion of global ordering, the ordering is always only guaranteed for operations from the same origin to the same target (on the same window).


As promised, I did some measurements using OpenSHMEM: two put operations to distinct memory locations on the same target PE, required to be ordered, completed by a quiet:

      long val = 0;
      long* target = shmem_malloc(2 * sizeof(long)); /* symmetric allocation */
      shmem_long_put(target, &val, 1, target_pe);
      shmem_fence(); /* or shmem_quiet(), for the comparison */
      shmem_long_put(target + 1, &val, 1, target_pe);
      shmem_quiet();

The ordering is guaranteed either by a fence or by a quiet. I measured 100k repetitions and the values were reproducible across several runs.

On a Cray XC40 using Cray's shmem implementation, I do not observe any difference between fence and quiet. Unfortunately, I wasn't able to get either the OpenSHMEM reference implementation or Open MPI's shmem implementation to work in finite time (that is, Open MPI + UCX went up in flames, and the reference implementation comes with a tail of dependencies I haven't bothered installing yet).

However, on a Mellanox ConnectX-3 cluster using Open MPI + UCX, the difference between fence and quiet is almost 1.8x, i.e., calling shmem_fence in between the two operations instead of shmem_quiet reduces the time for the two operations from 4.1us to 2.2us.

That seems to confirm what @jdinan said about the differences in networks (or is it just that Cray's implementation is bad at fences?)


@jdinan I took a closer look at the OpenSHMEM standard, which explicitly states that non-blocking get operations are not included in the ordering guarantees of a fence and that a fence only guarantees delivery, not completion. I don't see these limitations explicitly stated for UCX or ibverbs. Any insight into why OpenSHMEM provides weaker guarantees here?

jeffhammond commented 5 years ago

What does ordering mean without atomicity for a single memory location?

It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.

Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?

This comment makes no sense. If writes are ordered, data is not corrupted, because they happen in a well-defined sequence. Obviously, this only works for a single source, but that is what we are discussing.

jeffhammond commented 5 years ago

@devreal I believe that Cray's implementation of fence is quiet, because they use end-to-end rather than link-level reliability. This is merely speculation however - only somebody from Cray can speak with authority on what their proprietary blobs do.

pavanbalaji commented 5 years ago

If writes are ordered, data is not corrupted, because they happen in a well-defined sequence.

That's a reasonable requirement. If that was what was proposed, then I missed it.

jdinan commented 5 years ago

@devreal OpenSHMEM fence is intended to be a write fence. Ordering for gets remains an open topic.

devreal commented 5 years ago

@jdinan Is that because it makes the implementation of fence easier/feasible on more architectures? Should we carry this restriction over to MPI?

jeffhammond commented 5 years ago

No, we should define order in a way that meets MPI’s goals of completeness and orthogonality. SHMEM is full of historical artifacts that only made sense on the Cray T3E or which don’t support the full range of one-sided communication usage models.

jdinan commented 5 years ago

It's common to be able to order/complete reads and writes separately. I don't think separating them is harmful, as long as it doesn't leave gaps in the memory model. There is an assumption in SHMEM that blocking Gets are ordered (because they complete in program order). If you implement Get operations using ld/st instructions on a weakly ordered architecture, the operations may not be ordered. This has forced some implementations to add a read fence to Gets, which is bad for performance.

jeffhammond commented 5 years ago

@jdinan Yes, these should be separable. Something like MPI_Win_fence assertions should do the trick.

devreal commented 5 years ago

@jdinan @jeffhammond I was just about to add an assert parameter that accepts something like MPI_MODE_NOGET to limit the ordering to puts and atomics. However, Section 11.5.5 states:

The assert argument does not change program semantics if it provides correct information on the program

AFAICS, the assert MPI_MODE_NOGET would change program semantics. Turning things around and allowing the user to pass MPI_MODE_NOPUT ~and/or MPI_MODE_NOACC (tbd)~ to MPI_WIN_ORDER to signal that there have not been preceding modifications would allow the implementation to avoid the read fence and still provide ordering with succeeding puts. It's not the same as limiting MPI_WIN_ORDER to put and atomics.

As an alternative, we could add a fence_type parameter (similar to lock_type for MPI_Win_lock) that determines whether MPI_WIN_ORDER includes get or not. In that case we could have MPI_ORDER_FULL and MPI_ORDER_NOGET. Unfortunately, MPI already defines MPI_ORDER_C and MPI_ORDER_FORTRAN so we have to find another namespace...

pavanbalaji commented 5 years ago

The semantics of these asserts need be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or all operations?

As an alternative approach, would MPI_WIN_IFLUSH solve the same issue?

devreal commented 5 years ago

The semantics of these asserts need be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or all operations?

MPI_MODE_NOGET would mean that the ordering request does not apply to preceding and succeeding MPI_GET calls but only affect put and atomic operations. The user is still allowed to use MPI_GET before and after the call. I'm not sure if there is much use for ordering MPI_GET operations independent of put/atomic operations. MPI_MODE_NOGET isn't strictly an assertion but a request, something I didn't immediately think about. Alternatively, MPI_MODE_NOPUT could be used to signal that there have not been conflicting put/accumulate operations that would require a memory read fence on some architectures. Maybe that is enough already?

Would MPI_WIN_IFLUSH trigger a flush and return a request that can be used to wait for all previously posted operations to complete? Would it imply ordering relative to operations posted after the call to MPI_WIN_IFLUSH? In principle, this seems like an interesting concept, also in light of the discussion on request-based atomic operation (https://github.com/mpi-forum/mpi-issues/issues/128). It seems a bit heavier and involves more work on the user-level due to request handling though.

pavanbalaji commented 5 years ago

If we need to order future GETs with previous PUTs, for example, does this proposal require us to force ordering for all operations?

devreal commented 5 years ago

@pavanbalaji I'm not sure I fully understand your question. Do you mean whether or not the proposal requires ordering of non-conflicting PUTs and GETs? In its current form yes: all previous PUTs would have to complete at the target before any future GETs may read their value from the target. The current wording is mostly borrowed from OpenShmem, which doesn't include GETs so that hasn't been an issue there...

I guess this can be relaxed to only require ordering of conflicting PUTs and GETs. After all, GETs do not observe the outcome of non-conflicting PUTs...

hjelmn commented 5 years ago

Should we make what the fence applies to a flag? Make the flags bitwise-combinable: MPI_ORDER_READ, MPI_ORDER_WRITE. That would be useful for shared memory.

pavanbalaji commented 5 years ago

@devreal My point is that your proposed hints do not have read-after-write or write-after-read ordering. They only have write-after-write (i.e., put-order-put) and read-after-read (i.e., get-order-get). I don't know if they are needed, but I'm just calling out the holes so we can make an explicit decision to either add them or ignore them.

devreal commented 5 years ago

@hjelmn Is it a problem that MPI_ORDER_C already exists, i.e., that there is somewhat of a namespace clash?

@pavanbalaji @hjelmn Is it sufficient to control whether GETs are included in the ordering (as per @hjelmn's suggestion), or is more fine-grained control similar to the accumulate_ordering info key more helpful (bitwise combinable MPI_ORDER_RAW, MPI_ORDER_WAR, MPI_ORDER_WAW, MPI_ORDER_RAR)? Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything else but a noop?

pavanbalaji commented 5 years ago

I generally like having more control on the actual guarantees provided (so more options). But I was merely asking the question, not suggesting you change your proposal to be one way or another.

devreal commented 5 years ago

@pavanbalaji I understand, and your question pointed at something I hadn't considered :) The current wording with regard to GETs is too strict, and there seems to be common interest in adding fine-grained control. I'd like to reach a consensus on what that control should look like :)

pavanbalaji commented 5 years ago

Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything else but a noop?

I don't think this would be a no-op. Consider this, admittedly convoluted, example:

All processes create a window with no accumulate ordering. Initial value of X (located on process P2) is 0.

P0:
    a = GET_ACCUMULATE(x, NO_OP)
    WIN_ORDER(RAR)
    b = GET_ACCUMULATE(x, NO_OP)
    WIN_FLUSH
    if (b == 0)
        assert(a == 0);

P1:
   ACCUMULATE(x = 1)
   WIN_FLUSH
devreal commented 5 years ago

After skimming through the documentation of existing memory fences I updated the proposal to include an order_type (as proposed by @hjelmn) that is the bit-wise OR combination of one or more of MPI_WIN_ORDER_READ, MPI_WIN_ORDER_WRITE, and MPI_WIN_ORDER_DATA, plus MPI_WIN_ORDER_FULL, which is a shorthand for all three.

MPI_WIN_ORDER_READ requires completion of all previously issued reads at the target before reads issued following the call. This is similar to an lfence.

MPI_WIN_ORDER_WRITE requires completion of all previously issued RMA writes at the target before writes issued following the call. This is similar to an sfence.

MPI_WIN_ORDER_DATA requires the ordering of reads and writes on overlapping memory regions, similar to the default for accumulate operations but including non-atomic operations. This should not need special instructions on most shared memory architectures.

MPI_WIN_ORDER_FULL is similar to an mfence.

For inter-node communication, all ordering types will require a fence in network hardware (where available) or some other form of ordering mechanism.
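
To illustrate the intended usage (MPI_Win_order and the MPI_WIN_ORDER_* constants below are from the proposal, not part of the MPI standard, and the argument order is only illustrative), the earlier flag example would become:

    /* Hypothetical, per the proposal draft: order the two writes to the same
       target without forcing remote completion in between. */
    MPI_Put(&val, 1, MPI_INT, target, val_disp, 1, MPI_INT, win);
    MPI_Win_order(MPI_WIN_ORDER_WRITE, target, win);  /* write-write ordering only */
    MPI_Put(&one, 1, MPI_INT, target, flag_disp, 1, MPI_INT, win);
    MPI_Win_flush(target, win);                       /* remote completion at the end */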

The proposal now also contains two examples, a rationale for the order types, and an adapted Section 11.7.2 (Ordering): mpi-report.pdf

Limitations

Any feedback would be much appreciated :)

devreal commented 5 years ago

I updated the draft document: mpi-report-memalign.pdf

I am not sure about the exclusion of loads and stores from the ordering. Including them might be helpful for users of shared-memory windows but would prevent implementations from deferring the memory fence to the next put/get to/from shared-memory targets. Opinions?

Unless there are objections or concerns about the current design, I'd like to put this into a PR in time for a first reading in Chicago. Please let me know if you think that's premature.