devreal opened this issue 5 years ago
I would just write this:
```
ACCUMULATE(op=REPLACE)
GET_ACCUMULATE(op=NOOP)
```
This is ordered, and the implementation uses the best method it can for ordering. If there's a hardware fence operation (there almost certainly isn't), it can inject that after the ACCUMULATE or before the GET_ACCUMULATE.
No, the proposed approach is more flexible. For example,
```
for (1000 loops)
    MPI_Put
MPI_Win_order
for (1000 loops)
    MPI_Put
```
... can be more efficient than simply using accumulates, which has to order all operations.
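For contrast, the only way to get this ordering guarantee in current MPI is a remote-completion flush between the two batches. A hedged C sketch of that status-quo pattern (buffer layout, counts, and names are illustrative assumptions, not from the original comment):

```c
#include <mpi.h>

/* Two batches of puts to the same target; the MPI_Win_flush in the middle
 * forces remote completion of the first batch, which is stronger (and on
 * some networks slower) than the ordering-only guarantee the proposed
 * MPI_Win_order would provide. */
void two_ordered_batches(MPI_Win win, int target, const double *buf, int n)
{
    for (int i = 0; i < n; i++)
        MPI_Put(&buf[i], 1, MPI_DOUBLE, target, i, 1, MPI_DOUBLE, win);

    MPI_Win_flush(target, win); /* ordering point: remote completion today */

    for (int i = 0; i < n; i++)
        MPI_Put(&buf[i], 1, MPI_DOUBLE, target, n + i, 1, MPI_DOUBLE, win);

    MPI_Win_flush(target, win);
}
```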
I agree with this proposal. This brings me back to one of my pet peeves about MPI_Win_fence(). It is a fence + barrier operation. Wish it had been deprecated in MPI-3.0.
The question is what do we call this new operation given the obvious name is taken?
No, the proposed approach is more flexible. For example, for (1000 loops) MPI_Put; MPI_Win_order; for (1000 loops) MPI_Put

Can the difference between order and flush be observed reliably relative to 2000 RMA operations?
I’m not sure you’ll even be able to see a difference in shared memory with an appropriately weakly ordered CPU like POWER. How many cycles of difference between msync and mfence (or whatever the instructions are called)? How many cycles for loads from memory (not cache)?
To be clear, we should not add features to the MPI standard to optimize SHMEM microbenchmarks.
... can be more efficient than simply using accumulates, which has to order all operations.
It has to order the operations that are used on the window in question. Add info key to subset operations per window if it will help.
I think for 1000 operations, you'll surely see some difference, at least with MPICH. If not, you can replace that with 100 or 500 or whatever number makes sense for your network. Moreover, for networks that natively provide ordered communication, the difference is even higher. The point is very simple: accumulate imposes ordering on all operations; this proposal gives more fine-grained control over that ordering.
One aspect that the proposal does not explicitly mention (but I think is implied) is that of atomicity. Presumably, the operations after the MPI_Win_order are targeting the same memory locations as the operations before the MPI_Win_order (otherwise you wouldn't need ordering anyway). So this proposal allows the user to tell MPI that everything before MPI_Win_order is nonatomic.
There is another use-case for fences that may be more intuitive and cannot be realized with the ordering guarantees of RMA atomics per-se: any algorithmic requirement for the order in which different variables should be modified. An example I'm currently working on is a signalling scheme where a process writes to one or more variables on another process and eventually sets a flag to signal completion of the modifications:
Process 0:

```
PUT(1->win, val_offset, val);
FENCE(1->win);
PUT(1->win, flag_offset, 1);
FLUSH(1->win);
```

Process 1 (waiting for Process 0 to complete, flushes omitted):

```
while (GET(1->win, flag_offset) == 0) { }
GET(1->win, val_offset);
```
This can easily be extended to more complex scenarios (where you couldn't just replace the sync through flag_offset with a send/recv or barrier). If something like the above code is executed repeatedly in a critical part of the application, the latency of a second flush (instead of the fence) on Process 0 really adds up.
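With MPI as it stands, the FENCE above has to be spelled as a remote-completion flush. A minimal C sketch of the origin side under that constraint (window layout, displacements, and names are illustrative assumptions):

```c
#include <mpi.h>
#include <stdint.h>

/* Origin side of the signalling scheme: write a value, then raise a flag.
 * Today the only portable way to order the two puts is a remote-completion
 * flush in between; the proposal would replace it with a cheaper
 * ordering-only call. val_disp/flag_disp are made-up examples. */
void put_value_then_flag(MPI_Win win, int target, uint64_t val,
                         MPI_Aint val_disp, MPI_Aint flag_disp)
{
    uint64_t one = 1;
    MPI_Put(&val, 1, MPI_UINT64_T, target, val_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(target, win); /* ordering point; the fence would go here */
    MPI_Put(&one, 1, MPI_UINT64_T, target, flag_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(target, win); /* complete both puts remotely */
}
```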
AFAICS, a flush with remote completion should really only be required in two scenarios:
1) If there are data dependencies between two operations at the origin, i.e., operation Op2 has the result of a previous Op1 as input (e.g., fetching an offset in Op1 to write to in Op2); or
2) If completion at the target has to be ensured before some out-of-band (read: non-RMA) signalling is done, e.g., before entering a barrier or sending a message.
I will try to come up with a benchmark to see whether there is a significant difference (I'd be surprised/disappointed if there wasn't; what would a fence be for otherwise? ^^).
@devreal You're reimplementing the SHMEM send-recv pattern, which is best implemented in MPI using MPI_Send and MPI_Recv. Put-with-notification paired with spinning on a read is overrated as a message-passing implementation.
I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently. When I write such code, I always use MPI_Fetch_and_op(NO_OP).
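The atomic-read idiom mentioned here can be written with the existing MPI_Fetch_and_op API; a minimal sketch (window contents and displacement are assumptions):

```c
#include <mpi.h>
#include <stdint.h>

/* Atomically read one uint64_t at displacement `disp` on rank `target`.
 * MPI_NO_OP turns the fetch-and-op into a pure atomic read, which is
 * well-defined even if the location is updated concurrently by accumulate
 * operations; plain MPI_Get gives no such guarantee. */
uint64_t atomic_read(MPI_Win win, int target, MPI_Aint disp)
{
    uint64_t result;
    uint64_t unused = 0; /* origin buffer is ignored for MPI_NO_OP */
    MPI_Fetch_and_op(&unused, &result, MPI_UINT64_T,
                     target, disp, MPI_NO_OP, win);
    MPI_Win_flush(target, win); /* complete the read before using result */
    return result;
}
```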
@pavanbalaji I meant whether you'd notice fence vs flush in the middle of 1000 RMA ops. There are two scenarios to optimize for: latency and throughput. Accumulate ops probably minimize latency whereas non-atomic RMA plus flush optimizes for throughput. Sure, they aren't trivially interchangeable inside of OSHMPI, but a proper MPI application can use both depending on need.
@jeffhammond As I said, that is a trivial example. Consider all processes doing similar data exchange with random targets. Blocking send/recv won't get you very far because you easily end up with a pair of processes stuck in send. Once you use non-blocking sends and recvs you start spending quite some time testing for completed requests. I'm working on a paper that explores the merits of using RMA for this type of communication. Latency is not one of them (at least in its current state), throughput actually is on some systems.
Apart from that, in the PGAS framework I'm working on (DASH; based on MPI RMA as a communication backend) we are forced to finish pretty much every put with a remote flush (except if the users know what they are doing and tell us so) because the next put or get might point to the same remote memory location and we have to ensure correct ordering. Tracking accesses to individual global memory locations is not an option for us. Having support for ordered RMA without forcing remote completion would be tremendously helpful. Sure, serializing RMA accesses that way might incur some performance penalty compared to non-ordered puts (additional function call, serialized processing, ordering overhead) but I expect it to be significantly faster than what we have to do right now. (Of course we can discuss whether single-element global memory access is worth it after all, but that's beside the point here.)
Plus, many systems offer this feature so why not provide access to it?
I can't remember right now, but I also thought that it is undefined whether MPI_Get can read a memory location that may be updated concurrently.
The result is undefined in that you can end up reading partial updates. That shouldn't be a problem if you flip a value from zero to one. And the actual implementation indeed uses atomics because there may be more than one writer. The ordering issue I tried to show remains, though.
@devreal Having spent a fair amount of time working on both MPI RMA and OpenSHMEM, I can tell you that (1) fence is a concept worth having in MPI RMA and (2) the performance benefit is network dependent. On ordered networks like InfiniBand, the fence is effectively a no-op (yay). On most unordered networks, the fence is effectively the same thing as a flush (oh...). If you happen to have triggered ops, you can do something more creative, but there are few such networks.
I'm all for adding this to MPI RMA, but bear in mind the performance caveats -- if you have a network that is good at PGAS (i.e. reliable, unordered datagrams) there's a good chance it will be bad at fence.
FWIW, you can use MPI accumulates if you want ordering. This works fine for scalar operations, but most networks have lower throughput for vector atomics versus put/get.
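To illustrate the accumulate route: with the default accumulate_ordering (rar,raw,war,waw), same-origin accumulates to the same target location are applied in order without intermediate flushes. A hedged sketch (displacement and data are made up):

```c
#include <mpi.h>

/* "Ordered put" via MPI_Accumulate with MPI_REPLACE: under the default
 * accumulate_ordering info, these two updates to the same location from
 * the same origin are applied in order; only one flush is needed at the
 * end for remote completion. */
void ordered_replace(MPI_Win win, int target, const long *v1, const long *v2)
{
    MPI_Accumulate(v1, 1, MPI_LONG, target, 0, 1, MPI_LONG, MPI_REPLACE, win);
    MPI_Accumulate(v2, 1, MPI_LONG, target, 0, 1, MPI_LONG, MPI_REPLACE, win);
    MPI_Win_flush(target, win);
}
```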
We let RMA users turn off ordering in accumulate. Why not let them turn off atomicity as well? We'd have to change a nontrivial amount of text but the end result would be what appears to be desired here, and perhaps a more efficient implementation of it, since it would not require an additional function call to prescribe ordered put and get.
Feature | ordered | unordered |
---|---|---|
atomic | accumulate ops | accumulate ops + info |
not atomic | put/get + flush | put/get |
Performance | ordered | unordered |
---|---|---|
atomic | 🙂 | 🤔 |
not atomic | 😕 | 🙂 |
What does ordering mean without atomicity for a single memory location?
I think @devreal's use case is for multiple separate memory locations. So, in some sense, it's orthogonal to PUT/GET vs. ACC/GACC.
I think the point of this proposal is that FLUSH does two things: (1) ensures operations after the FLUSH are ordered with operations before the FLUSH and (2) makes sure the data is visible to other processes. This proposal is to decouple them, so the user can get just (1) without (2), if needed. This is independent of ACC/GACC vs. PUT/GET.
Thanks, I had missed that. This is much stronger than what shmem_fence guarantees. shmem_fence only guarantees that the targeted process will see the operations in order. That is, point-to-point, not global ordering.

I believe this proposal is point-to-point ordering too, i.e., a subset of FLUSH, not FLUSH_ALL.
What does ordering mean without atomicity for a single memory location?
It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.
This behavior is equivalent to non-atomic store instructions that execute in-order.
I think @devreal's use case is for multiple separate memory locations.
That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.
What does ordering mean without atomicity for a single memory location?
It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.
Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?
This behavior is equivalent to non-atomic store instructions that execute in-order.
FWIW, store instructions are ordered only on some architectures, such as x86. But that's also an orthogonal discussion.
I think @devreal's use case is for multiple separate memory locations.
That isn't stated so I cannot assume it. It is completely reasonable to want updates that appear in order but are not atomic when the user knows that any element is only updated by a single source.
I was just clarifying what I believe to be the intent of the proposal based on the discussion. You are right that the proposal is not defined well enough to actually confirm that.
@jeffhammond @pavanbalaji You're right, I didn't mean to put up a full proposal but rather meant to ask whether this topic is of interest (it seems at least controversial ^^).
I put together a proposal as an extension to the MPI standard document (page 476, Ctrl+F for "ticket10").
Let me clarify what semantics I want to achieve. From the proposal:
MPI_WIN_ORDER ensures that all previously issued RMA operations initiated by the calling process to the target rank on the specified window complete at the target before operations issued following this call.
I want to cover both cases that were discussed here: 1) Ordering writes and reads to the same memory location at the same target (my original example). Yes, this can be done using atomics. However, my intent is not to have atomic semantics, i.e., I know there won't be concurrent writes.
2) Ordering writes to non-overlapping memory locations at the same target (my second example: https://github.com/mpiwg-rma/rma-issues/issues/10#issuecomment-485698612). This cannot be done using atomic operations but would be possible using fence semantics.
Both cases can be solved using flushes waiting for remote completion. However, there are valid use-cases where only partial ordering is required and a flush with remote completion can be deferred until a later point when all operations have been posted. In essence, I can communicate my intent to the MPI implementation and let it decide the best strategy to achieve it on the current platform. If the best strategy happens to be a flush, so be it (see the advice to implementors in the annotated standard).
Also to clarify: there is no notion of global ordering, the ordering is always only guaranteed for operations from the same origin to the same target (on the same window).
As promised I did some measurements using OpenSHMEM: two put operations to distinct memory locations on the same target PE, required to be ordered, completed by quiet:
```
uint64_t val = 0;
uint64_t* target_ptr = shmem_malloc(2 * sizeof(uint64_t));
shmem_long_put(target_ptr, &val, 1, target_pe);
shmem_fence(); /* or: shmem_quiet(); */
shmem_long_put(target_ptr + 1, &val, 1, target_pe);
shmem_quiet();
```
The ordering is either guaranteed by a fence or a quiet. I measured 100k repetitions and the values were reproducible across several runs.
On a Cray XC40 using Cray's shmem implementation, I do not observe any difference between fence and quiet. Unfortunately, I wasn't able to get either the OpenSHMEM reference implementation or Open MPI's shmem implementation to work within finite time (that is, Open MPI + UCX went up in flames and the reference implementation comes with a tail of dependencies I haven't bothered installing yet).
However, on a Mellanox ConnectX-3 cluster using Open MPI + UCX, the difference between fence and quiet is almost 1.8x, i.e., calling shmem_fence in between the two operations instead of shmem_quiet reduces the time for two operations from 4.1us to 2.2us.
That seems to confirm what @jdinan said about the differences in networks (or is it just that Cray's implementation is bad at fences?)
@jdinan I took a closer look at the OpenSHMEM standard, which explicitly states that non-blocking get operations are not included in the ordering guarantees of a fence and that a fence only guarantees delivery, not completion. I don't see these limitations explicitly stated for UCX or ibverbs. Any insight into why OpenSHMEM provides weaker guarantees here?
What does ordering mean without atomicity for a single memory location?
It means that the updates from a single source have a well-defined ordering, which produces the same results as ordered atomic updates because the updates are not concurrent. It does not imply that updates from multiple sources are defined, in contrast to atomic operations.
Note that I said to a single memory location. In that case, the data will be corrupted. Why would one need ordering if the data is corrupted anyway?
This comment makes no sense. If writes are ordered, data is not corrupted, because they happen in a well-defined sequence. Obviously, this only works for a single source, but that is what we are discussing.
@devreal I believe that Cray's implementation of fence is quiet, because they use end-to-end rather than link-level reliability. This is merely speculation however - only somebody from Cray can speak with authority on what their proprietary blobs do.
If writes are ordered, data is not corrupted, because they happen in a well-defined sequence.
That's a reasonable requirement. If that was what was proposed, then I missed it.
@devreal OpenSHMEM fence is intended to be a write fence. Ordering for gets remains an open topic.
@jdinan Is that because it makes the implementation of fence easier/feasible on more architectures? Should we carry this restriction over to MPI?
No, we should define order in a way that meets MPI’s goals of completeness and orthogonality. SHMEM is full of historical artifacts that only made sense on the Cray T3E or which don’t support the full range of one-sided communication usage models.
It's common to be able to order/complete reads and writes separately. I don't think separating them is harmful, as long as it doesn't leave gaps in the memory model. There is an assumption in SHMEM that blocking Gets are ordered (because they complete in program order). If you implement Get operations using ld/st instructions on a weakly ordered architecture, the operations may not be ordered. This has forced some implementations to add a read fence to Gets, which is bad for performance.
@jdinan Yes, these should be separable. Something like MPI_Win_fence assertions should do the trick.
@jdinan @jeffhammond I was just about to add an assert parameter that accepts something like MPI_MODE_NOGET to limit the ordering to puts and atomics. However, Section 11.5.5 states:

The assert argument does not change program semantics if it provides correct information on the program

AFAICS, the assert MPI_MODE_NOGET would change program semantics. Turning things around and allowing the user to pass MPI_MODE_NOPUT ~and/or MPI_MODE_NOACC (tbd)~ to MPI_WIN_ORDER to signal that there have not been preceding modifications would allow the implementation to avoid the read fence and still provide ordering with succeeding puts. It's not the same as limiting MPI_WIN_ORDER to puts and atomics, though.
As an alternative, we could add a fence_type parameter (similar to lock_type for MPI_Win_lock) that determines whether MPI_WIN_ORDER includes get or not. In that case we could have MPI_ORDER_FULL and MPI_ORDER_NOGET. Unfortunately, MPI already defines MPI_ORDER_C and MPI_ORDER_FORTRAN, so we have to find another namespace...
The semantics of these asserts need to be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or all operations?

As an alternative approach, would MPI_WIN_IFLUSH solve the same issue?
The semantics of these asserts need to be clearly stated. Does MPI_MODE_NOGET mean that I cannot use MPI_GET in my program, or does it mean that the MPI_GET calls are not ordered? If it's the latter, is that only for ordering with respect to other MPI_GET operations or all operations?
MPI_MODE_NOGET would mean that the ordering request does not apply to preceding and succeeding MPI_GET calls but only affects put and atomic operations. The user is still allowed to use MPI_GET before and after the call. I'm not sure if there is much use for ordering MPI_GET operations independent of put/atomic operations. MPI_MODE_NOGET isn't strictly an assertion but a request, something I didn't immediately think about. Alternatively, MPI_MODE_NOPUT could be used to signal that there have not been conflicting put/accumulate operations that would require a memory read fence on some architectures. Maybe that is enough already?

Would MPI_WIN_IFLUSH trigger a flush and return a request that can be used to wait for all previously posted operations to complete? Would it imply ordering relative to operations posted after the call to MPI_WIN_IFLUSH? In principle, this seems like an interesting concept, also in light of the discussion on request-based atomic operations (https://github.com/mpi-forum/mpi-issues/issues/128). It seems a bit heavier and involves more work on the user level due to request handling, though.
If we need to order future GETs with previous PUTs, for example, does this proposal require us to force ordering for all operations?
@pavanbalaji I'm not sure I fully understand your question. Do you mean whether or not the proposal requires ordering of non-conflicting PUTs and GETs? In its current form, yes: all previous PUTs would have to complete at the target before any future GETs may read their value from the target. The current wording is mostly borrowed from OpenSHMEM, whose fence does not cover GETs, so that hasn't been an issue there...
I guess this can be relaxed to only require ordering of conflicting PUTs and GETs. After all, GETs do not observe the outcome of non-conflicting PUTs...
Should we make it a flag controlling what the fence applies to? Make the flags bitwise-combinable: MPI_ORDER_READ, MPI_ORDER_WRITE. That would be useful for shared memory.
@devreal My point is that your proposed hints do not have read-after-write or write-after-read ordering. They only have write-after-write (i.e., put-order-put) and read-after-read (i.e., get-order-get). I don't know if they are needed, but I'm just calling out the holes so we can make an explicit decision to either add them or ignore them.
@hjelmn Is it a problem that MPI_ORDER_C already exists, i.e., that there is somewhat of a namespace clash?

@pavanbalaji @hjelmn Is it sufficient to control whether GETs are included in the ordering (as per @hjelmn's suggestion) or is more fine-grained control similar to the accumulate_ordering info key more helpful (bitwise-combinable MPI_ORDER_RAW, MPI_ORDER_WAR, MPI_ORDER_WAW, MPI_ORDER_RAR)? Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything else but a noop?
I generally like having more control on the actual guarantees provided (so more options). But I was merely asking the question, not suggesting you change your proposal to be one way or another.
@pavanbalaji I understand, and your question pointed at something I hadn't considered :) The current wording with regard to GETs is too strict and there seems to be a common interest in adding fine-grained control. I'd like to reach a consensus on what that control should look like :)
Also, would passing MPI_ORDER_READ or MPI_ORDER_RAR alone be anything else but a noop?
I don't think this would be a no-op. Consider this, admittedly convoluted, example:
All processes create a window with no accumulate ordering. The initial value of X (located on process P2) is 0.
P0:

```
a = GET_ACCUMULATE(x, NO_OP)
WIN_ORDER(RAR)
b = GET_ACCUMULATE(x, NO_OP)
WIN_FLUSH
if (b == 0)
    assert(a == 0);
```

P1:

```
ACCUMULATE(x = 1)
WIN_FLUSH
```
After skimming through the documentation of existing memory fences I updated the proposal to include an order_type (as proposed by @hjelmn) that is the bit-wise OR combination of one or more of MPI_WIN_ORDER_READ, MPI_WIN_ORDER_WRITE, and MPI_WIN_ORDER_DATA, plus MPI_WIN_ORDER_FULL, which is a shorthand for all three.

MPI_WIN_ORDER_READ requires completion of all previously issued reads at the target before reads issued following the call. This is similar to an lfence.

MPI_WIN_ORDER_WRITE requires completion of all previously issued RMA writes at the target before writes issued following the call. This is similar to an sfence.

MPI_WIN_ORDER_DATA requires the ordering of reads and writes on overlapping memory regions, similar to the default for accumulate operations but including non-atomic operations. This should not need special instructions on most shared-memory architectures.

MPI_WIN_ORDER_FULL is similar to an mfence.
For inter-node communication, all ordering types will require a fence in network hardware (where available) or some other form of ordering mechanism.
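To make the proposed interface concrete, a hypothetical usage sketch; MPI_Win_order and the MPI_WIN_ORDER_* constants are names from this proposal, not existing MPI API, and the signature is an assumption:

```c
#include <mpi.h>
#include <stdint.h>

/* Hypothetical: MPI_Win_order and MPI_WIN_ORDER_WRITE are proposal names,
 * not part of the MPI standard; displacements are made up for illustration. */
void signal_with_order(MPI_Win win, int target, uint64_t val,
                       MPI_Aint val_disp, MPI_Aint flag_disp)
{
    uint64_t flag = 1;
    MPI_Put(&val, 1, MPI_UINT64_T, target, val_disp, 1, MPI_UINT64_T, win);

    /* Order preceding RMA writes before succeeding RMA writes to this
     * target, without waiting for remote completion at the origin. */
    MPI_Win_order(MPI_WIN_ORDER_WRITE, target, win);

    MPI_Put(&flag, 1, MPI_UINT64_T, target, flag_disp, 1, MPI_UINT64_T, win);
    MPI_Win_flush(target, win); /* one remote completion for both puts */
}
```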
The proposal now also contains two examples, a rationale for the order types, and an adapted Section 11.7.2 (Ordering): mpi-report.pdf
Any feedback would be much appreciated :)
I updated the draft document: mpi-report-memalign.pdf
I am not sure about the exclusion of loads and stores from the ordering. It might be helpful for users of shared memory windows but would prevent implementations from deferring the memory fence to the next put/get to/from shared memory targets. Opinions?
Unless there are objections or concerns about the current design, I'd like to put this into a PR in time for a first reading in Chicago. Please let me know if you think that's premature.
Currently the MPI standard does not provide any ordering guarantees for puts and gets unless remote completion is performed in between two operations. For atomic operations, only accesses to same/overlapping memory regions are guaranteed to be ordered (if origin and target are the same).
As far as I can see, many (most?) high-performance networks provide the notion of a fence to inject an ordering request into a stream of operations, i.e., to ensure that one operation is completed before the next is performed, without having to wait for remote completion at the origin. Example:
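A sketch of the intended contrast (MPI_Win_order names the proposed call, not existing API; variables and displacements are illustrative):

```c
/* Today: ordering two puts requires remote completion of the first one. */
MPI_Put(&a, 1, MPI_INT, target, 0, 1, MPI_INT, win);
MPI_Win_flush(target, win);                 /* round-trip #1 */
MPI_Put(&b, 1, MPI_INT, target, 1, 1, MPI_INT, win);
MPI_Win_flush(target, win);                 /* round-trip #2 */

/* Proposed: inject an ordering request instead; only one round-trip. */
MPI_Put(&a, 1, MPI_INT, target, 0, 1, MPI_INT, win);
MPI_Win_order(target, win);                 /* hypothetical proposed call */
MPI_Put(&b, 1, MPI_INT, target, 1, 1, MPI_INT, win);
MPI_Win_flush(target, win);
```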
This may be more efficient than having two flushes because we only have one round-trip instead of two, which is esp. useful for latency-bound use-cases (single-value atomics/reads/writes) and to ensure memory consistency in frameworks built on top of MPI RMA.
Libraries such as OpenShmem and UCX both offer this functionality.
Hence my question: has this been discussed (and dismissed?) in the past? Is there a reason not to add something like MPI_Win_order that triggers a hardware fence if available (since MPI_Win_fence is already taken)? On networks that don't support fences in hardware, implementations may always fall back to a full flush, which is what the user would do otherwise at the moment.