openshmem-org / specification

OpenSHMEM Application Programming Interface
http://www.openshmem.org

Blocking gets are unordered #233

Open anshumang opened 6 years ago

anshumang commented 6 years ago

Umbrella issue #229

A blocking get operation (shmem_g, shmem_get, shmem_iget), as defined in the OpenSHMEM specification v1.4, returns only after the data has been delivered to the destination array at the local PE. For independent get operations, this requires implementations on relaxed memory model architectures to include appropriate memory barriers in each get operation, resulting in sub-optimal performance. UPDATE : The description of blocking get enforces completion. But does that also imply ordering? If not, this could be clarified further.

This proposal places the requirement on the programmer to use a fence where ordering is required between independent get operations. UPDATE : The reordering can be observed from another thread in the calling thread's PE. Is there any other way to observe the reordering? The suggestion from @jdinan below is to not consider the memory view from other threads when deciding OpenSHMEM semantics.

The completion semantics of get remain unchanged from the v1.4 specification: the result of the get is available for any dependent operation that appears after it, in program order.

In example 1, line b is data dependent on the result of the get operation in line a. Lines a and b are guaranteed to execute in program order. Hence, the output where j takes value 0 is prohibited.

Example 1
Input : x(1) = 1, i = 0, j = 0

a: i = g(x,1)
   | data dependency
   V
b: j = i

Output : i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : prohibited

In example 2, the get operations on lines a and c of PE 2 are unrelated and can be reordered under this proposal. Hence the outcome where j takes the value 0 on line c after i takes the value 1 on line a is allowed, as observed from PE 2.

Example 2
Input : x(1) = 0, y(1) = 0, i = 0, j = 0

PE 0
a: p(x,1)
b: fence()
c: p(y,1)

PE 2
a: i = g(y, 1)
b: use i (UPDATE : b is unnecessary; the interpretation of spec v1.4 is that a and c are ordered even without b)
   | c can reach PE 1 before a (requires fence for ordering)
   V
c: j = g(x, 1)

Output : x(1) = 1, y(1) = 1, i = 1, j = 0
OpenSHMEM v1.4 : prohibited
Current proposal : allowed
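The shared-memory shape of example 2 can be sketched with plain C11 atomics (a sketch only, not OpenSHMEM API: the puts are modeled as relaxed stores and the gets as relaxed loads, and the two `atomic_thread_fence` calls stand in for the barriers that the v1.4 semantics force an implementation to insert on weakly ordered hardware). With the fences present, the outcome `j = 0` is prohibited; deleting them would allow it on a relaxed memory model.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int x, y;

/* Models PE 0: p(x,1); fence(); p(y,1) */
static void *pe0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* the fence() on line b */
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    return NULL;
}

/* Models PE 2: i = g(y,1); <ordering fence>; j = g(x,1) */
static void *pe2(void *arg) {
    while (atomic_load_explicit(&y, memory_order_relaxed) != 1)
        ;                                        /* spin until second put lands */
    atomic_thread_fence(memory_order_acquire);   /* orders the two "gets" */
    *(int *)arg = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs both "PEs" as threads; with the fences, always returns 1. */
int run_demo(void) {
    int j = 0;
    pthread_t t0, t2;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    pthread_create(&t2, NULL, pe2, &j);
    pthread_create(&t0, NULL, pe0, NULL);
    pthread_join(t0, NULL);
    pthread_join(t2, NULL);
    return j;
}
```

The release/acquire fence pair creates the happens-before edge that the proposal would otherwise require the programmer to request explicitly.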

anshumang commented 6 years ago

Related comments from @minsii -

I am still confused about how out-of-order cores can reorder blocking gets in a way that is visible to user programs, making a fence between blocking gets necessary. Below is my thought; it might be incorrect/incomplete. Could you please give a more detailed explanation?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then returns to user program

Shouldn't the mechanism in (3) ensure that (2) has already been performed and completed?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be done by the time the program loads dest.

Reading again the slides @anshumang used in WG calls, I understand that the proposal is to require fence() (a memory barrier in this case?) to order the completion of two blocking gets on the local PE (seems needed only for the AM case). But such out-of-order execution never seems to be visible to a single-threaded user program.

Now thinking about a threaded program, where load dest may be performed by another core, such unordered gets become visible to the user program. But do we always need additional cache-coherence synchronization between T0 and T1 in this case?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;
jdinan commented 6 years ago

How are you synchronizing between T0 and T1 threads in the example above and why does that synchronization not resolve the RAW consistency issue? Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM.

For more details on why attempting to define a memory model in a library is a poor choice, see: http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf

anshumang commented 6 years ago

4fbc3ae is a first attempt at updating the RMA and non-blocking RMA sections of the spec related to ordering.

jdinan commented 6 years ago

As a matter of process, you might want to limit this proposal to a minimal set of edits needed to add/update the desired semantics. Standard-ese is notoriously hard to get right. There appear to be many edits in here that are unrelated to the goal of this proposal, but will likely delay its progress. It would also be helpful if outside of the proposal, you enumerate the semantic updates that the proposal introduces to help a reader, since the semantic changes may be subtle.

jdinan commented 6 years ago

Furthermore, two successive calls to \FUNC{shmem_g} routine may deliver data out of order unless a call to \FUNC{shmem_fence} or a read of the return value is introduced between the two calls.

Is this semantic intended to target a usage model like this?

Thread 0                  Thread 1
x = shmem_g(&x, ...)
y = shmem_g(&y, ...)      while (!y) ;
                          assert(x > 0);

If so, I'm not sure overloading shmem_fence is the right solution here (or even in a non-threaded case since fence has non-local actions). I assume that the calling thread should observe the fetched results in order. I would prefer to support the above pattern only through shmem_wait as we presently do. Otherwise, users should use the threading model to resolve this.

anshumang commented 6 years ago

@jdinan Will update the PR with only the edits related to the proposed semantic change in the get APIs, as you suggest. I will go ahead and create a separate description of the semantic changes, noting that it may be dropped based on what is agreed on here.

minsii commented 6 years ago

@jdinan : I agree with your point. That is also the reason why I listed all possible use cases: to figure out whether we need to change shmem_fence to order blocking gets. My opinion is that it is unneeded.

Consistent views of memory across threads is a problem that should be addressed by the threading package, not by OpenSHMEM

anshumang commented 6 years ago

@jdinan Here is a commit with the changes restricted to get ordering.

anshumang commented 6 years ago

@jdinan @minsii I have updated example 2 in the issue description above. If independent gets are allowed to be reordered, they can reach the target to observe puts in different order. Do we allow this reordering of gets? If so, can fence between the gets be used to prevent the reordering?

jdinan commented 6 years ago

@anshumang The changes in https://github.com/anshumang/specification/commit/d31c087ed3feed7cf73f94465f159c7f0c2a630e look reasonable. One suggestion: the "destination PE" wording feels a bit odd. What about "calling PE"?

jdinan commented 6 years ago

Regarding example 2, my reading of the specification suggests that statement PE2.c must see the value written at PE0.a without any additional OpenSHMEM calls by PE 2.

anshumang commented 6 years ago

@jdinan That matches my interpretation too. But to guarantee this semantic, a memory barrier is needed in the implementation of get APIs on relaxed memory model architectures like Power, ARM and GPUs. This ticket aims to relax the requirement that PE2.c must observe PE0.a.

Expanding the scope of fence() to enforce ordering in such a case (PE2.c must observe PE0.a) is also part of the proposal (#232).

minsii commented 6 years ago

@anshumang @jdinan I think example 2 is a good justification. If I want to implement shmem_get as a CPU load via shared memory (e.g., PE1 and PE2 share memory in the example), load x and load y on PE2 can be out of order.

   PE0                  PE2
a. p(x,1, PE1);     
b. fence;     
c. p(y,1, PE1);         a. while(g(y, PE1) != 1); // while (load y != 1)
                        c. assert(g(x, PE1) == 1);  // assert (load x == 1)

Now my questions are: (1) Should the above program be supported, or do we want to just use the new signal API for this use case? (2) Is shmem_fence the appropriate API to enforce ordering of local memory operations? It is only for "remote ordering" in the current spec.

jdinan commented 6 years ago

@anshumang Before we can introduce a new semantic, I think we need a ticket to clarify the existing memory model. Once there is agreement on existing semantics, we should be able to address some of these inefficiencies.

jdinan commented 6 years ago

@minsii (1) Yes, I believe this program (assuming get/put operations) already has well-defined behavior in OpenSHMEM. (2) shmem_fence introduces a lot of overhead if all you need is a processor read fence.

shamisp commented 6 years ago

I'm a bit lost between multiple tickets discussing the same stuff...

@minsii I think earlier you asked how a "blocking" shmem_get call can get reordered? It is covered by the earlier example in #229, where 3 PEs implement synchronization using the blocking get routine. Within a shared memory space, it is equivalent to load instructions that can be executed out of order (even though the call is blocking).

@jdinan @anshumang https://github.com/anshumang/specification/commit/d31c087ed3feed7cf73f94465f159c7f0c2a630e Does "destination" mean the initiator PE? IMHO, "target" and "initiator" are much better descriptors. I'm not sure what "copied data" means.

Probably we can extend shmem_fence() to shmem_fence(op), where op is load, store, or system-wide (all possible ops).
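One way such an extension could look is sketched below. This is purely hypothetical: the `shmemx_` names, the enum, and the mapping onto C11 fences are illustrative assumptions, not spec text or any implementation's API. A shared-memory-only implementation might map the ops directly to processor fences as shown; a networked implementation would additionally have to flush or order NIC-side operations.

```c
#include <stdatomic.h>

/* Hypothetical op selector for a widened fence (names are invented). */
typedef enum {
    SHMEMX_FENCE_LOAD,    /* order loads (gets) only   */
    SHMEMX_FENCE_STORE,   /* order stores (puts) only  */
    SHMEMX_FENCE_SYSTEM   /* order all operations      */
} shmemx_fence_op_t;

/* Sketch: map each op onto the weakest sufficient C11 fence. */
static inline void shmemx_fence_op(shmemx_fence_op_t op) {
    switch (op) {
    case SHMEMX_FENCE_LOAD:
        atomic_thread_fence(memory_order_acquire);  /* read fence  */
        break;
    case SHMEMX_FENCE_STORE:
        atomic_thread_fence(memory_order_release);  /* write fence */
        break;
    case SHMEMX_FENCE_SYSTEM:
        atomic_thread_fence(memory_order_seq_cst);  /* full fence  */
        break;
    }
}
```

The attraction of an op argument is that a program that only needs to order two blocking gets could request a read fence instead of paying for the full remote-ordering semantics of today's shmem_fence.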

minsii commented 6 years ago

@shamisp Thanks for the clarification. Actually, after thinking about the example again (PE1 and PE2 are on shared memory), it looks like introducing a memory barrier on PE2 is not sufficient to make the program correct. That is, if the put operations are handled as active messages, the store operations on PE1 can also be out of order. Does the current shmem_fence include a memory barrier on PE1?

 PE0                  PE1             PE2
a. p(x,1, PE1);   a.store x=1;
b. shmem_fence;     
c. p(y,1, PE1);   c.store y=1;    a. while(g(y, PE1) != 1); // while (load y != 1)
                                  b. memory barrier
                                  c. assert(g(x, PE1) == 1);  // assert (load x == 1)
bcernohous commented 6 years ago

My thought is that shmem_fence is supposed to guarantee ordering of the puts. So if your implementation uses active messages, you'll have to 'do something' to guarantee ordering. Other thoughts?

PE0               PE1             PE2
a. p(x,1, PE1);   a. store x=1;
b. shmem_fence;   x. fence
c. p(y,1, PE1);   c. store y=1;   a. while(g(y, PE1) != 1); // while (load y != 1)
                                  b. memory barrier
                                  c. assert(g(x, PE1) == 1); // assert (load x == 1)


anshumang commented 6 years ago

@shamisp Sorry for the confusion. The discussion on #229 progressed quickly, so the comments are spread across this ticket and #229. Also removed the long prefix "Memory Model (ordering + ...)" from the ticket names to avoid confusion.

Does the destination mean initiator "PE"?

It can be both the initiator (get) and the target (put). @jdinan had similar feedback that it was confusing. Will update and write back here.

@jdinan Like you suggested, I will create a new ticket for clarifications of the spec before we can move forward on this ticket.

shamisp commented 6 years ago

@minsii @bcernohous If your library unpacks messages out of the network buffer into the application buffer, technically you would have to put a store barrier somewhere, or you can introduce some artificial dependency that will order the stores.

Practically, with a lot of interconnects, in order to process the completion queue that notifies you about message arrival, you would have to execute some sort of barrier.
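The AM-unpack case can be sketched in C11 atomics (a hypothetical runtime-internal sketch; the struct and function names are invented for illustration). The release fence before the completion-flag store is the "store barrier somewhere": without it, a reader on another core could observe `done == 1` while `dest` still holds stale data.

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical per-get state kept by an AM-based runtime. */
struct am_get_state {
    unsigned char dest[64];   /* user's destination buffer */
    atomic_int    done;       /* completion flag           */
};

/* AM handler side: copy payload out of the network buffer, then publish. */
void am_unpack_complete(struct am_get_state *st,
                        const void *netbuf, size_t len) {
    memcpy(st->dest, netbuf, len);              /* step (3): fill dest     */
    atomic_thread_fence(memory_order_release);  /* order copy before flag  */
    atomic_store_explicit(&st->done, 1, memory_order_relaxed);
}

/* Consumer side: returns 1 once dest is guaranteed valid. */
int am_get_poll(struct am_get_state *st) {
    if (atomic_load_explicit(&st->done, memory_order_relaxed) == 0)
        return 0;
    atomic_thread_fence(memory_order_acquire);  /* pairs with the release  */
    return 1;
}
```

In practice, as noted above, polling a NIC completion queue often already implies an ordering barrier, so the explicit fence may be needed only in pure shared-memory or AM transports.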

shamisp commented 6 years ago

@anshumang Semantically, we should decouple blocking semantics from ordering semantics. There is an implicit assumption there that instructions are executed in program order...

naveen-rn commented 4 years ago

I suppose this is not for OpenSHMEM-1.5.

jdinan commented 4 years ago

Yes, this will land in 1.6. IIRC, the conclusion we reached in this discussion is that blocking fetching atomics are ordered and that nonblocking variants are not. We may also want to introduce a blocking variant that is unordered.