Memory model umbrella ticket

anshumang commented 6 years ago

Summary : ordering (inside a PE) + reads from (between two PEs) = happens before (across all PEs)

Following are the items that have been discussed in RMA WG on 6/21, 7/5 and 7/19 and are still open (except one marked by ^). They are grouped below 1) under ordering, 2) reads from and 3) happens before.

1) ordering

Fetch AMOs are ordered (related: fence is not required to order fetch AMOs)
Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget) (#233)
fence also orders non-blocking get/g/iget (all fence behavior changes #232 )
Data-movement collective APIs using same psync are ordered in the order of their issue

2) reads from

All communication APIs progress without requiring quiet or barrier
Using shmem_put/iput/p to trigger shmem_wait_until is platform/implementation defined (related: shmem_atomic_set should be used to trigger shmem_wait_until)
New APIs to trigger shmem_wait_until that need to be single-copy atomic but not read-modify-write atomic
Trigger shmem_wait_until using the same type^
wait_until on remote symmetric memory
wait_until on local non-symmetric memory (related : when non-blocking fetch AMOs trigger wait_until, does it require read-modify-write-fetch to be atomic?)

3) happens before None yet

anshumang commented 6 years ago

@spotluri @jdinan @manjugv @nspark @minsii @khamidouche and others Please feel free to add if anything is missing. Plan is to create separate issues for tracking each of the items above and then create PRs for the proposed changes to the spec.

shamisp commented 6 years ago

Using shmem_wait_until in combination with AMO only is difficult one. It somewhat simple if you only looking at shared memory use case only. It is much more complicated if initiator of AMO is located in different coherency domain in respect to target running waituntil loop. The local load operation may not be atomic in respect to remote atomics.
AMO is expensive operation when it is compared regular PUT. It is slower and limited in number of outstanding operations. On the other hand waking up a remote PE with regular PUT even partial one can be perfectly fine way of notification.

anshumang commented 6 years ago

Definition of concurrency used in #204 to be resolved in this.

anshumang commented 6 years ago

Have renamed (can be improved) to distinguish from #172

minsii commented 6 years ago

Is the related statement a mistake (get/g/iget ->put/p/iput) ?

Blocking get/g/iget are not ordered (related: fence orders blocking get/g/iget)

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory ? I think only the later is useful. E.g., a user may want to do put_nbi->fence->get_nbi.

fence also orders non-blocking get/g/iget

I am not sure if I understand this topic correctly. I have two questions: (1) Does wait_until check the return buffer of nonblocking fetch AMO on the source PE ? (2) What is the read-modify-write-fetch operation and when it is needed ?

wait_until on local non-symmetric memory (related : when non-blocking fetch AMOs trigger wait_until, does it require read-modify-write-fetch to be atomic?)

anshumang commented 6 years ago

@minsii

Is the related statement a mistake (get/g/iget ->put/p/iput) ? No, the proposal is to make g/get/iget unordered as it requires memory fences on some relaxed architectures (if the data from get is used, ordering is enforced by the compiler/architecture)

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory ? The original context for this was a comment from @nspark on the draft that fence ordering blocking and non-blocking put but only blocking get may be non-intuitive. A follow up question - why is the ordering of the local buffer update (for non-blocking get) not useful? Is it not a requirement for message passing to work?

I have two questions: (1) Does wait_until check the return buffer of nonblocking fetch AMO on the source PE ? (2) What is the read-modify-write-fetch operation and when it is needed ? (1) Yes, this was suggested by @jdinan in the discussion on the mailing list. (2) This is related to your comment in the same thread - Will it increases the overhead of fetching AMOs if we need atomicity guarantee ? E.g., is there any network that supports atomicity for the returning data transfer of AMO ? Is my understanding correct?

bcernohous commented 6 years ago

fence also orders non-blocking get/g/igetÂ

Sorry, I admit to not following this discussion closely enough,

...'orders', not 'completes' ...

get_nbi() fence() get_nbi()

so I'm guaranteed that the second get will be ordered after the first get?

PE 0 PE 1 PE n

put(data, pe1) fence(); put(signal=1, pe1)

                                        get_nbi(signal, pe1)
                                        fence()
                                        get_nbi(data, pe1)

If gets signal=1 then data is valid (from PE 0) since it was ‘signalled’

From: Anshuman Goswami [mailto:notifications@github.com] Sent: Tuesday, July 24, 2018 2:27 PM To: openshmem-org/specification specification@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [openshmem-org/specification] Memory Model (ordering + reads from = happens before) (#229)

@minsiihttps://github.com/minsii

Is the related statement a mistake (get/g/iget ->put/p/iput) ? No, the proposal is to make g/get/iget unordered as it requires memory fences on some relaxed architectures (if the data from get is used, ordering is enforced by the compiler/architecture)

Do you want to order the delivery of data returning to the local buffer, or order the read access of the remote memory ? The original context for this was a comment from @nsparkhttps://github.com/nspark on the draft that fence ordering blocking and non-blocking put but only blocking get may be non-intuitive. A follow up question - why is the ordering of the local buffer update (for non-blocking get) not useful? Is it not a requirement for message passing to work?

I have two questions: (1) Does wait_until check the return buffer of nonblocking fetch AMO on the source PE ? (2) What is the read-modify-write-fetch operation and when it is needed ? (1) Yes, this was suggested by @jdinanhttps://github.com/jdinan in the discussion on the mailing list. (2) This is related to your comment in the same thread - Will it increases the overhead of fetching AMOs if we need atomicity guarantee ? E.g., is there any network that supports atomicity for the returning data transfer of AMO ? Is my understanding correct?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/openshmem-org/specification/issues/229#issuecomment-407523643, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AH7d-ha-jMQ3K9K5FoAH-kk50iXTkPGVks5uJ3T8gaJpZM4VXMLn.

anshumang commented 6 years ago

Thanks @bcernohous for the example. If I could use it to clarify my earlier comment : fence orders get_nbi implies that the local update is ordered. @minsii comments?

minsii commented 6 years ago

@bcernohous: The example seems a little problematic to me. How do you guarantee that get_nbi(signal, pe1) reads signal on PE1 after the update of put(signal=1, pe1) ? Do you have to add another synchronization between PE0 and PEn ? E.g., PE n must issue the get_nbi operations after completion of PE 0's put(signal=1, pe1).

PE 0                 PE 1                   PE n

put(data, pe1)
fence();
put(signal=1, pe1)

                                            get_nbi(signal, pe1)
                                            fence()
                                            get_nbi(data, pe1)

bcernohous commented 6 years ago

My email was an example with questions 😊

so I'm guaranteed that the second get will be ordered after the first get?

If gets signal=1 then data is valid (from PE 0) since it was ‘signalled’

And I guess the answer is yes to both?

From: Anshuman Goswami [mailto:notifications@github.com] Sent: Tuesday, July 24, 2018 4:28 PM To: openshmem-org/specification specification@noreply.github.com Cc: Bob Cernohous bcernohous@cray.com; Mention mention@noreply.github.com Subject: Re: [openshmem-org/specification] Memory Model (ordering + reads from = happens before) (#229)

Thanks @bcernohoushttps://github.com/bcernohous for the example. If I could use it to clarify my earlier comment : fence orders get_nbi implies that the local update is ordered. @minsiihttps://github.com/minsii comments?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/openshmem-org/specification/issues/229#issuecomment-407558172, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AH7d-iYAWXscz4GaZaP7B9taRbrw-jBsks5uJ5F6gaJpZM4VXMLn.

bcernohous commented 6 years ago

I was questioning if that was how my example was supposed to work.

How do you guarantee that get_nbi(signal, pe1) reads signal on PE1 after the update of put(signal=1, pe1) ?

I don’t. It was an ordering question. If pe gets signal=1 , then the data is from PE 0. PE could get signal = ? (in my poor example) and there would be nothing else you could assert about the ordering.

As I said, I haven’t followed this discussion closely enough and was surprised that fence orders get_nbi, and I’m trying to understand it too.

From: Min Si [mailto:notifications@github.com] Sent: Tuesday, July 24, 2018 4:43 PM To: openshmem-org/specification specification@noreply.github.com Cc: Bob Cernohous bcernohous@cray.com; Mention mention@noreply.github.com Subject: Re: [openshmem-org/specification] Memory Model (ordering + reads from = happens before) (#229)

@bcernohoushttps://github.com/bcernohous: The example seems a little problematic to me. How do you guarantee that get_nbi(signal, pe1) reads signal on PE1 after the update of put(signal=1, pe1) ? Do you have to add another synchronization between PE0 and PEn ? E.g., PE n must issue the get_nbi operations after completion of PE 0's put(signal=1, pe1).

PE 0 PE 1 PE n

put(data, pe1)

fence();

put(signal=1, pe1)

                                        get_nbi(signal, pe1)

                                        fence()

                                        get_nbi(data, pe1)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/openshmem-org/specification/issues/229#issuecomment-407561756, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AH7d-o4YdP3yCK923O5WWeJdnJ8Xhvteks5uJ5TUgaJpZM4VXMLn.

minsii commented 6 years ago

@anshumang :

I am a little confused. If I understand correctly, g/get/iget is already unordered in current semantics. Do you propose to only clarify the ordering semantics (no semantics change) ?

No, the proposal is to make g/get/iget unordered as it requires memory fences on some relaxed architectures (if the data from get is used, ordering is enforced by the compiler/architecture)
I do not expect any network hardware supports the atomicity of the entire processing of read-modify-write(on remote object) + write(on local return buffer). Actually, if we want to allow wait_until+nonblocking fetch AMO combination, we need additional atomicity on the local process between write(on local return buffer) in nbi fetch AMO and read(from local return buffer) in wait_until.

(2) This is related to your comment in the same thread - Will it increases the overhead of fetching AMOs if we need atomicity guarantee ? E.g., is there any network that supports atomicity for the returning data transfer of AMO ? Is my understanding correct?

anshumang commented 6 years ago

@minsii From 9.5.4 in spec v1.4 description of shmem_get - The routines return after the data has been delivered to the dest array on the local PE. Does this not imply that shmem_get is ordered with respect to other operations?

anshumang commented 6 years ago

@bcernohous fence orders non-blocking get came up in context of the new requirement that fence would order blocking gets (which are now unordered). I think it could be helpful from a user's perspective to be able to order blocking and non-blocking gets the same way. Is there a fundamental performance issue to guarantee this?

minsii commented 6 years ago

@anshumang: The blocking get/g/iget must be completed at the return of the routine. In that sense, the ordering between two blocking get operations are always ordered on the local PE. However, it is irrelevant to the shmem_fence semantics. For instance, there is no ordering between a blocking put and a blocking get. In the following example, the update of x by put might be delivered on PE1 after the return of get. shmem_fence does not help.

shmem_put(x, PE1); /* local completion at return */
shmem_get(x, PE1);

anshumang commented 6 years ago

@minsii

In that sense, the ordering between two blocking get operations are always ordered on the local PE.

The proposal is to relax this requirement.

shamisp commented 6 years ago

@minsii out-of-order core may executer in-depended loads (a.k.a. shmem "blocking" get ) out of order.

shamisp commented 6 years ago

@anshumang What is really surprising is that Cray T3D ("father" of openshmem) was using Alpha, which is out-of-order core. I just cannot imagine ordered loads on this platform. Original spec also had explicit ops for cache management. So my guess the ordering was happing through the cache invalidation routines. Otherwise you can complete the load from the local memory regardless what other side "put" there.

Looking at the original manual I only see barrier, no shmem_fence or shmem_quite operations. My guess these two were introduced post 1994.

anshumang commented 6 years ago

Thanks for the comments @shamisp Can you please add the pointer to the original spec?

shamisp commented 6 years ago

https://www.cs.cmu.edu/afs/cs/project/cmcl/link.iwarp/OldFiles/archive/fx-papers/cri-shmem-users-guide.ps

minsii commented 6 years ago

@shamisp I am still confused how the out-of-order cores can reorder blocking gets and be visible to user programs, and fence between blocking gets becomes necessary. Below is my thought, it might be incorrect/incomplete. Could you please give more detailed explanation ?

For network-offloaded get:

shmem_get(dest, P1)
  -- (1) CPU issues network read to P1
  -- (2) network transfers data from remote P1 to local dest buffer
  -- (3) CPU confirms local completion of (2) and then return to user program

Should the mechanism of (3) ensures that (2) has already been performed and completed ?

For active-message based get:

shmem_get(dest, P1)
  -- (1) CPU issues read-request packet to P1
  -- (2) CPU waits till received ack from P1
  -- (3) CPU copies data into local dest buffer
  -- (4) return to user program
load dest;

I could imagine out-of-order execution of (3) and (4) in the AM-based case, but (3) must be done when program loads dest.

Reading again the slides @anshumang used in WG calls, I understood that the proposal is to require fence() (memory barrier in this case ?) to order the completion of two blocking gets on local PE (seems needed only for AM-case). But such out-of-order seems never visible to single-threaded user program.

Now thinking about the threaded program, where load dest maybe performed by another core, thus such unordered gets becomes visible to user program. But do we always need additional cache coherence synchronization between T0 and T1 in this case ?

T0                                       T1
shmem_get(dest1);                    
shmem_get(dest2);
                                         load dest2;
                                         load dest1;

anshumang commented 6 years ago

Thanks for the code examples @minsii I have created issue #233 for tracking ordering of gets. Maybe, we can continue the discussion there? I have copied your example under #233

anshumang commented 6 years ago

Slides discussed in OpenSHMEM 2018 F2F

anshumang commented 6 years ago

References from the MPI RMA memory model and a generalized RMA memory model (coreRMA).

anshumang commented 5 years ago

Keynote by Will Deacon from OpenSHMEM Workshop 2018

openshmem-org / specification

Memory model umbrella ticket #229