MPI 3.0 Errata: MPI RMA synchronization for MPI shared memory

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-13 17:44:34 -0500

Ticket-Overview

-If voted in, this ticket will replace all tickets mentioned below, i.e. #437, #413, #436, #435, #429, #434.*

Related errata tickets on the definition of MPI shared memory windows:

437 - a first attempt to bring all shared memory errata togeter.
413 - Add example for use of shared memory [[BR]]

Passed. This example shows the needed memory fences with MPI_WIN_SYNC when using only local and remote load/store, i.e., programming an application data exchange without any RMA communication calls.
436 - MPI 3.0 Errata: MPI Unified window clarification [[BR]]

The missing binding text for the example in #413 (with status "passed").
435 - MPI 3.0 Errata: MPI Active Target synchronization requirements for shared memory [[BR]]

This ticket #437 is intended to replace #435.
429 - Clarification on assertions on shared memory windows [[BR]]

Need to be discussed after this ticket.
434 - MPI 3.0 Errata: MPI Active Target synchronization requirements [[BR]]

technically obsolete - can be ignored for this discussion. and reviewed and published papers:
Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur: [[BR]] "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared Memory" [[BR]] at EuroMPI 2012.
Description

The MPI-3.0 standard says the following on page 441, lines 44-47.

"A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier."

That is, a fence is not necessarily a barrier synchronization.
For post-start-complete-wait, there is no specified requirement that the post and start calls need to synchronize, but p442:31-33 says:

"MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed, but is not required to."

MPI-3.0 p441:34-35 defines

"RMA operations on win started by a process after the fence call returns will access their target window only after MPI_WIN_FENCE has been called by the target process."

If a remote load/store on shared memory is not treated as an RMA operation, the fence will not synchronize a sender process issuing local store before fence and a receiver process issuing a remote load to the same memory location after the fence.

MPI3.0 p442:28-33 defines MPI_Win_start:

"Starts an RMA access epoch for win. RMA calls issued on win during this epoch must access only windows at processes in group. Each process in group must issue a matching call to MPI_WIN_POST. RMA accesses to each target window will be delayed, if necessary, until the target process executed the matching call to MPI_WIN_POST."

If a remote load/store on shared memory is not treated as an RMA operation, then remote load/store are not valid between MPI_Win_start and MPI_Win_complete, and the post-start syncronization will not synchronize a sender process issuing a local store before post and receiver process issuing a remote load to the same memory Location after the start operation.

MPI-3.0 p453:44 - p454.3 rules 2 and 3:

"2. If an RMA operation is completed at the origin by a call to MPI_WIN_FENCE then the operation is completed at the target by the matching call to MPI_WIN_FENCE by the target process."

"3. If an RMA operation is completed at the origin by a call to MPI_WIN_COMPLETE then the operation is completed at the target by the matching call to MPI_WIN_WAIT by the target process."

If a remote load/store on shared memory is not treated as an RMA operation, then a remote store before fence or complete at a sender process will not be synchronized with a local load after the matching fence or wait at the receiver process.

Such synchronizing behavior of remote and local load/stores on shared Memory windows was expected in the paper published at EuroMPI 2012 by several members of the RMA WG: Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur: "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared Memory".

There are two options to fix this problem:

A) To define that remote load and store on a shared memory window is treated as an RMA operation.[[BR]] This would imply that all one-sided sync primitives must explicit synchronize.

B) To define that a remote and a local store or load are treated not as RMA Operation and to explicitely define additional process-synchronization behavior of some one-sided sync routines.[[BR]]

The proposed solution below is based on option B. It defines additional shared memory rules for fence, post-start-complete-wait, lock/unlock, and win_sync for combining process-to-process synchronization with load and stores by several processes on the same memory location within MPI shared memory windows.

Extended Scope

MPI-3.0 errata.

History

Detected by Hubert Ritzdorf.

Proposed Solution (as MPI-3.0 errata)

'''MPI-3.0 Sect.11.2.3 on MPI_WIN_ALLOCATE_SHARED, page 409, lines 13-22 reads

This is a collective call executed by all processes in the group of comm. On each process i, it allocates memory of at least size bytes that is shared among all processes in comm, and returns a pointer to the locally allocated segment in baseptr that can be used for load/store accesses on the calling process. The locally allocated memory can be the target of load/store accesses by remote processes; the base pointers for other processes can be queried using the function MPI_WIN_SHARED_QUERY. The call also returns a window object that can be used by all processes in comm to perform RMA operations. The size argument may be different at each process and size = 0 is valid. It is the user's responsibility to ensure that the communicator comm represents a group of processes that can create a shared memory segment that can be accessed by all processes in the group.

-but should read*

This is a collective call executed by all processes in the group of comm. On each process i, it allocates memory of at least size bytes that is shared among all processes in comm, and returns a pointer to the locally allocated segment in baseptr that can be used for load/store accesses on the calling process. The locally allocated memory can be the target of load/store accesses by remote processes. The rules for RMA operations do not apply to these remote load/store accesses; additional rules apply, see, e.g., Section 11.5.4A and 11.5.5 on pages 451 and 451.
The base pointers for other processes can be queried using the function MPI_WIN_SHARED_QUERY. The call also returns a window object that can be used by all processes in comm to perform RMA operations. The size argument may be different at each process and size = 0 is valid. It is the user's responsibility to ensure that the communicator comm represents a group of processes that can create a shared memory segment that can be accessed by all processes in the group.

-Remark:* New Section 11.5.4A is defined below.

-MPI-3.0 same section, page 410 lines 15-21 read*

The consistency of load/store accesses from/to the shared memory as observed by the user program depends on the architecture. A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization functions (see Section 11.5) or explicitly completing outstanding store accesses (e.g., by calling MPI_WIN_FLUSH). MPI does not define semantics for accessing shared memory windows in the separate memory model.

-but should read*

The consistency of load/store accesses from/to the shared memory as observed by the user program depends on the architecture. A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization or flush functions (see Section 11.5.4A)~~ or explicitly completing outstanding store accesses (e.g., by calling MPI_WIN_FLUSH)~~. MPI does not define semantics for accessing shared memory windows in the separate memory model.

-MPI-3.0 Section 11.4 Memory Model, page 436, lines 37-48 read*

In the RMA unified model, public and private copies are identical and updates via put or accumulate calls are eventually observed by load operations without additional RMA calls. A store access to a window is eventually visible to remote get or accumulate calls without additional RMA calls. These stronger semantics of the RMA unified model allow the user to omit some synchronization calls and potentially improve performance.

-Advice to users.* If accesses in the RMA unified model are not synchronized (with
locks or flushes, see Section 11.5.3), load and store operations might observe changes
to the memory while they are in progress. The order in which data is written is not
specified unless further synchronization is used. This might lead to inconsistent views
on memory and programs that assume that a transfer is complete by only checking
parts of the message are erroneous. *(End of advice to users.)*

-but should read*

In the RMA unified model, public and private copies are identical and remote updates via put or accumulate calls or stores by other processes (in the case of windows allocated with MPI_WIN_ALLOCATE_SHARED) are eventually observed by load operations without additional RMA calls. A store access to a window is eventually visible to remote get or accumulate calls or loads by other processes without additional RMA calls. These stronger semantics of the RMA unified model allow the user to omit some synchronization calls and potentially improve performance. The visibility of loads and stores in several processes may not be at the same time and not in the same sequence, see Section 11.5.4A on page 451.

-Advice to users.* If accesses in the RMA unified model are not synchronized (with
locks or flushes, see Section 11.5.3), load and store operations might observe changes
to the memory while they are in progress. The order in which data is written is not
specified unless further synchronization is used. This might lead to inconsistent views
on memory and programs that assume that a transfer is complete by only checking
parts of the message are erroneous.
__The only consistent view is that each process will always
perceive its own memory accesses as occurring in program order.__
-(End of advice to users.)*

__*Rationale.*__ If two processes on a ccNUMA node access the same 
memory location with an intermediate process-to-process
synchronization then the outcome is still undefined
because the first memory operation need not be 
finished before the second memory operation starts.
A well-defined execution may require the following sequence: [[BR]]
` `1. the memory access on the first process; [[BR]]
` `2. a local memory barrier to guarantee that the memory
operation is finished; [[BR]]
` `3. the process-to-process synchronization (e.g. send/recv)
to inform the second process; [[BR]] 
` `4. a local memory barrier on the second process to guarantee
that subsequent memory operations are issued to the memory; [[BR]]
` `5. the memory operation on the second process. [[BR]]
To guarantee the outcome of memory accesses in combination
with the synchronization as described in Section 11.5.4A on
page 451, the synchronization operation
will issue such local memory barrier if needed. See [45A], Chapter 14.
__*End of rationale.*__

-Remark:* ''I do not describe "local" and "remote" loads/stores. I describe loads and stores "from different processes". Reason: the memory is not located at one process; all processes have same relation to a shared memory in the unified model, independent of the portion that was defined by some process in a window.''

-2nd Remark:* New Section 11.5.4A is defined below. New reference [45A], see bibliography below.

-'MPI-3.0 page 451, before Section 11.5.5 Assertions, the following new section shhould be added:*

-11.5.4A Shared Memory Synchronization*

In the case of an MPI shared memory window (i.e., allocated with MPI_WIN_ALLOCATE_SHARED) additional rules apply for synchronizing load and store accesses from several processes to the same location. A location must be at least a byte, with the exception that bit-fields are not supported.

-Rationale.* For adjacent variables or struct members
that are not bitfields, the C11/C++11 specification requires
an implementation to refrain from generating stores that
cross into the memory range of another variable or
struct member. *(End of rationale.)*

In the following patterns, locations in a shared memory window are noted with variable names A, B, C, loads from such windows are noted with load(...) and stores are noted by assignments to these variables.

Patterns with active target communication and MPI_Win_sync:

   process P0          process P1

   A=val_1
   Sync-to-P1      --> Sync-from-P0
                       load(A)

   load(B)
   Sync-to-P1      --> Sync-from-P0
                       B=val_2

   C=val_3
   Sync-to-P1      --> Sync-from-P0
                       C=val_4
                       load (C)

with the notation
   "Sync-to-P1      --> Sync-from-P0"
can be any of the following synchronization patterns:

1. MPI_Win_fence   --> MPI_Win_fence   1)
2. MPI_Win_post    --> MPI_Win_start   2)
3. MPI_Win_complete--> MPI_Win_wait    3)
4. MPI_Win_sync
   Any-process-sync-
    -from-P0-to-P1 --> Any-process-sync-
                        -from-P0-to-P1 4)
                       MPI_Win_sync

Here, 3 access patterns are combined with 4 synchronization patterns generating 12 patterns in total.

Footnotes: 1) MPI_Win_fence synchronizes in both directions and between every process in the process group of the window. [[BR]] 2) The arrow means that P1 is in the origin group passed to MPI_Win_post in P0, and that P0 is in the target group passed to MPI_Win_start. [[BR]] Additional calls to MPI_Win_complete (in P1 after MPI_Win_start) and MPI_Win_wait (in P0 after MPI_Win_post) are needed. The location of these calls do not influence the guaranteed outcome rules. [[BR]] 3) The arrow means that P1 is in the target group passed to MPI_Win_start that corresponds to MPI_Win_complete in P0, and P0 is in the origin group passed to MPI_Win_post that corresponds to MPI_Win_wait in P1. Additional calls to MPI_Win_start (in P0 before MPI_Win_complete) and MPI_Win_post (in P1 before MPI_Win_wait) are need. The location of these calls do not influence the guaranteed outcome rules. [[BR]] 4) "Any-process-sync" may be done with methods from MPI (e.g. with send-->recv as in Example 11.13, but also with some synchronization through MPI shared memory loads and stores as in Example 11.14). The requirements for using MPI_Win_sync (e.g. within a passive target epoch which may be provided with MPI_Win_lock_all) are not shown in this pattern. Examples with MPI_Win_lock_all are provided in Examples 11.13 and 11.14 on page 461.

Patterns with lock/unlock synchronization:

Within passive target communication, two locks L1 and L2 may be scheduled as L1 released before L2 granted or -L2 released before L1 granted* in the case of two locks with at least one exclusive, or in the case of any lock with an additional synchronization (e.g., with point-to-point or collective communication) in between. In the following patterns, the arrow means that the lock in P0 was released before the lock in P1 was granted, independent of the method how this schedule is achieved.

   process P0          process P1

   MPI_Win_lock
   A=val_1
   MPI_Win_unlock  --> MPI_Win_lock 
                       load(A)
                       MPI_Win_unlock

   MPI_Win_lock
   load(B)
   MPI_Win_unlock  --> MPI_Win_lock
                       B=val_2
                       MPI_Win_unlock

   MPI_Win_lock
   C=val_3
   MPI_Win_unlock  --> MPI_Win_lock
                       C=val_4
                       load (C)
                       MPI_Win_unlock

Note that each rank of a window is connected to a separate lock. In a shared window, these locks are not connected to specific memory portions of the shared window, i.e., each lock can be used to protect any portion of a shared memory window.

In both pattern groups above, the active target communication and MPI_WIN_SYNC patterns, and the lock/unlock patterns, it is guaranteed

that the load(A) in P1 loads val_1 (this is the write-read-rule),
that the load(B) in P0 is not affected by the store of val_2 in P1 (read-write-rule), and
that the load(C) in P1 loads val_4 (write-write-rule).

This section adds additional rules about loads and stores by different processes to the same memory location in shared memory windows in combination with most synchronization calls, but not with MPI_WIN_FLUSH(_LOCAL)(_ALL) calls; this may be added in a future version of MPI.

For a correct compilation of MPI shared memory accesses, it is required that the compiler provides the memory model defined in C11/C++11. The companion Fortran compiler must provide at least the same memory model.

-Rationale.* Compilers that do not provide the memory model
of C11/C++11 may cause invalid execution of MPI shared memory
accesses because optimizations may cause the problems decribed
in [4A] Sections 4.2 and 4.3. *(End of rationale.)*

-Remark 1:* I do not fully understand which rules should be guaranteed for

_MPI_Win_flush (and MPI_Win_flushall)
_MPI_Win_flush_local (and MPI_Win_flush_localall) _I let this open for future versions of MPI. I expect that the classical Fence, Post-Start-Complete-Wait, and Lock-Unlock and the most important MPI_Winsync are enough for the moment.

-Remark 2:* ''It is not clear for me whether every hardware is able to store a byte with only one operation, this means, whether my proposed sentence "A location must be at least a byte" is okay or not.''

-Remark:* ''About Sect. 11.5.5 Assertions - MPI_MODE_NOSTORE:

I expect that the proposal in #429 is not helpful: it enlarges the hint about "not updated by stores since last synchronization" from local stores to all stores by the whole process group. The reason for this hint is to prevent "cache synchronization".

This cache synchronization is local and therefore the remote stores do not count.

The new proposal is simple:''

-_MPI-3.0 Section 11.5.5 Assertions on MPI_WINPOST, page 452 line 1-3 read*

MPI_MODE_NOSTORE -- the local window was not updated by stores (or local get
or receive calls) since last synchronization. This may avoid the need for cache
synchronization at the post call.

-but should read*

MPI_MODE_NOSTORE -- the local window was not updated by __local__ stores (or local get
or receive calls) since last synchronization. This may avoid the need for cache
synchronization at the post call.
__In the case of a shared memory window (i.e., allocated 
with MPI_WIN_ALLOCATE_SHARED), such local stores 
can be issued to any portion of the shared memory window.__

-_MPI-3.0 Section 11.5.5 Assertions on MPI_WINFENCE, page 452 line 9-10 read*

MPI_MODE_NOSTORE -- the local window was not updated by stores (or local get
or receive calls) since last synchronization.

-but should read*

MPI_MODE_NOSTORE -- the local window was not updated by __local__ stores (or local get
or receive calls) since last synchronization.
__In the case of a shared memory window (i.e., allocated 
with MPI_WIN_ALLOCATE_SHARED), such local stores 
can be issued to any portion of the shared memory window.__

-Reason for the added sentence:* ''Nobody should think that "local" means "only to the window portion that was defined in the local MPI_WIN_ALLOCATE_SHARED".''

-The passed ticket #413 added the following advice in MPI-3.0, Sect. 11.7, page 457, after line 3:*

-Advice to users. In the unified memory model, in the case where the window is in shared memory, MPI_WIN_SYNC can be used to order store operations and make store updates to the window visible to other processes and threads. Use of this routine is necessary to ensure portable behavior when point-to-point, collective, or shared memory synchronization is used in place of an RMA synchronization routine.
MPI_WIN_SYNC should be called by the writer before the non-RMA synchronization operation and by the reader after the non-RMA synchronization, as shown in Example 11.13 on page .... -(End of advice to users.)

-This advice is withdrawn because it is superseded by the new Section 11.5.4A.*

'''The passed ticket #413 added the following example at the end of MPI-3.0 Sect. 11.7, i.e., after page 461 line 20. This example correctly reflects the write-read-rule. It is modified to also correctly reflect the read-write-rule that applies from one iteration to the next.'''

-Example 11.13* The following example demonstrates the proper synchronization in the unified memory model when a data transfer is implemented with load and store in the case of windows in shared memory (instead of MPI_PUT or MPI_GET) and the synchronization between processes is performed using point-to-point communication. The synchronization between processes must be supplemented with a memory synchronization through calls to MPI_WIN_SYNC, which act locally as a processor-memory barrier.
In Fortran, if MPI_ASYNC_PROTECTS_NONBLOCKING is .FALSE. or the variable X is not declared as ASYNCHRONOUS, reordering of the accesses to the variable X must be prevented with MPI_F_SYNC_REG operations. (No equivalent function is needed in C.)

The variable X is contained within a shared memory window and X corresponds to the same memory location at both processes. The MPI_WIN_SYNC operation performed by process A ensures completion of the load/store operations issued by process A. The MPI_WIN_SYNC operation performed by process B ensures that process A's updates to X are visible to process B (write-read-rule). The second pair of MPI_WIN_SYNC is needed due to the read-write-rule that applies from the load(X) within the print X in one iteration to the X=... assignment in the next iteration of the loop.

  Process A               Process B

  MPI_WIN_LOCK_ALL(       MPI_WIN_LOCK_ALL(
  MPI_MODE_NOCHECK,win)   MPI_MODE_NOCHECK,win) 

  DO ...                  DO ...
   X=...

   MPI_F_SYNC_REG(X)
   MPI_WIN_SYNC(win)    
   MPI_SEND                MPI_RECV
                           MPI_WIN_SYNC(win)
                           MPI_F_SYNC_REG(X)

                           print X

                           MPI_F_SYNC_REG(X)
                           MPI_WIN_SYNC(win)
   MPI_RECV                MPI_SEND
   MPI_WIN_SYNC(win)
   MPI_F_SYNC_REG(X)                       
  END DO                  END DO

  MPI_WIN_UNLOCK_ALL(win) MPI_WIN_UNLOCK_ALL(win)

-After the example above, the following new example should be also added:*

-Example 11.14* The following example demonstrates the pairing of MPI_WIN_SYNC (i.e., of memory barriers) as in Example 11.13, but without further calls to MPI synchronization or communication routines. Variables A and X are within one or two shared memory windows. They are initialized with zero and this value is already visible in both processes.

  Process A                  Process B

  MPI_WIN_LOCK_ALL(          MPI_WIN_LOCK_ALL(
  MPI_MODE_NOCHECK,win_A)    MPI_MODE_NOCHECK,win_A) 

  A=val_1

  MPI_F_SYNC_REG(A)
  MPI_WIN_SYNC(win_A)    
  X=1
  MPI_F_SYNC_REG(X)          MPI_F_SYNC_REG(X)
                             WHILE (.NOT.(load(X) == 1)) DO
                               ! MPI_WIN_SYNC(win_B) is not needed
                               MPI_F_SYNC_REG(X)
                             END DO
                             MPI_WIN_SYNC(win_A)
                             MPI_F_SYNC_REG(A)

                             load(A)

  MPI_WIN_UNLOCK_ALL(win_A)  MPI_WIN_UNLOCK_ALL(win_A)

The load(A) in process B must return val_1, because two rules apply together: (1) the write-read-rule in Section 11.5.4A and that the store X=1 in process A must be eventually visible in Process B without further RMA calls. The store and load of X is used as the synchronization required for the write-read-rule. The read-write and write-write patterns of Section 11.5.4A can be implemented in the same way.

Note that in the programming language C/C++, the load(X) access and the store access X=1 must be imlemented as a volatile access, e.g., with

  int load(int *X) { return *(volatile int*)X; }
  and using (load(&X)==1)

Note that in Fortran, A and X should be declared as ASYNCHRONOUS, and the MPI_F_SYNC_REG(v) with v being A and X should normally be substituted by

  IF (.NOT.MPI_ASYNC_PROTECTS_NONBLOCKING) &
  & CALL MPI_F_SYNC_REG(v)

Within the WHILE loop, it is necessary to at least use unconditionally MPI_F_SYNC_REG or to declare X within "load(X)" as volatile.

-MPI-3.0 Bibliography, add after [4]:*

[4A] Hans-J. Boehm. Threads Cannot be Implemented as a Library. HP Laboratories Palo Alto, report HPL-2004-2092004, 2004.[[BR]] http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf

-MPI-3.0 Bibliography, add after [45]:*

[45A] Paul E. McKenney (ed.). Is Parallel Programming Hard, And, If So, What Can You Do About It? First Edition, Linux Technology Center, IBM Beaverton, March 10, 2014. [[BR]] http://kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook-e1.pdf

Impact on Implementations

Some implementations (e.g., MPICH) will need to add an extra synchronization to enable correct behavior as intended above.

Impact on Applications / Users

Correct MPI programs would have gotten incorrect results in the past.
They would now function correctly.

Entry for the Change Log

Section 11.5.4A on page 451. [[br]] New synchronization rules added for shared memory windows.

mpiforumbot commented 8 years ago

Originally by jhammond on 2014-08-27 14:03:07 -0500

Cleanup formatting of quotation from standard.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-28 12:40:32 -0500

Thank you to Jed Brown, Jeff Hammond and Dave goodell for there comments in individual emails. I updated the ticket according to their comments.

mpiforumbot commented 8 years ago

Originally by jhammond on 2014-08-28 12:56:58 -0500

What does "Sync-to-Pn" mean? MPI_WIN_SYNC takes a window as the argument and it has the effect of reconciling the public and private window at a given MPI process.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-28 13:06:35 -0500

"Sync-to-P1 --> Sync-from-P0" is an abbrevation for each of the 4 cases shown at the end of the box. I made this to reduce the writing of 3 x 4 cases to the 3+4 essentials.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-28 15:35:26 -0500

I substituted the last paragraph of new 11.5.4A

Further synchronization patterns for loads and stores by different processes to the same memory location in a shared memory window, e.g., with MPI_WIN_FLUSH, MPI_WIN_FLUSH_ALL, MPI_WIN_FLUSH_LOCAL and MPI_WIN_FLUSH_LOCAL_ALL, are not defined, but may be defined in a future version of this standard.

by

This section adds additional rules about loads and stores by different processes to the same memory location in shared memory windows in combination with most synchronization calls, but not with MPI_WIN_FLUSH(_LOCAL)(_ALL) calls; this may be added in a future version of MPI.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-29 11:03:15 -0500

Small textual corrections.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-08-30 10:35:33 -0500

I updated the text on lock/unlock Patterns based on Jeff Hammond's examples https://github.com/jeffhammond/HPCInfo/blob/master/mpi/rma/shared-memory-windows/win_lock_shared.c and https://github.com/jeffhammond/HPCInfo/blob/master/mpi/rma/shared-memory-windows/win_lock_exclusive.c

mpiforumbot commented 8 years ago

Originally by rsthakur on 2014-09-02 04:34:32 -0500

Instead of saying:

"For post-start-complete-wait, there is no specified requirement that the post and start calls need to synchronize."

it is worth pointing out:

pg 442, ln 31-33: "MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed, but is not required to."

mpiforumbot commented 8 years ago

Originally by gropp on 2014-09-03 06:23:29 -0500

While it is useful to consider all of the related changes (and this is a good start toward a chapter-based updates), I'd like to consider carefully which updates are necessary and which are optional. Errata about the behavior of MPI routines is a necessary update. Examples of the use of shared memory are not necessary (by definition, examples are not binding on the standard). My own preference is to limit the MPI specification to the behavior of the MPI routines; this might, for example, specify that the relevant MPI RMA synchronization routines, when applied to a window created with MPI_Win_allocate_shared, include the effect of a memory barrier (and this will need to be carefully defined). The behavior of code not written with MPI should not be defined by the MPI standard.

An advice to users could direct them to other resources about shared memory programming, including the papers that explain why library-based approaches are risky and fragile. I believe that going into much more detail is inappropriate in the standard (it is not a user manual; it is a description of the standard).

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-09-06 09:41:59 -0500

Replying to rsthakur:

Instead of saying:

"For post-start-complete-wait, there is no specified requirement that the post and start calls need to synchronize."

it is worth pointing out:

pg 442, ln 31-33: "MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed, but is not required to."

I modified the description section to include your cited text. The solution is not modified.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-09-06 10:10:26 -0500

Replying to gropp:

While it is useful to consider all of the related changes (and this is a good start toward a chapter-based updates), I'd like to consider carefully which updates are necessary and which are optional.

For me it is important that the correction is complete, i.e., contains all what would be in the MPI-Standard if the shared memory amendment in MPI-3.0 would have been carefully and complete.

Errata about the behavior of MPI routines is a necessary update. Examples of the use of shared memory are not necessary (by definition, examples are not binding on the standard).

I limited the examples to those that are necessary to understand what are the consequences of such interface definition. The first example was discussed in Detail with the RMA Group and I learnt now that it was still incorrect. Therefore the modifications.

The second example is also needed because all papers on shared memory define the behavior of memory barriers through ist outcome for paired memory barriers.

The solution has to define the outcome of many synchronization routines. As discussed in the telecon, I did not define "a memory barrier is..." and the MPI... synchronzation routine must add such a memory barrier at the beginning/end and some synchronization ... . I clearly defined the outcome of load/store in combination with the synchronization routine.

My own preference is to limit the MPI specification to the behavior of the MPI routines; this might, for example, specify that the relevant MPI RMA synchronization routines, when applied to a window created with MPI_Win_allocate_shared, include the effect of a memory barrier (and this will need to be carefully defined).

Yes, this careful definition is the new section 11.5.4A.

The behavior of code not written with MPI should not be defined by the MPI standard.

I expect, you reference to my sentence

"4) "Any-process-sync" may be done with methods from MPI (e.g. send-->recv) or with other methods."

Yes, I should change this into

'''4) "Any-process-sync" may be done with methods from MPI (e.g. send-->recv as in Example 11.13, but also some synchronization through MPI shared memory loads and stores as in Example 11.14).'''

I'll add this to the solution.

An advice to users could direct them to other resources about shared memory programming, including the papers that explain why library-based approaches are risky and fragile. I believe that going into much more detail is inappropriate in the standard (it is not a user manual; it is a description of the standard).

I'll check for you references in the emails.

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-09-29 04:28:28 -0500

In an email, Bill Gropp pointed me to two references about problems when providing shared memory through libraries, as with pthreads (or now with MPI-3 shared memory):

[A] Hans-J Boehm: "Threads Cannot be Implemented as a Library" (2004) [[BR]] http://www.hpl.hp.com/techreports/2004/HPL-2004-209.pdf [[BR]] and[[BR]] [B] Sarita V. Adve, Hans-J. Boehm: "You don’t know Jack About Shared Variables or Memory Models" (2011) [[BR]] http://queue.acm.org/detail.cfm?id=2088916

Especially [A] provides a list of serios problems:

[A] Section 3 discusses the need of memory barriers and that the compiler cannot move code across opaque function calls in C, i.e., the background of the current text of #456. The MPI synchronization calls are such opaque function calls in C.

-TODO*: We must check, whether in Fortran, the definition of ASYNCHRONOUS window variables guarantees that the MPI synchronization calls are opaque in the sense of Section 3 in [A].
[A] Section 4 discusses problems beyond the current text of #456, i.e., problems that exist because the compiler is allowed to do optimizations that does not change the sequential outcome of a code, but that modifies its outcome that is visible by another thread.
[A] Section 4.1 shows:

[[[if (x==1) ++y;]]]

may be translated into

[[[++y; if (x!=1) --y;]]]

Both versions are correct within the executing thread, but starting with both x and y zero, another thread will never see y==1 with the original code, but will see for a short time y==1 with the transformed code.

This problem only exists if two threads are accessing the same variable whithin the same access epoch and at least one of the accesses is a store. Separate access epochs must be separated through memory barrier / synchronization / memory barrier.

The accesses to variable X in Example 11.14 in this ticket #456 has such a conflict, i.e., X may be switched by process A into an intermediate state (as y in the example above), which may cause that process B can read a wrong value for a short timeframe.

-TODO*: To check, whether such optimizations are still allowed in C11. The provided email answer below does not say anything about this problem. If the problem is not resolved in C11 then the example can be corrected with "VOLATILE x".
[A] Section 4.2 discusses the problem of modifying neighbor data in memory, e.g., if storing values into a bitfield. C11/C++11 restricts this problem to bitfields, see email answer below. Store accesses to other data must not modify neighbor data.

-TODO*: Which memory semantics provides Fortran TS 29113 for ASYNCHRONOUS data.

-TODO*: Correcting the sentence "A location must be at least a byte" in the first paragraph of the new Section 11.5.4A of this #456, e.g.,

'''A location must be at least a byte, with the exception that bit-fields are not supported.'''

-Rationale.* '''For adjacent variables or struct members that are not bitfields, the C11/C++11 specification requires an implementation to refrain from generating stores that cross into the memory range of another variable or struct member.''' -**(End of rationale.)***
[A] Section 4.3 on register promotion:

The email answer below tells that this problem is resolved for C11. Therefore, I would add at the end of the new Section 11.5.4A of this #456:

'''For a correct compilation of MPI shared memory accesses, it is required that the compiler provides the memory model defined in C11/C++11. The companion Fortran compiler must provide at least the same memory model.'''

-Rationale. '''Compilers that do not provide the memory model of C11/C++11 may cause invalid execution of MPI shared memory accesses because optimizations may cause the problems decribed in [A] Sections 4.2 and 4.3.'''
-(End of rationale.)

Bill Long forwarded to me the following emails answer to my question whether the problems presented in [A] are resolved by current compilers:

"From a specification perspective, as far as I know no language specification prior to C11/C++11 requires an implementation to protect against the problems in Hans Boehm's paper. That means that those problems are technically possible, and if they occur then a user must figure out how to coerce their compiler to "play nice" (by changing optimization flags, inserting "suppresses", adding "volatile" keywords, etc). However, the C11/C++11 memory model specification clearly addresses these two issues.

For "rewriting of adjacent data" (Section 4.2) the memory model defines that adjacent bitfields are logically the same location (for purposes of the memory model), so programmers are responsible for avoiding concurrent read/write access (i.e., data races) on adjacent bitfields. For adjacent variables or struct members that are not bitfields, the specification requires an implementation to refrain from generating stores that cross into the memory range of another variable or struct member. That is, a compiler must either use precise hardware instructions (e.g., byte-level access) or insert sufficient padding between variables and/or struct members.

For "register promotion" (Section 4.3) the memory model essentially prohibits an implementation from generating a store to a shared variable where such a store doesn't exist in the original program order (i.e., source code). For the example in the paper the memory model would allow register promotion to occur between the synchronization points (the opaque function calls), where the global variable is written, but not before or after them, where the global variable is not written.

Given this, the only guaranteed protection that programmers have against these problems is to use a fully compliant C11 or C++11 compiler. CCE isn't there yet, but we're working on it. I'm not sure how far along other vendors are, but I suspect they are ahead of us. All bets are off for compilers that implement older language standards. Also, it's not clear to me that these problems are addressed in Fortran, but I suspect that any vendor using the same optimizer for multiple languages will follow the same memory-model rules for all of them.

On the other hand, from a practical perspective Rolf's assumption may have some merit. Vendors certainly have incentive to fix issues when users/customers complain, and a heck of a lot of production pthreads code is "out in the wild" and (apparently) running just fine. That anecdotal evidence suggests that for most vendors and most codes these issues really don't matter. Also, these issues have been known for a long time (Hans Boehm wrote that paper in 2004) and the memory model specifications have been under development for at least as long. So, I suspect some compilers began adopting the memory-model restrictions many years ago, at least at some optimization levels.

That being said, not observing errors for some vendors and some codes (for some executions) is by no means a guarantee of correctness. These kind of data-race problems can manifest themselves in very rare and subtle ways, allowing real bugs to remain undetected for a long time. Even if a data race occasionally causes bad answers or a program crash, the intermittent nature of the failure makes it very difficult to track down and fix. Similarly, not all compiler vendors or target architectures are guaranteed to behave the same -- a code that runs just fine with one compiler and processor combination may exhibit data races with a different compiler or processor. This problem is particularly interesting in a world dominated by x86, which provides very strong memory ordering guarantees -- porting parallel codes to a processor with a weaker memory model (e.g., arm or powerpc) can expose bugs that did not manifest on x86."

mpiforumbot commented 8 years ago

Originally by RolfRabenseifner on 2014-09-29 10:30:20 -0500

The proposed changes of comment:12 and comment:13 together with a list of corrections by Rajeev Thakur are included into the proposed solution.

The TODOs of comment:13 are still to do.

mpiforumbot commented 8 years ago

Originally by goodell on 2014-09-29 11:38:36 -0500

I'm concerned about adding this hacked-up text that relates to C11's memory model without anybody in the MPI Forum actually studying the C11 memory model. Furthermore, if we attempt to relate the MPI RMA semantics to the C11 MM semantics, we should use precise language as specified in the C11 standard. For example, it might be more precise to say that MPI_Win_sync has the same synchronization effects as atomic_thread_fence(memory_order_acq_rel) (though I have not fully evaluated this statement for correctness). This sort of work must be done very carefully and we should defer to the C11 standard where possible.

Rolf, I don't think your statement about volatile is correct:

TODO: To check, whether such optimizations are still allowed in C11. The provided email answer below does not say anything about this problem. If the problem is not resolved in C11 then the example can be corrected with "VOLATILE x".

For one thing VOLATILE is not a valid C keyword (case matters in C). But more concerning is that I don't think anything in the C99 standard is required to cause volatile to prevent the problematic optimization. It happens that, in practice, most compilers will disable a whole bunch of optimization around volatile variables, but it's not required in this case AFAICT.

mpiforumbot commented 8 years ago

Originally by gropp on 2014-09-29 11:57:04 -0500

I fully agree with Dave. As I've said before, MPI should only talk about MPI. What the compiler might do with non-MPI is not in scope for us. Yes, this makes it difficult to provide examples that are guaranteed to work; that's part of the point of Boehm's article. But that doesn't change the fact that the MPI Forum must not specify the behavior of C (or Fortran) codes that happen to be accessing shared memory.

mpiforumbot commented 8 years ago

Originally by gropp on 2014-12-10 13:23:19 -0600

The WG appears to favor requiring that MPI provide strong process and memory synchronization. The issues raised by Goodell have not been resolved. It was also noted that this may mandate significant overhead (due to expensive memory fences) that may not be required by the user's application. A partial fix for the overhead is to provide additional assert options so that the implementation, at the cost of additional branches, that the implementation could use to avoid some memory fences.

An alternative that would have used MPI for process synchronization (e.g., use Barrier or point-to-point message passing), along with the language-provided features for shared memory operations (available in, for example, C11), did not find favor with the WG.

mpi-forum / mpi-forum-historic