Need for local completion and remote commit

naveen-rn commented 3 years ago

Motivation

In general, implementing shmem_quiet-based memory ordering semantics is expensive. With the introduction of processors with weak memory models, and support for multiple NICs per node, the cost of performing remote completion and committing any previously posted RMA and AMO operations keeps growing. Implementations need to perform dummy read-like operations to commit any outstanding operations into the remote target's memory.

Solution

As part of this proposal, we would like to introduce explicit options to perform local completion in OpenSHMEM. To complete the API we also would like to introduce the option to explicitly perform the remote commit operation. We can implement the existing shmem_quiet semantics as a combination of the local completion and remote commit operation.

Proposed API

The following new routines are proposed:

# Additions to OpenSHMEM Memory Ordering Operations
void shmem_local_complete(void);
void shmem_ctx_local_complete(shmem_ctx_t ctx);
void shmem_remote_commit(void);
void shmem_ctx_remote_commit(shmem_ctx_t ctx);

# Additions to OpenSHMEM collective operations
void shmem_team_remote_commit(shmem_team_t team);

API Semantics

`shmem_local_complete` and `shmem_ctx_local_complete`

The shmem_local_complete routine ensures the local completion of all operations on symmetric data objects issued by the calling PE on a given context. Local completion means that all previously posted operations on symmetric data objects have completed locally, but it does not guarantee any visibility of those operations when shmem_local_complete returns. After local completion, the data objects used by all previously posted operations are ready to be reused for other operations.

`shmem_remote_commit` and `shmem_ctx_remote_commit`

The shmem_remote_commit routine ensures the global visibility of all previously locally completed operations. Note that this routine ensures global visibility only for operations that were already locally complete. Local completion can be attained implicitly through OpenSHMEM routines (such as blocking put and AMO) or explicitly by calling the shmem_local_complete operations.

`shmem_team_remote_commit`

This is a collective variant of the shmem_remote_commit operation. This routine registers the arrival of a PE at a shmem_team_remote_commit operation and blocks the PE until all other PEs arrive at the same shmem_team_remote_commit operation. It also ensures that any locally completed operations on all PEs are made globally visible.
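The semantics above can be read as a small state machine per operation. The following is a minimal sketch of that reading, not an OpenSHMEM implementation: the `model_*` names, the `op_state` enum, and the fixed-size `model_ctx` are all hypothetical and exist only for illustration. It assumes each operation moves from posted, to locally complete, to globally visible, and that remote commit skips anything that is not yet locally complete.

```c
#include <assert.h>
#include <stddef.h>

/* Toy lifecycle of one RMA/AMO operation under the proposed semantics. */
typedef enum { OP_POSTED, OP_LOCAL_COMPLETE, OP_VISIBLE } op_state;

#define MAX_OPS 16

/* Hypothetical stand-in for the operations outstanding on one context. */
typedef struct {
    op_state ops[MAX_OPS];
    size_t   count;
} model_ctx;

/* Post a nonblocking operation; returns its index within the context. */
static size_t model_post_nbi(model_ctx *c) {
    assert(c->count < MAX_OPS);
    c->ops[c->count] = OP_POSTED;
    return c->count++;
}

/* Models shmem_ctx_local_complete: posted operations become locally complete. */
static void model_local_complete(model_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->ops[i] == OP_POSTED)
            c->ops[i] = OP_LOCAL_COMPLETE;
}

/* Models shmem_ctx_remote_commit: only operations that are already
 * locally complete become visible; merely posted ones are left alone. */
static void model_remote_commit(model_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->ops[i] == OP_LOCAL_COMPLETE)
            c->ops[i] = OP_VISIBLE;
}

/* Models shmem_ctx_quiet as the composition described in the proposal. */
static void model_quiet(model_ctx *c) {
    model_local_complete(c);
    model_remote_commit(c);
}
```

In this model, calling `model_remote_commit` right after `model_post_nbi` leaves the operation in `OP_POSTED`, while `model_quiet` always drives it to `OP_VISIBLE`.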

nspark commented 3 years ago

static long target;
long base;
shmem_atomic_fetch_add_nbi(ctx, &base, &target, value, target_pe);

shmem_ctx_local_complete(ctx);
// The 'base' object has been updated on the calling PE.

shmem_ctx_remote_commit(ctx);
// The update to 'target' is now visible in memory on the target PE.

naveen-rn commented 3 years ago

Some examples to clarify the local complete and remote commit semantics:

1. shmem_put_nbi
2. shmem_remote_commit // remote commit is a no-op here - local completion of previous put is not provided

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1)

1. shmem_put
2. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1) 
                       // because, implicit local completion is available as part of blocking put operation

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_put
4. shmem_remote_commit  // target buffers from step(1) and (3) are made globally visible
                        // because, implicit local completion for blocking put in step(3) and explicit local
                        // completion in step(2) for nbi put operation in step(1) are available

1. shmem_put_nbi
2. shmem_put
3. shmem_remote_commit  // target buffer only from step(2) is globally visible and not from step(1)
                        // implicit local complete semantics in blocking put does not guarantee local completion 
                        // from other operations

1. shmem_get_nbi
2. shmem_local_complete // guarantees the availability of received value with return from local complete
                        // local completion of the get operation guarantees the actual completion of operation

// Nick's example
1. shmem_atomic_fetch_add_nbi
2. shmem_local_complete // fetched value is made available on returning from local complete
                        // but global visibility of target buffer from the AMO is not guaranteed
3. shmem_remote_commit  // global visibility of target buffer from the AMO is guaranteed
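The put examples above can be checked mechanically with a tiny trace evaluator. This is a sketch under stated assumptions: the one-letter step codes and the `visible_puts` function are invented for illustration, and the model only encodes the rules as described in this thread (a blocking put is implicitly locally complete, and remote commit affects only locally complete operations).

```c
#include <assert.h>

/*
 * Evaluate a trace of steps and return a bitmask of the puts that are
 * globally visible at the end (bit i set = the i-th put in the trace).
 * Step codes (hypothetical, for this sketch only):
 *   'n' = shmem_put_nbi          'p' = shmem_put (blocking)
 *   'L' = shmem_local_complete   'C' = shmem_remote_commit
 */
static unsigned visible_puts(const char *trace) {
    unsigned posted = 0;   /* nonblocking puts not yet locally complete */
    unsigned local = 0;    /* locally complete, not yet committed */
    unsigned visible = 0;  /* committed, globally visible */
    unsigned next = 0;     /* index of the next put in the trace */

    for (const char *s = trace; *s; s++) {
        switch (*s) {
        case 'n':
            assert(next < 32);
            posted |= 1u << next++;
            break;
        case 'p':
            /* A blocking put is locally complete on return, but it does
             * not complete earlier nonblocking operations. */
            assert(next < 32);
            local |= 1u << next++;
            break;
        case 'L':
            local |= posted;
            posted = 0;
            break;
        case 'C':
            visible |= local;
            local = 0;
            break;
        default:
            assert(!"unknown step code");
        }
    }
    return visible;
}
```

For instance, `visible_puts("npC")` returns `0x2`: only the blocking put is visible, which matches the put_nbi + put + remote_commit example.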

manjugv commented 3 years ago

"2. shmem_local_complete // fetched value is made available on returning from local complete
                        // but global visibility of target buffer from the AMO is not guaranteed"

FYI - From implementation perspective, this requires remote completion and it will have a latency of remote completion.

naveen-rn commented 3 years ago

"FYI - From implementation perspective, this requires remote completion and it will have a latency of remote completion."

@manjugv Does that mean - every FAMO in your implementation provides global visibility guarantees? If so, aren't you providing more guarantees than what OSM-1.5 expects?

AFAIU, a local completion operation is not used to create delayed execution. That is for the shmem_session to handle. It just provides a way for delayed remote completion.

Meaning, you can try to implement all NBI and blocking operations by maintaining a local staging buffer. But you would definitely need to post all these operations from the local staging buffer into the NIC during local_complete, and make sure they have reached a state in the NIC where they are safe from retransmission requests.
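A rough sketch of the staging-buffer idea, purely as a toy model: `staging_model`, `stage_put_nbi`, and `stage_local_complete` are hypothetical names, there is no real NIC here, and "handing off" is reduced to a counter. It assumes payloads are copied into staging at post time, so the only work left for local completion is draining the staging buffer.

```c
#include <assert.h>
#include <string.h>

#define STAGE_SLOTS 8
#define PAYLOAD_MAX 32

/* Hypothetical per-context staging area for posted operations. */
typedef struct {
    char   data[STAGE_SLOTS][PAYLOAD_MAX];
    size_t len[STAGE_SLOTS];
    int    staged;      /* operations copied in, not yet handed to the NIC */
    int    handed_off;  /* operations handed to the (imaginary) NIC */
} staging_model;

/* Post a put by copying the payload into the staging area. The caller's
 * source buffer can change afterwards without affecting the staged copy. */
static void stage_put_nbi(staging_model *m, const void *src, size_t n) {
    assert(m->staged < STAGE_SLOTS && n <= PAYLOAD_MAX);
    memcpy(m->data[m->staged], src, n);
    m->len[m->staged] = n;
    m->staged++;
}

/* Models local_complete: every staged operation must be pushed to the NIC
 * before returning. Returns how many operations were drained. */
static int stage_local_complete(staging_model *m) {
    int drained = m->staged;
    m->handed_off += drained;
    m->staged = 0;
    return drained;
}
```

The model says nothing about remote visibility: after `stage_local_complete` returns, operations have only left the staging area.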

nspark commented 3 years ago

I was thinking about this proposal today; in particular, how it seems to give rise to a set of "equivalences:"

shmem_put ≡ shmem_put_nbi + shmem_local_complete
shmem_quiet ≡ shmem_local_complete + shmem_remote_commit
shmem_barrier_all ≡ shmem_quiet + shmem_sync_all ≡ shmem_local_complete + shmem_remote_commit + shmem_team_sync ≡ shmem_local_complete + shmem_team_remote_commit

On one hand, I think that thinking about how existing OpenSHMEM operations can be translated into equivalent forms could be helpful. On the other hand, I think the put_nbi + put + remote_commit example is a good counter example that shows the limited "scope" of the put → put_nbi + local_complete equivalence.

nspark commented 3 years ago

Separately, I'm a little nervous that we're adding complexity here that may be hard to reconcile with any eventual memory model. I think we had a reasonably clear mapping of AMOs and fence/quiet to the C++ memory model. I feel less confident about the mapping in terms of local_complete and remote_commit.

manjugv commented 3 years ago

RDMA flush proposal: https://tools.ietf.org/id/draft-talpey-rdma-commit-01.html#rfc.section.3.1.1

nspark commented 3 years ago

On today's call, it seemed like:

Not everyone loves the name shmem_local_complete, but most generally support the concept.
Not everyone thinks that splitting shmem_quiet into the semantic pieces of shmem_local_complete + shmem_remote_commit provides a benefit.

While I understand @naveen-rn's rationale for all three new APIs, I wonder whether this issue—in particular, the need for an efficient successor to shmem_ctx_quiet + shmem_team_sync—is best handled by focusing primarily on the team-based synchronization aspect.

It seems to me (perhaps naively) that this issue could really be two mostly independent features: shmem_local_complete (or some renamed variant) and shmem_team_barrier. If anything, the originating motivation seems to be primarily for the latter.

nspark commented 3 years ago

Separately, there was a lot of discussion about completion semantics and how they're implemented. As an application user, I feel like libfabric has reasonably understandable language regarding completion semantics. (See "Completion Event Semantics" under man fi_cq.) In my understanding,

FI_INJECT_COMPLETE ≈ what we call "local completion" for puts
FI_DELIVERY_COMPLETE ≈ what we call "remote completion" for puts + "local completion" for gets
FI_COMMIT_COMPLETE ≈ RDMA flush (which mashes persistence and global visibility together)
OpenSHMEM doesn't have anything analogous to libfabric's FI_TRANSMIT_COMPLETE and FI_MATCH_COMPLETE

Likely someone can correct me, but it doesn't seem like libfabric has anything quite analogous to shmem_quiet's "globally visible" requirement—unless that's FI_COMMIT_COMPLETE but for non-persistent memory.

naveen-rn commented 3 years ago

The status of this PR as of June 25, before the Spec Meeting:

It is good to split the shmem_quiet semantics
There was good acceptance of the shmem_local_complete semantics, though the name of the routine is still being discussed
shmem_remote_commit does not seem that useful - no pressing use case
shmem_team_remote_commit is option 1 to address the deprecated shmem_barrier routine
shmem_team_barrier is option 2 to address the deprecated shmem_barrier routine: it has semantics similar to shmem_barrier, but the flush semantics are available only on shared contexts and not on private contexts
