Open naveen-rn opened 3 years ago
static long target;
long base;
shmem_atomic_fetch_add_nbi(ctx, &base, &target, value, target_pe);
shmem_ctx_local_complete(ctx);
// The 'base' object has been updated on the calling PE.
shmem_ctx_remote_commit(ctx);
// The update to 'target' is now visible in memory on the target PE.
Some examples to clarify the local complete and remote commit semantics:
1. shmem_put_nbi
2. shmem_remote_commit // remote commit is a no-op here - local completion of previous put is not provided

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1)

1. shmem_put
2. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1)
// because, implicit local completion is available as part of blocking put operation

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_put
4. shmem_remote_commit // target buffers from step(1) and (3) are made globally visible
// because, implicit local completion for blocking put in step(3) and explicit local
// completion in step(2) for nbi put operation in step(1) are available

1. shmem_put_nbi
2. shmem_put
3. shmem_remote_commit // target buffer only from step(2) is globally visible and not from step(1)
// implicit local complete semantics in blocking put does not guarantee local completion
// from other operations

1. shmem_get_nbi
2. shmem_local_complete // guarantees the availability of received value with return from local complete
// local completion of the get operation guarantees the actual completion of operation

// Nick's example
1. shmem_atomic_fetch_add
2. shmem_local_complete // fetched value is made available on returning from local complete
// but global visibility of target buffer from the AMO is not guaranteed
3. shmem_remote_commit // global visibility of target buffer from the AMO is guaranteed
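
The put examples above can be read as a small state machine. The following is only a toy sketch under that assumption: each operation is "issued", "locally complete", or "globally visible", and remote commit promotes only operations that are already locally complete. All `toy_*` names and the `OP_*` states are hypothetical and are not part of OpenSHMEM or of this proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: nothing here is OpenSHMEM API. */
typedef enum { OP_ISSUED, OP_LOCAL, OP_VISIBLE } op_state;

#define MAX_OPS 16

typedef struct {
    op_state state[MAX_OPS];
    size_t   count;
} toy_ctx;

/* Nonblocking put: issued, but not yet locally complete. Returns the op index. */
static size_t toy_put_nbi(toy_ctx *c) {
    assert(c->count < MAX_OPS);
    c->state[c->count] = OP_ISSUED;
    return c->count++;
}

/* Blocking put: locally complete on return, but only for this one op. */
static size_t toy_put(toy_ctx *c) {
    size_t i = toy_put_nbi(c);
    c->state[i] = OP_LOCAL;
    return i;
}

/* Local completion: every issued op becomes locally complete. */
static void toy_local_complete(toy_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->state[i] == OP_ISSUED) c->state[i] = OP_LOCAL;
}

/* Remote commit: only ops that are already locally complete become visible. */
static void toy_remote_commit(toy_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->state[i] == OP_LOCAL) c->state[i] = OP_VISIBLE;
}
```

Replaying the put_nbi + put + remote_commit example in this model leaves the nonblocking put in `OP_ISSUED` while the blocking put reaches `OP_VISIBLE`; adding a `toy_local_complete` call before a second `toy_remote_commit` then promotes the nonblocking put as well.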
"2. shmem_local_complete // fetched value is made available on returning from local complete
// but global visibility of target buffer from the AMO is not guaranteed"
FYI - From implementation perspective, this requires remote completion and it will have a latency of remote completion.
@manjugv Does that mean - every FAMO in your implementation provides global visibility guarantees? If so, aren't you providing more guarantees than what OSM-1.5 expects?
AFAIU, a local completion operation is not used to create delayed execution. That is for the shmem_session to handle. It just provides a way for delayed remote completion.
Meaning, you can try to implement all NBI and blocking operations by maintaining a local staging buffer. But you would definitely need to post all these operations from the local staging buffer into the NIC during local_complete, and make sure each has reached a state in the NIC where it is safe from a retransmission request.
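
To make the staging-buffer point concrete, here is a minimal sketch, assuming a toy single-process model in which the "NIC" is just a counter. The names `stage_ctx`, `stage_put`, and `stage_local_complete` are hypothetical and do not describe any real implementation; the retransmission-safety condition is not modeled at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGE_SLOTS 8
#define MSG_BYTES   32

/* Toy model only: a private staging area plus a pretend NIC counter. */
typedef struct {
    char   staged[STAGE_SLOTS][MSG_BYTES]; /* private copies of user payloads */
    size_t n_staged;                       /* copied, not yet handed to the NIC */
    size_t n_posted;                       /* handed to the (pretend) NIC */
} stage_ctx;

/* Copy the payload into staging. Returns 0 on success, -1 if the
 * staging area is full or the payload does not fit in one slot. */
static int stage_put(stage_ctx *s, const void *src, size_t len) {
    if (s->n_staged == STAGE_SLOTS || len > MSG_BYTES) return -1;
    memcpy(s->staged[s->n_staged++], src, len);
    return 0;
}

/* Local completion in this model: drain every staged op into the NIC. */
static void stage_local_complete(stage_ctx *s) {
    assert(s->n_staged <= STAGE_SLOTS);
    s->n_posted += s->n_staged;
    s->n_staged = 0;
}
```

In this sketch nothing leaves the staging area until `stage_local_complete` runs, which is the obligation described above: the local completion call is where deferred operations must actually be posted.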
I was thinking about this proposal today; in particular, how it seems to give rise to a set of "equivalences:"
shmem_put ≡ shmem_put_nbi + shmem_local_complete
shmem_quiet ≡ shmem_local_complete + shmem_remote_commit
shmem_barrier_all ≡ shmem_quiet + shmem_sync_all
  ≡ shmem_local_complete + shmem_remote_commit + shmem_team_sync
  ≡ shmem_local_complete + shmem_team_remote_commit
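
As a rough illustration of these equivalences, the sketch below composes stand-in stubs that only record the order in which they are called. It assumes nothing about real signatures: `stub_*` and `sketch_*` are hypothetical names, and no OpenSHMEM library is involved.

```c
#include <assert.h>
#include <string.h>

/* Call trace: one letter per stub invocation. */
static char trace[64];

static void record(const char *tag) {
    strncat(trace, tag, sizeof trace - strlen(trace) - 1);
}

/* Hypothetical stand-ins for the primitive operations. */
static void stub_put_nbi(void)        { record("P"); }
static void stub_local_complete(void) { record("L"); }
static void stub_remote_commit(void)  { record("R"); }
static void stub_sync_all(void)       { record("S"); }

/* put is equivalent to put_nbi + local_complete */
static void sketch_put(void) {
    stub_put_nbi();
    stub_local_complete();
}

/* quiet is equivalent to local_complete + remote_commit */
static void sketch_quiet(void) {
    stub_local_complete();
    stub_remote_commit();
}

/* barrier_all is equivalent to quiet + sync_all */
static void sketch_barrier_all(void) {
    sketch_quiet();
    stub_sync_all();
}
```

The composition only captures call ordering. It says nothing about the scope caveat, namely that the local completion implied by a blocking put covers that put alone.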
On one hand, I think that thinking about how existing OpenSHMEM operations can be translated into equivalent forms could be helpful. On the other hand, I think the put_nbi + put + remote_commit example is a good counter example that shows the limited "scope" of the put → put_nbi + local_complete equivalence.
Separately, I'm a little nervous that we're adding complexity here that may be hard to reconcile with any eventual memory model. I think we had a reasonably clear mapping of AMOs and fence/quiet to the C++ memory model. I feel less confident about the mapping in terms of local_complete and remote_commit.
RDMA flush proposal: https://tools.ietf.org/id/draft-talpey-rdma-commit-01.html#rfc.section.3.1.1
On today's call, it seemed like:
shmem_local_complete, but most generally support the concept.
shmem_quiet into the semantic pieces of shmem_local_complete + shmem_remote_commit provides a benefit.

While I understand @naveen-rn's rationale for all three new APIs, I wonder whether this issue (in particular, the need for an efficient successor to shmem_ctx_quiet + shmem_team_sync) is best handled by focusing primarily on the team-based synchronization aspect.
It seems to me (perhaps naively) that this issue could really be two mostly independent features: shmem_local_complete (or some renamed variant) and shmem_team_barrier. If anything, the originating motivation seems to be primarily for the latter.
Separately, there was a lot of discussion about completion semantics and how they're implemented. As an application user, I feel like libfabric has reasonably understandable language regarding completion semantics. (See "Completion Event Semantics" under man fi_cq.) In my understanding,
FI_INJECT_COMPLETE ≈ what we call "local completion" for puts
FI_DELIVERY_COMPLETE ≈ what we call "remote completion" for puts + "local completion" for gets
FI_COMMIT_COMPLETE ≈ RDMA flush (which mashes persistence and global visibility together)
FI_TRANSMIT_COMPLETE and FI_MATCH_COMPLETE
Likely someone can correct me, but it doesn't seem like libfabric has anything quite analogous to shmem_quiet's "globally visible" requirement, unless that's FI_COMMIT_COMPLETE but for non-persistent memory.
The status of this PR as of June 25, before the Spec Meeting:
shmem_quiet semantics
shmem_local_complete semantics, though the name of the routine is still being discussed
shmem_remote_commit does not seem that useful - no pressing use case
shmem_team_remote_commit is option:1 to address the deprecated shmem_barrier routine
shmem_team_barrier, with similar semantics as shmem_barrier but with the flush semantics available only on shared contexts and not on private contexts, is option:2 to address the deprecated shmem_barrier routine
Motivation
In general, implementing shmem_quiet based memory ordering semantics is expensive. With the introduction of system processors with weak memory models, and support for multiple NICs per node, the cost of performing remote completion and committing any previously posted RMA and AMO events is getting really expensive. This introduces the need for performing dummy read-like operations to commit any outstanding operations into the remote target's memory.

Solution
As part of this proposal, we would like to introduce explicit options to perform local completion in OpenSHMEM. To complete the API, we would also like to introduce the option to explicitly perform the remote commit operation. We can implement the existing shmem_quiet semantics as a combination of the local completion and remote commit operations.

Proposed API
The following new routines are proposed:
API Semantics

shmem_local_complete and shmem_ctx_local_complete
The shmem_local_complete routine ensures the local completion of all operations on symmetric data objects issued by the calling PE on a given context. By local completion, the shmem_local_complete routine ensures the completion of all previously posted operations on symmetric data objects, but it does not guarantee any visibility of those operations when it returns. After local completion, the symmetric data objects from all previously posted operations are ready to be reused for other operations.

shmem_remote_commit and shmem_ctx_remote_commit
The shmem_remote_commit routine ensures the global visibility of all previously locally completed operations. Note that this routine ensures global visibility only for operations that were already locally complete. Local completion can be attained implicitly through OpenSHMEM routines (like blocking put and AMO) or explicitly by calling the shmem_local_complete operations.

shmem_team_remote_commit
This is a collective variant of the shmem_remote_commit operation. This routine registers the arrival of a PE at a shmem_team_remote_commit operation and blocks the PE until all other PEs arrive at the same shmem_team_remote_commit operation, and also ensures that any locally completed operations on all PEs are made globally visible.
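
The collective behaviour described for shmem_team_remote_commit can be sketched as a single-process toy model, assuming a fixed team of three PEs and per-PE counters instead of real memory. The `toy_team` type and `toy_team_remote_commit_arrive` function are hypothetical; a real implementation would block the caller rather than return a flag, and may make operations visible earlier than this model does.

```c
#include <assert.h>

#define TEAM_SIZE 3

/* Toy model only: per-PE counters stand in for real operations. */
typedef struct {
    int local_done[TEAM_SIZE]; /* locally complete, not yet visible */
    int visible[TEAM_SIZE];    /* globally visible */
    int arrived[TEAM_SIZE];    /* 1 if the PE is waiting in the collective */
    int n_arrived;
} toy_team;

/* Register the arrival of one PE. Returns 0 while the PE would still be
 * blocked, and 1 when this arrival releases the whole team. On release,
 * every PE's locally complete operations are promoted to visible. */
static int toy_team_remote_commit_arrive(toy_team *t, int pe) {
    assert(pe >= 0 && pe < TEAM_SIZE && !t->arrived[pe]);
    t->arrived[pe] = 1;
    if (++t->n_arrived < TEAM_SIZE) return 0;

    for (int i = 0; i < TEAM_SIZE; i++) {
        t->visible[i] += t->local_done[i];
        t->local_done[i] = 0;
        t->arrived[i] = 0; /* reset for the next collective call */
    }
    t->n_arrived = 0;
    return 1;
}
```

In this model no counter moves to `visible` until the last PE arrives, which matches the guarantee stated on return from the collective. Operations that are not yet locally complete are simply outside the model, in line with the semantics described above.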