manjugv opened this issue 5 years ago
I think this is an interesting feature, but users need some additional information about the implementation to understand the cases in which shmem_pe_fence might be advantageous over shmem_fence. Specifically, the user application should be able to query when shmem_pe_fence is equivalent to shmem_fence, and when it is a lighter-weight operation.
Consider a message-passing example in which a PE is sending, in order, data and a flag (like shmem_put_signal) to multiple PEs. One would likely select different fencing behavior depending on whether shmem_pe_fence is equivalent to shmem_fence or not. For example:
void message_exchange(void *dst, const void *src, size_t nbytes,
                      uint64_t *flags, int *targets, int num_targets) {
  // fence_pe < fence
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);
    shmem_atomic_set(flags, 1, targets[i]);
  }

  // fence_pe == fence
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  shmem_fence();
  for (int i = 0; i < num_targets; i++)
    shmem_atomic_set(flags, 1, targets[i]);
}
@nspark Thanks for the nice example. I agree, the performance query will be helpful. Using shmem_putmem_nbi might also help to illustrate the potential performance improvement: finer-grain fencing will enable pipelining of the two loops in the network. That is, an implementation that waits for completion of some, rather than all, of the puts may start issuing the sets sooner and do a better job of hiding the fence latency, roughly as sketched below.
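A minimal sketch of that pipelined variant, assuming the proposed shmem_pe_fence (the function name message_exchange_nbi is just for illustration):

#include <shmem.h>
#include <stddef.h>
#include <stdint.h>

void message_exchange_nbi(void *dst, const void *src, size_t nbytes,
                          uint64_t *flags, int *targets, int num_targets) {
  /* Issue all payload puts without waiting for their completion. */
  for (int i = 0; i < num_targets; i++)
    shmem_putmem_nbi(dst, src, nbytes, targets[i]);

  /* Order and signal each target individually; targets whose puts
   * complete early can be signaled while others are still in flight. */
  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);             /* proposed routine */
    shmem_atomic_set(flags, 1, targets[i]);
  }
}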
for (int i = 0; i < num_targets; i++) {
  shmem_pe_fence(targets[i]);
  shmem_atomic_set(flags, 1, targets[i]);
}
If shmem_fence == shmem_pe_fence, then shmem_pe_fence(targets[0]) will be expensive, but shouldn't the subsequent fences inside the loop be no-ops (just a single conditional check)? In general, I see a potential performance regression caused by introducing extra conditional checks, not because of any heavy network operations. Unnecessary memory fences might be another issue if the target PEs are on the same node.
That said, I see shmem_pe_fence/quiet not as routines to obtain further granularity in memory ordering (use a context for that usage model) or as replacements for contexts, but as hints that implementations can use to take advantage of any network semantics.
@naveen-rn Agreed that this optimization can be used. IIUC, it will require a flag to be set in RMA/AMO operations to track whether the PE is still in the fenced/quieted state. In a multithreaded PE, the flag would require atomics. So, it seems like there is a tradeoff between the performance of per-PE fence/quiet and the RMA/AMO rate for small messages.
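A rough sketch of that bookkeeping, with all names invented (a real implementation would likely keep this state per context, and per target for a finer-grained version):

#include <shmem.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical internal state: set whenever an RMA/AMO is issued,
 * cleared by fence/quiet.  In a multithreaded PE this must be atomic,
 * which is the small per-operation cost mentioned above. */
static atomic_bool ops_since_last_fence = false;

static inline void internal_mark_op_issued(void) {
  atomic_store_explicit(&ops_since_last_fence, true, memory_order_release);
}

static inline void internal_pe_fence(int target_pe) {
  (void)target_pe;  /* single global flag here; per-target in practice */
  /* If nothing was issued since the last fence/quiet, this is just a
   * conditional check; otherwise fall back to a full fence. */
  if (atomic_exchange_explicit(&ops_since_last_fence, false,
                               memory_order_acq_rel))
    shmem_fence();
}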
For the NVSHMEM or NUMA case I mentioned, it may not be so straightforward. It will be platform dependent whether O(N) finer-grain (e.g., GPU- or socket-local) fence/quiet operations will be cheaper than a single O(1) system-level operation.
@nspark - Users would have to be really careful about how they use it. If it is used without understanding the system topology, performance will be hurt. In the shared memory case we would have to place a "global" memory barrier regardless of which routine is used. If you have to do more than one per-PE fence/quiet, you are better off using the regular fence/quiet.
(I'm late to this).
The example above can pretty much be replaced by put w/signal. It's more interesting if there's multiple puts being fenced.
What about non-blocking fences? Has that been part of this discussion?
// fence_pe < fence
for (int i = 0; i < num_targets; i++) {
  shmem_putmem_nbi(dstdata, srcdata, size, targets[i]);
  shmem_putmem_nbi(dstdata1, srcdata1, size1, targets[i]);
  shmem_putmem_nbi(dstdata2, srcdata2, size2, targets[i]);
  shmem_pe_fence_nb(targets[i]);   /* hypothetical non-blocking per-PE fence */
  shmem_atomic_set(flags, 1, targets[i]);
}
Actually, if it's non-blocking, does it matter as much if it's shmem_pe_fence or shmem_fence?
DMAPP did a "bundled" put where you could put multiple times with a flag on the last put, guaranteed to be delivered after the "bundle".
/* dmapp_bput_nbi - Non-blocking implicit bundled Put
 *
 * Parameters:
 *   IN  target_addr  Address of target buffer
 *       target_seg   Segment descriptor of target buffer
 *       target_pe    Target PE
 *       source_addr  Address of source buffer
 *       nelems       Number of elements to be transferred
 *       type         Type of elements to be transferred
 *       flag_addr    Address of (8-byte) target flag location (must be non-NULL for the last call in a bundle)
 *       flag_seg     Segment descriptor for flag buffer (if NULL, assumed to be the same as target_seg)
 *       flag_value   64-bit user defined flag value
 */
@bcernohous - Does it have to be a put to the same PE, or can it be to different PEs?
It was pretty limited. The same PE. And since it was internally an FMA chain, the chain would be broken by non-bundled puts.
I'm not sure it was ever used, but it was a requested experimental feature. Not proposing it for SHMEM, just reminded by the per-PE fence/flag example. There's an implicit PE fence on the last bundled put similar to put w/signal.
I'm really looking for clarification on the use case(s) driving this feature. The example was basically put w/signal.
What about non-blocking fences? Has that been part of this discussion?
No, but I also don't think there's anything about shmem_fence that requires it to be blocking. The meat of what is said about shmem_fence is that it "assures ordering". One way of assuring ordering is to call shmem_quiet (and block); others might count ACKs for the issued puts. Even still, one might not need to block. An internal flag could be set that indicates a fence has been issued (and maybe even capture the list of PEs with outstanding operations). A subsequent put (to one of those PEs with outstanding operations) could then block in the put call (and not the earlier fence) waiting for the expected number of ACKs.
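A sketch of that "deferred" fence idea; the transport hooks (wait_for_acks, issue_put) and the bookkeeping layout are all hypothetical:

#include <stdbool.h>
#include <stddef.h>

#define MAX_PES 1024                    /* illustrative bound only */

typedef struct {
  int  acks_expected[MAX_PES];          /* puts issued but not yet ACKed */
  bool fence_pending[MAX_PES];          /* a fence was requested for PE  */
} transport_state_t;

/* Hypothetical transport hooks. */
extern void wait_for_acks(transport_state_t *st, int pe);
extern void issue_put(int pe, const void *src, void *dst, size_t nbytes);

/* The fence only records intent; nothing blocks here. */
void lazy_pe_fence(transport_state_t *st, int pe) {
  st->fence_pending[pe] = true;
}

/* The next operation to a PE with a pending fence pays the wait. */
void transport_put(transport_state_t *st, int pe,
                   const void *src, void *dst, size_t nbytes) {
  if (st->fence_pending[pe]) {
    wait_for_acks(st, pe);              /* block only now, and only if needed */
    st->fence_pending[pe] = false;
  }
  issue_put(pe, src, dst, nbytes);
  st->acks_expected[pe]++;
}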
The example above can pretty much be replaced by put w/signal. It's more interesting if there's multiple puts being fenced.
That wasn't the point of the example, which acknowledged that shmem_put_signal could be used there. The point is that applications (or their authors) will need to have some general sense of the relative cost of shmem_ctx_fence_pe vs. shmem_ctx_fence. If they're the same operation (or have nearly the same cost), the application will almost certainly choose a different synchronization pattern than if they're not. This could be tested empirically, but I would rather have the implementation inform the developer somehow.
Going further, for N == shmem_n_pes(), an application should not expect that time(shmem_fence) = N · time(shmem_fence_pe). It could be nice, but it is unrealistic. I also don't think an implementation could really give consistent, meaningful relative costs for these operations. Where there's a difference, an application will likely have to be tuned empirically to select one over the other.
Not proposing it for SHMEM, just reminded by the per-PE fence/flag example. There's an implicit PE fence on the last bundled put similar to put w/signal. [...] I'm really looking for clarification on the use case(s) driving this feature.
This feature is an integral part of the Verbs and IBTA specifications (post list). It is actually the default API for sending messages. It definitely has performance benefits and provides some opportunities to reduce the number of memory barriers. It also comes with some overheads.
No, but I also don't think there's anything about shmem_fence that requires it to be blocking. [...] An internal flag could be set that indicates a fence has been issued (and maybe even capture the list of PEs with outstanding operations).
Something like a post-list of requests is much more powerful. The user provides a clear "plan" to the library of what the next steps are, and the library can optimize those. It is difficult to optimize for a fence.
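For reference, a minimal sketch of the Verbs post-list pattern being referred to, assuming an already connected RC QP and registered buffers (post_bundle and MAX_WR are just illustrative names; the caller is assumed to place the flag write last in the list):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define MAX_WR 16

/* Post up to MAX_WR RDMA writes as a single linked list.  On a reliably
 * connected QP the writes execute at the target in posting order, so a
 * flag write placed last lands after the payload writes; only the last
 * WR requests a completion. */
int post_bundle(struct ibv_qp *qp, struct ibv_sge *sges,
                uint64_t *remote_addrs, uint32_t rkey, int n) {
  struct ibv_send_wr wr[MAX_WR], *bad_wr = NULL;

  if (n < 1 || n > MAX_WR)
    return -1;

  memset(wr, 0, sizeof(wr));
  for (int i = 0; i < n; i++) {
    wr[i].wr_id               = i;
    wr[i].sg_list             = &sges[i];
    wr[i].num_sge             = 1;
    wr[i].opcode              = IBV_WR_RDMA_WRITE;
    wr[i].wr.rdma.remote_addr = remote_addrs[i];
    wr[i].wr.rdma.rkey        = rkey;
    wr[i].next                = (i + 1 < n) ? &wr[i + 1] : NULL;
  }
  wr[n - 1].send_flags = IBV_SEND_SIGNALED;  /* one completion for the whole plan */

  return ibv_post_send(qp, wr, &bad_wr);
}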
The point is that applications (or their authors) will need to have some general sense of the relative cost of shmem_ctx_fence_pe vs. shmem_ctx_fence. [...] I also don't think an implementation could really give consistent, meaningful relative costs for these operations. Where there's a difference, an application will likely have to be tuned empirically to select one over the other.
@nspark - the cost might be runtime dependent, since with each allocation you can get a different layout of nodes and, as a result, the cost will be highly inconsistent. The cost of the operation will depend on the transports that are used and on locality.
Something like a post-list of requests is much more powerful. The user provides a clear "plan" to the library of what the next steps are, and the library can optimize those. It is difficult to optimize for a fence.
I agree it's difficult to optimize, which is why I'd like the use case spelled out. I question the value of tracking PE completions (cost) vs. per-PE fence (benefits) without hardware support.
@bcernohous the main use case that is driving this interface is very close to what Nick has posted. The DMAPP interface you mention can be used for it. In fact, as was pointed out, it is supported in Verbs as well.
I made the choice of a per-PE fence interface because it covers the use case that Nick has posted, and it also extends the semantics to AMOs and memory operations. Besides that, for network paths where packets do not arrive at the destination in order, this provides a potentially low-cost option compared to an all-PE fence. Also, for such network paths, this can be leveraged in multiple-NIC scenarios, i.e., you need to order operations to one PE (1 × M HCAs) rather than to multiple PEs (N PEs × M HCAs).
This is advantageous for OpenSHMEM over p2p/network transports; it is not necessarily advantageous for OpenSHMEM over shared memory (the granularity of the memory barrier is much coarser on shared memory systems). As folks have pointed out, it can have variable performance and can hurt the performance of the overall application (with the wrong usage). In the working group, we discussed the idea of using a performance query. I'm not sure how to design that query and standardize it, yet.
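To make the query idea concrete, one entirely hypothetical shape for it (the type, enum values, and routine name are invented for discussion, not proposed text):

/* Hypothetical query, for discussion only. */
typedef enum {
  SHMEM_PE_FENCE_SAME_AS_FENCE,  /* per-PE fence is implemented as a global fence */
  SHMEM_PE_FENCE_CHEAPER,        /* genuinely per-target and lighter weight       */
  SHMEM_PE_FENCE_VARIABLE        /* depends on topology/transport at run time     */
} shmem_pe_fence_cost_t;

shmem_pe_fence_cost_t shmem_query_pe_fence_cost(int target_pe);

An application could then pick between the two loops in Nick's example at startup, e.g. by checking whether shmem_query_pe_fence_cost(targets[0]) returns SHMEM_PE_FENCE_CHEAPER. As noted above, the answer may be allocation and topology dependent, which is part of what makes standardizing such a query hard.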
Problem:
To achieve ordering of RMA operations between two PEs, OpenSHMEM programs have to use shmem_fence. The shmem_fence routine, however, orders all RMA operations from the calling PE to all other PEs. Though this achieves the intended result, it is an expensive operation on unordered networks.
Proposal:
Introduce PE-specific ordering routines that order RMA operations from the calling PE to the target PE:
shmem_pe_fence(target_pe)
shmem_ctx_pe_fence(ctx, target_pe)
Impact on Users:
This provides a lightweight alternative to shmem_fence when an OpenSHMEM program requires an ordering guarantee only from the calling PE to a specific PE. There is no change in semantics for the current shmem_fence interface.
Impact on implementation:
Implementations will have to provide the new interface shmem_pe_fence(target_pe) and its context variant shmem_ctx_pe_fence(ctx, target_pe).
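A note on the minimal case: since shmem_fence already orders operations to all PEs, an implementation that does not track per-target state can satisfy the new routines by mapping them onto the existing fences. A sketch using the proposed names:

#include <shmem.h>

void shmem_pe_fence(int target_pe) {
  (void)target_pe;   /* ordering to all PEs subsumes ordering to one */
  shmem_fence();
}

void shmem_ctx_pe_fence(shmem_ctx_t ctx, int target_pe) {
  (void)target_pe;
  shmem_ctx_fence(ctx);
}

Whether the per-PE routines can ever be cheaper than this fallback is exactly the implementation question discussed above.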