Semantics: Completion of nonblocking collectives

nspark commented 6 years ago

Issue #39 raised the idea of adding nonblocking team-based collectives. Currently, I'm drafting changes for #43 to add shmem_team_alltoall_nbi, but I want to discuss the semantics of nonblocking collectives more broadly here.

My initial sense for semantics of nonblocking collectives (and blocking collectives, for completeness) is that:

A blocking collective operation is "active" between the call that initiates the operation (e.g., shmem_team_alltoall) and the return of that call, after which it is considered "complete" with respect to the calling PE.
A nonblocking collective operation is "active" after it has been initiated via the API call (e.g., shmem_team_alltoall_nbi).
An active nonblocking collective operation is "complete" (with respect to all PEs in the team) after a subsequent call to shmem_team_sync.
At most one collective operation, nonblocking or otherwise, may be active on a team at a given time.
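
As a rough illustration of the four rules above, here is a minimal toy model in plain C. It is only a sketch: the `toy_*` names and the per-team state are hypothetical bookkeeping, not part of any OpenSHMEM API or of the proposal itself, and it assumes `shmem_team_sync` is the one call allowed while a nonblocking collective is active.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-team state; real teams carry far more than this. */
typedef enum { TOY_TEAM_IDLE, TOY_TEAM_ACTIVE_NBI } toy_team_state_t;

typedef struct {
    toy_team_state_t state;
} toy_team_t;

/* Models initiating a nonblocking collective: the operation becomes
 * "active" on return. Returns false if the program would break the
 * "at most one active collective per team" rule. */
bool toy_collective_nbi_start(toy_team_t *team) {
    assert(team);
    if (team->state != TOY_TEAM_IDLE)
        return false;
    team->state = TOY_TEAM_ACTIVE_NBI;
    return true;
}

/* Models a blocking collective: active only for the duration of the
 * call, so the team is idle again (for this PE) when it returns. */
bool toy_collective_blocking(toy_team_t *team) {
    assert(team);
    if (team->state != TOY_TEAM_IDLE)
        return false;
    /* ... the data movement would happen here ... */
    return true;
}

/* Models the "special" sync: always permitted, and it completes any
 * nonblocking collective that is active on the team. */
void toy_team_sync(toy_team_t *team) {
    assert(team);
    team->state = TOY_TEAM_IDLE;
}
```

In this model, a second collective on the same team is rejected until `toy_team_sync` has been called, while a collective on a different `toy_team_t` is unaffected.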

Since shmem_team_sync is itself a collective operation, we would have to somehow designate it as a "special" collective that can be initiated while a nonblocking collective operation is active in order to complete the nonblocking collective operation. This similarly affects blocking collectives: a blocking collective operation is considered complete after the call returns; however, the operation may still be active on other PEs. What is a clear and precise way to distinguish the "specialness" of shmem_team_sync?

More generally, what do others think of these semantics for collective operations?

Poking @gmegan @shamisp @anshumang @jdinan @naveen-rn @manjugv for feedback.

shamisp commented 6 years ago

At most one collective operation, nonblocking or otherwise, may be active on a team at a given time.

Does this mean that at any given time you can have only one outstanding collective operation?

nspark commented 6 years ago

Does this mean that at any given time you can have only one outstanding collective operation?

Yes. With blocking collectives, concurrent collectives can only be performed by multiple threads each with a distinct team handle. With nonblocking collectives, a single thread could execute concurrent collectives, but would still require separate teams for each concurrent operation.

shamisp commented 6 years ago

OK. It is nonblocking only with respect to other one-sided operations and compute. Technically, it is blocking with respect to other collective ops, since we have to block before we issue the next one.

nspark commented 6 years ago

Sure, if you define "blocking" in that way. :wink:

Collectives on the same team must be serialized operations. With blocking collectives, they are implicitly serialized. With nonblocking collectives, they must be explicitly completed and serialized by the program.

Defined as such, nonblocking collectives are also nonblocking with respect to other collectives, blocking or not, on other teams.

shamisp commented 6 years ago

Such semantics make it relatively simple to implement... so I'm okay with them if they work for users :-)

manjugv commented 6 years ago

@nspark The pros I see are that you will be able to hide the alltoall latency (compared to blocking semantics) and that it is easier to implement than fully nonblocking operations. Why not hide latency by taking advantage of the multiple threads available? Do you want to take advantage of latency hiding even in single-threaded mode? Am I missing any other use case?

Naming: Since it is not really nonblocking, we could call them shmem_alltoall_start() and shmem_alltoall_complete() (or something like that) to avoid all the confusion that comes with nbi and shmem_team_sync().

nspark commented 6 years ago

Why not hide latency by taking advantage of the multiple threads available? Do you want to take advantage of latency hiding even in single-threaded mode? Am I missing any other use case?

I'd like programs to have the flexibility to choose between latency hiding with threads and using a latency-hiding routine on a single thread. Different applications may require different approaches.

Naming: Since it is not really nonblocking, we could call them shmem_alltoall_start() and shmem_alltoall_complete() (or something like that) to avoid all the confusion that comes with nbi and shmem_team_sync().

Yuck. :wink:

Seriously, since I think that (ultimately) there may be a case for other nonblocking collectives, I think it would be preferable to have one operation that can complete any of the nonblocking collectives. I'm certainly open to the claim that shmem_team_sync() may not be the right one.

The shmem_sync operations are barriers on execution progress that do not perform any completion or ordering of communication. In that sense, using shmem_team_sync() to complete a nonblocking collective is broadening its semantic reach. That said, I'm not sure we could add shmem_team_barrier() for this purpose and unify that cleanly with teams and contexts, which was why we chose not to add it in the first place.

I also don't really get why you and @shamisp keep saying these are actually blocking (or "not nonblocking"). A blocking operation, as we've traditionally defined it for SHMEM, is one that returns control to the caller of the function only after local completion. AFAICT, this definition holds for all the existing data-moving collectives. A nonblocking operation returns control to the caller of the function as soon as possible (without waiting for local completion) and allows the operation to continue independently w.r.t. the calling thread of execution. That's exactly what this notion of a nonblocking collective is doing. It does have the added restriction that only one nonblocking collective may be active at once, but that comes from our agreed-upon limitation of only supporting one blocking collective at a time.

That said, MPI supports multiple nonblocking collectives on a communicator (ref). Should we consider it?

naveen-rn commented 6 years ago

This NBI collective seems to be a good idea. Internally, Cray SHMEM supports some similar (not exactly matching) semantics by enabling some internal environment variables. It would be useful for collectives performed with large message sizes.

I suppose we would now need to clarify that:

The thread that initiated the NBI collective should be the same one that calls sync or barrier.
With just blocking semantics, it was possible for an implementer to use the same resource for both collectives and pt2pt operations. Adding an extra resource (or hardware collectives) for collectives was optional. But now it is mandatory that we always have a separate resource for performing collectives.

anshumang commented 6 years ago

An active nonblocking collective operation is "complete" (with respect to all PEs in the team) after a subsequent call to shmem_team_sync

To complete a nonblocking collective operation, can an alternative be to call shmem_ctx_quiet and shmem_team_sync? This does not affect the semantics of shmem_team_sync.

nspark commented 6 years ago

@naveen-rn Can you say how Cray SHMEM's similar nonblocking collectives differ in semantics?

With just blocking semantics, it was possible for an implementer to use the same resource for both collectives and pt2pt operations. Adding an extra resource (or hardware collectives) for collectives was optional. But now it is mandatory that we always have a separate resource for performing collectives.

I don't totally follow your reasoning on the addition of nonblocking collectives requiring such a separate resource. If a team is created without the NOCOLLECTIVE flag, the library will have to provide some resources for collective operations. The point-to-point operations would require a context to be created from that team, but those resource requirements are specified at team creation.

If the concern came from a situation in which a PE might (1) issue a nonblocking collective, then (2) issue P2P operations on that same team, then (3) call shmem_team_sync, I think the same scenario could arise when using multiple threads and the same team in which (for example) Thread 1 performs (1) and (3) and Thread 2 performs (2) "between" (1) and (3).

To complete a nonblocking collective operation, can an alternative be to call shmem_ctx_quiet and shmem_team_sync? This does not affect the semantics of shmem_team_sync.

@anshumang IIRC, we chose not to associate a specific context with the team for collectives. (That is, collectives operate on teams, not contexts.) The implication to me has been that some implementations may have an internal resource they use for collective operations, but we didn't decide to require it (e.g., if it was a hardware collective accelerator, it couldn't be a valid context for RMA) or provide an interface to access it for such a subsequent shmem_ctx_quiet call.

That said, I'm certainly open to the idea, if people think this could help provide a more consistent mechanism for completing nonblocking collectives.

anshumang commented 6 years ago

@nspark I get the point now. A follow-up clarification:

Since shmem_team_sync is itself a collective operation, we would have to somehow designate it as a "special" collective that can be initiated while a nonblocking collective operation is active in order to complete the nonblocking collective operation.

With blocking collectives:

Thread 1                          Thread 2
shmem_team_alltoall(team1)        shmem_team_alltoall(team1) // library needs to detect this as an error?
                                  shmem_team_sync(team1)     // library needs to allow this?

With non-blocking collectives:

Thread 1
shmem_team_alltoall_nbi(team1)
shmem_team_alltoall_nbi(team1) //library needs to detect this as error?
shmem_team_sync(team1) //library needs to allow this?

nspark commented 6 years ago

@anshumang I think both cases you've sketched are left as undefined behavior. The library isn't required to detect the concurrent use of a collective. The application developer is required to respect the restriction of no concurrent collectives on a single team.
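
Although detection is not required, an implementation could still offer it as an optional debugging aid. The following is a minimal sketch of such a check using a C11 atomic flag per team; the `dbg_*` names are hypothetical and are not part of OpenSHMEM or of any existing library. It assumes a blocking collective enters and exits the guard inside the call, a nonblocking collective enters at initiation and exits at the completing sync, and the sync itself never enters the guard.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical debug-only guard that a library might keep per team. */
typedef struct {
    atomic_flag collective_active;
} dbg_team_guard_t;

/* Put the guard into the "no collective active" state. */
void dbg_guard_init(dbg_team_guard_t *g) {
    atomic_flag_clear(&g->collective_active);
}

/* Called when a collective is initiated on the team. Returns true if
 * no other collective was active; false flags the concurrent use that
 * the semantics leave undefined. Safe to call from multiple threads. */
bool dbg_collective_enter(dbg_team_guard_t *g) {
    return !atomic_flag_test_and_set(&g->collective_active);
}

/* Called when the collective is complete: at return for a blocking
 * collective, at the completing sync for a nonblocking one. */
void dbg_collective_exit(dbg_team_guard_t *g) {
    atomic_flag_clear(&g->collective_active);
}
```

With a guard like this, both sketched cases would be reported on the second alltoall call, and the subsequent sync would be allowed because it bypasses the guard.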

naveen-rn commented 6 years ago

Can you say how Cray SHMEM's similar nonblocking collectives differ in semantics?

@nspark Cray SHMEM has an internal environment variable, _SHMEM_ALLTOALL_EARLY_EXIT, which allows alltoall operations to exit early. When alltoall is used with this variable enabled, the operation is guaranteed to be complete only after all participating PEs in the active set call a barrier. These semantics might look similar to the proposed NBI alltoall, but it is not truly nonblocking; it just exits early.

To me, a true NBI alltoall implementation returns immediately and the progress is checked only at the barrier or sync operation. I kind of understand what @shamisp and @manjugv mean by "blocking": if the collectives are completely offloaded to the hardware, then it would be nonblocking (not sure if any hardware can perform this kind of large-data-size offload). But any software-based implementation might make it partially blocking, say, if the software algorithm uses some kind of tree-based transfers and we don't have a progress thread.

naveen-rn commented 6 years ago

I don't totally follow your reasoning on the addition of nonblocking collectives requiring such a separate resource. If a team is created without the NOCOLLECTIVE flag, the library will have to provide some resources for collective operations. The point-to-point operations would require a context to be created from that team, but those resource requirements are specified at team creation.

@nspark I think this one was just an implementation idea. Please ignore it if you consider it a corner case or irrelevant.

A team_create operation without the NOCOLLECTIVE flag should allocate the requested amount of resources. These resources will be used for creating contexts attached to the team.
Implementations are allowed to add extra resources to each such team if they want to use those resources for performing collectives. But these are not hard requirements, just implementation decisions. If necessary, implementations can even use the resource managed by SHMEM_CTX_DEFAULT for performing collectives across all teams. I think this should be correct because, even if there are hardware offload options, they would mostly be shared across teams.

With this understanding, consider the following example:

#pragma omp parallel num_threads(2)
{
    shmem_team_create(team1, NOCOLLECTIVE, resource=1);
    if (omp_get_thread_num() == 0) {
        shmem_ctx_create(team1, ctx1, private);
        shmem_team_alltoall(team1);
        shmem_ctx_destroy(ctx1);
    }
    #pragma omp barrier
    if (omp_get_thread_num() == 1) {
        shmem_ctx_create(team1, ctx2, private);
        shmem_ctx_destroy(ctx2);
    }
}

The above example should work even with just one resource available for a team. We could use it for both context-based RMA/AMOs and collectives.

#pragma omp parallel num_threads(2)
{
    shmem_team_create(team1, NOCOLLECTIVE, resource=1);
    if (omp_get_thread_num() == 0) shmem_ctx_create(team1, ctx1, private);
    if (omp_get_thread_num() == 0) shmem_team_alltoall_nbi(team1);
    if (omp_get_thread_num() == 0) shmem_ctx_destroy(ctx1);
    if (omp_get_thread_num() == 1) shmem_ctx_create(team1, ctx2, private);
    if (omp_get_thread_num() == 0) shmem_team_sync(team1);
    if (omp_get_thread_num() == 1) shmem_ctx_destroy(ctx2);
}

With NBI collectives, however, the above use case is valid: a thread that created a context and initiated the NBI collective can destroy its context before calling shmem_team_sync. Hence, implementations shouldn't use the resource from team creation for performing collectives; we always need an explicit resource, the resource managed by SHMEM_CTX_DEFAULT, or hardware offload.
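
The lifetime concern can be sketched with a small toy model in plain C. Everything here is hypothetical (`res_*` names, a single shared resource, two boolean flags) and only illustrates the argument under the assumption that the collective borrows the same resource a context was created on; it does not describe any real implementation.

```c
#include <stdbool.h>

/* Hypothetical model of one communication resource that backs a
 * context and may also be borrowed by an in-flight nbi collective. */
typedef struct {
    bool ctx_alive;          /* a context on this resource exists */
    bool collective_in_use;  /* an nbi collective still uses it   */
} res_model_t;

/* Models creating a context on the team's single resource. */
void res_ctx_create(res_model_t *r) { r->ctx_alive = true; }

/* Models initiating an nbi collective that borrows the resource. */
void res_nbi_start(res_model_t *r) { r->collective_in_use = true; }

/* Models the completing sync: the collective no longer needs it. */
void res_team_sync(res_model_t *r) { r->collective_in_use = false; }

/* Models destroying the context. Returns true if the resource could
 * be released immediately; false means an active collective still
 * depends on it, which is the hazard described above. */
bool res_ctx_destroy(res_model_t *r) {
    r->ctx_alive = false;
    return !r->collective_in_use;
}
```

In this model, destroying the context between the nbi call and the sync returns false, which is why a separately owned resource for collectives avoids the problem.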

manjugv commented 6 years ago

Seriously, since I think that (ultimately) there may be a case for other nonblocking collectives, I think it would be preferable to have one operation that can complete any of the nonblocking collectives. I'm certainly open to the claim that shmem_team_sync() may not be the right one.

If we introduce only alltoall with these semantics, I prefer something that explicitly calls that out. However, if we introduce nonblocking explicit/implicit variants for all collective operations, then I agree that this does not fit well.

I also don't really get why you and @shamisp keep saying these are actually blocking (or "not nonblocking").

I guess we don’t like having a constraint on the number of outstanding operations when we call it nonblocking. :) Also, blame it on UPC and SGI; they gave a different name for this. :)

That said, MPI supports multiple nonblocking collectives on a communicator (ref). Should we consider it?

Introducing nonblocking implicit/explicit collectives is a significant change to the standard. It is a huge undertaking for the implementations. I hope it is not a part of the Teams proposal. It deserves a separate discussion and should be a separate proposal; there are a lot of semantic, implementation, and performance implications that need clarity.

gmegan / specification

Semantics: Completion of nonblocking collectives #44