Resolve context-team interaction issues

nspark commented 5 years ago

The current draft of the teams proposal puts the PE team inside the communication context and allows for the reassignment of teams to contexts. This issue is meant to detail the proposed model and potential alternatives and capture feedback or issues with any of these models as well as other interactions between teams and contexts.

Teams in Contexts

Each context is associated with a team.
Each team may be assigned to zero or more contexts.
By default (i.e., at context creation), each context is associated with SHMEM_TEAM_WORLD.
A context may have its associated team updated (e.g., via shmem_ctx_set_team).

Potential advantages

Context-based point-to-point operations can use team-relative PE numbering.
Updating the team associated to a context could be as simple as a pointer assignment.
Contexts represent semi-virtualized hardware resources. Separating context and team creation, and allowing the reassignment of teams to contexts, could minimize underutilized resources.
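As a rough illustration of the "pointer assignment" point, here is a minimal, self-contained model in plain C. It does not use the OpenSHMEM library: every `model_*` name is hypothetical, and teams are assumed to be simple strided sets of world PEs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT OpenSHMEM API types. */
typedef struct {
    int start;   /* world PE of team member 0 */
    int stride;  /* world-PE distance between consecutive members */
    int size;    /* number of PEs in the team */
} model_team_t;

typedef struct {
    const model_team_t *team;  /* team currently associated with this context */
} model_ctx_t;

/* Model of the world team for a job with npes PEs. */
model_team_t model_team_world(int npes) {
    model_team_t t = { 0, 1, npes };
    return t;
}

/* At creation, a context is associated with the world team. */
model_ctx_t model_ctx_create(const model_team_t *world) {
    assert(world != NULL);
    model_ctx_t c = { world };
    return c;
}

/* Stand-in for the proposed shmem_ctx_set_team: just a pointer assignment. */
void model_ctx_set_team(model_ctx_t *ctx, const model_team_t *team) {
    assert(ctx != NULL && team != NULL);
    ctx->team = team;
}

/* Resolve a team-relative PE on this context to a world PE, or -1 if out of range. */
int model_ctx_world_pe(const model_ctx_t *ctx, int team_pe) {
    const model_team_t *t = ctx->team;
    if (team_pe < 0 || team_pe >= t->size) return -1;
    return t->start + team_pe * t->stride;
}
```

In this model, the same context resolves team-relative PE 3 to world PE 3 while it is associated with the world team, and to world PE 6 after being reassigned to a team of even PEs `{0, 2, 4}` (start 0, stride 2, size 4). The model ignores the endpoint-connection cost raised in the concerns below.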

Disadvantages or concerns

From https://github.com/gmegan/specification/issues/28#issuecomment-401832971:
- Networks may need coordinated source- and target-side resources to enable full messaging throughput (e.g., in multi-HCA/HFI setups).
- OpenSHMEM implementations that share lower implementation layers with MPI or ones that seek to provide MPI interoperability may want to share resources and implementation paths, potentially backing a SHMEM team with an MPI communicator.
- Networks with hardware collective acceleration may need these resources to be allocated collectively.
Concurrent collective calls from multiple threads may not be possible (e.g., any internal pSync or pWrk state is per-PE, not per-thread).

Contexts in Teams

Each team is associated with a context.
Each context may be assigned to zero or more teams.
At team creation, the associated context is provided and fixed.

Potential advantages

Intended to address the concerns of "Teams in Contexts"
Could support concurrent collective calls (e.g., each thread has a team, so internal state is thread/team-specific).

Disadvantages or concerns

Generally a more-restrictive model.
May lead to creation of more teams (i.e., consumption of more resources) than with Teams-in-Contexts. For example, a software communication pipeline that wants to use N contexts on one logical team will need to create N team instances of the same logical team.
Team-relative point-to-point operations would require a new shmem_team_*-prefixed API or manual PE index translation.

Teams use `SHMEM_CTX_DEFAULT` ("Plan B")

Teams and contexts are created and managed separately.
All teams are implicitly associated with the default context (e.g., SHMEM_CTX_DEFAULT).
Team-based collective operations added.
Context-based point-to-point operations exist as-is (i.e., use "world" PE numbering); no team-based point-to-point operations are added.

Disadvantages or concerns

No opportunity for communication concurrency in collectives.
Team-relative point-to-point operations require manual PE index translation.

Contexts from Teams (updated 7/23)

Given a team, a team-specific context can be created (e.g., shmem_team_create_ctx).
Concurrent collectives on the same logical team require distinct team objects.
Operations on contexts are performed relative to the associated team. The default context and local semantic contexts (shmem_ctx_create) are associated with SHMEM_TEAM_WORLD.

Use Model

Program creates a new team: shmem_team_split_* (collective operation)
- Internal pSync and pWrk state lives with the team.
- Depending on the network, hardware offload engine state may live with the team.
Program creates team-based context(s) from that team: shmem_team_create_ctx (collective operation)
- Depending on the network, hardware offload engine state may live with the context.
Program uses team-based context(s), local-semantic contexts (shmem_ctx_create), or the default context for RMA, AMO, collective, and synchronization operations
- P2P operations are performed with team-based PE numbering. (SHMEM_TEAM_WORLD is the associated team for all local-semantic contexts and the default context.)
- Collective operations may be performed on any context; however, the team associated with the context is the point of serialization.
Program destroys team-based context(s): shmem_team_destroy_ctx (collective operation)
Program destroys the team: shmem_team_destroy (collective operation)
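To make the lifecycle above concrete, here is a small single-threaded model in plain C. It is only a sketch: the `model_*` names are hypothetical stand-ins (not OpenSHMEM calls), nothing here is collective or thread-safe, and the rule that a team cannot be destroyed while its contexts are live is an assumption read off the ordering of the steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a team: owns collective state and counts its contexts. */
typedef struct {
    int live_ctxs;              /* contexts created from this team, not yet destroyed */
    bool collective_in_flight;  /* the team is the point of serialization */
    bool destroyed;
} model_team_t;

/* Hypothetical model of a team-based context. */
typedef struct {
    model_team_t *team;
    bool live;
} model_ctx_t;

/* Stand-in for shmem_team_split_*. */
model_team_t model_team_split(void) {
    model_team_t t = { 0, false, false };
    return t;
}

/* Stand-in for shmem_team_create_ctx. */
model_ctx_t model_team_create_ctx(model_team_t *team) {
    assert(team != NULL && !team->destroyed);
    team->live_ctxs++;
    model_ctx_t c = { team, true };
    return c;
}

/* Start a collective on a context; fails if the team already has one in flight. */
bool model_collective_begin(model_ctx_t *ctx) {
    assert(ctx != NULL && ctx->live);
    if (ctx->team->collective_in_flight) return false;
    ctx->team->collective_in_flight = true;
    return true;
}

void model_collective_end(model_ctx_t *ctx) {
    assert(ctx != NULL && ctx->live);
    ctx->team->collective_in_flight = false;
}

/* Stand-in for shmem_team_destroy_ctx. */
void model_team_destroy_ctx(model_ctx_t *ctx) {
    assert(ctx != NULL && ctx->live);
    ctx->live = false;
    ctx->team->live_ctxs--;
}

/* Stand-in for shmem_team_destroy; refuses while contexts are still live (assumption). */
bool model_team_destroy(model_team_t *team) {
    assert(team != NULL);
    if (team->live_ctxs > 0) return false;
    team->destroyed = true;
    return true;
}
```

With two contexts created from one team, a collective begun on the first blocks a collective on the second until it ends, which is the sense in which the team, not the context, serializes collectives in this model.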

Potential advantages

Allows for collective wire-up of any required resources.
Allows for communication concurrency with concurrent collectives (using distinct team objects).

Disadvantages or concerns

???

Open questions + design points

What is the "final" handle for team-relative operations (including RMA, AMO, and collectives)?
- ~~If it is the team, then...~~
- If it is the context, then...
- Using distinct context handles -- even with each created from shmem_team_create_ctx -- may not be safe for concurrent collectives. They would need to be distinct contexts created from distinct teams. Not a concern; this is consistent with the requirement for global serialization of collectives using the same active set in 1.4.

nspark commented 5 years ago

On the OpenSHMEM mailing list, @RaymondMichael wrote (in reference to the teams-in-contexts idea):

Connection Creation

Let’s say we have two processes, each with two threads. In each process thread A is monitoring the Default Context and thread B has a Private Context. Contexts are backed by real hardware endpoints. The two B threads are going to do a collective together and pass their Contexts to the function. In the most basic situation, while they can inject with their Private Context, they have to target the shared Default Context because they have no means of learning of the hardware address of the Private Context that the other PE is using. Each host may have 16 HCAs / HFIs to inject with, but traffic is only coming in through a single endpoint on a single HCA / HFI.

Now let’s say you’ve added lazy connection creation. You have a helper pthread monitoring a TCP/IP socket. Process 0 can send a message to Process 1 asking for hardware endpoint addresses. The helper thread sees the message and responds with the address information of all the endpoints the PE is using. The helper pthread still won’t know which Context thread B will use for the collective. There are also scenarios where thread B has to regularly poll a socket / endpoint for incoming two sided message traffic for doing the handshake so their endpoints can then exchange real data.

Integrated MPI

My runtime started supporting integrated MPI + OpenSHMEM well over a decade ago. Other runtimes have started doing the same. MPI is heading in the direction of adding an equivalent to OpenSHMEM’s Contexts. Different MPI Communicators may be backed by different hardware endpoints. If you have a separate thread per Communicator, this reduces locking overhead and increases throughput. To send a message to another rank with a specific Communicator, you send to a specific endpoint. This is even more important with hardware tag matching. My high level MPI and OpenSHMEM software both use the same low level IB / OPA / GenZ software and resources. It would make the lives of runtime implementors a lot easier if the way that Teams use Contexts is the same way that Communicators use them. Yes I’m saying I’ll probably back a Team with the same data structure I use to back a Communicator.

Collectives

If in a collective we know which other hardware endpoints will be used for the collective then we can hand the operation off for acceleration.

Summary

It is for these reasons that I want the Team creation function to take a Context as one of its parameters. Operations with the Team will always use that Context. If we add explicit Domains, then swap Domain for Context in the parameters.

shamisp commented 5 years ago

The real problem is not related to teams. You are trying to work around the fact that a context is a one-sided object and there is no easy way to associate the initiator-side context with a target context.

This is a well-known issue with the existing design of contexts. It was brought up by Howard and Pavan a while back. What you really want is a way to create a context that is associated with some remote context in a user-controlled way.

Teams do not aim to solve this particular issue. All a team does is (a) provide a translation function and (b) potentially hold collective context (pWrk/pSync) or another HW offload mechanism (a reference to a switch offload queue or multicast group).

We happen to be considering dereferencing the team object from a context object (a pointer, with no logical connection), which essentially means we use it as a container object and nothing more. The other alternative (IMHO preferable) is to pass teams as a separate object to all functions. This will avoid the confusion between team and context association.

The other way around: I was about to suggest creating the team object first. Then you can create the context object as a collective call using the defined team object (new API). This way you will have a clear way to link the local and remote contexts.

naveen-rn commented 5 years ago

In the Teams in Contexts approach, we say that concurrent collective communication is not possible because of the symmetric sync and work array issue. I think this is the same for the Contexts in Teams approach as well: mapping a shareable context to any team is allowed, and a shareable context can be accessed by multiple threads concurrently, which is the same with any team-based collective operation. It looks like we would still face the pSync and pWrk array issue for concurrent collective communications.

Or, are we trying to make a team as a thread-based operation?

shamisp commented 5 years ago

For the upcoming version of the spec, I suggest not supporting threads + collectives, i.e., defining their combination as undefined behavior. I don't think we have enough time to discuss and resolve the threading issues. For now, I'm more inclined to pass the team as a separate, self-contained argument to collective calls (i.e., no context argument).

nspark commented 5 years ago

Responses to @shamisp

You are trying to work around the fact that a context is a one-sided object and there is no easy way to associate the initiator-side context with a target context.

I haven't forgotten about that issue. If anything, I'm trying to see whether we can resolve it here with teams. If we're providing collective creation semantics over a group of PEs, maybe this is the place to address this issue.

Teams do not aim to solve this particular issue. All a team does is (a) provide a translation function and (b) potentially hold collective context (pWrk/pSync) or another HW offload mechanism (a reference to a switch offload queue or multicast group).

Except that team-based collectives don't operate in a vacuum. Unless we take what I'd think was an odd approach and say, "Team-based collectives operate on the magic team context," then we have to address some interaction of teams and contexts. Otherwise, we might leave it undefined whether shmem_barrier_all() can ensure global completion of a collective operation.

The other alternative (IMHO preferable) is to pass teams as a separate object to all functions. This will avoid the confusion between team and context association.

Except this doesn't solve @RaymondMichael's problem at all. If anything, it makes it worse, since every collective call could have a distinct team-context pairing. I'm not saying it's ideal, but at least with teams-in-contexts, the library is told (via shmem_ctx_set_team) when the team-context pairing is changed.

The other way around: I was about to suggest creating the team object first. Then you can create the context object as a collective call using the defined team object (new API). This way you will have a clear way to link the local and remote contexts.

At first, I thought you were proposing what I've called "Contexts in Teams" above. I think "Contexts in Teams" could resolve the concerns that @RaymondMichael listed, but it seems like you're proposing that the context be created from the team. It's not clear whether you'd favor X-in-Y or Y-in-X, but I think you'd need one to ensure the pairing, since these would now be mutually bound resources. A sketch might look like:

shmem_team_t myteam = shmem_team_split(...);
shmem_ctx_t myctx = shmem_team_create_ctx(myteam, ...);
shmem_broadcast(myteam, ...); // or do we use myctx as the argument?
shmem_put(myctx, ...);
shmem_ctx_quiet(myctx);
shmem_barrier(myteam);
shmem_team_destroy_ctx(myctx);
shmem_team_destroy(myteam);

Is this along the lines of what you're thinking?

For the upcoming version of the spec, I suggest not supporting threads + collectives, i.e., defining their combination as undefined behavior.

I don't think that undefined is an option. We have a threading support API, and a library that provides SHMEM_THREAD_MULTIPLE support says that "any thread may invoke the OpenSHMEM interfaces". If we leave their interaction unspecified, then the guidance of the threading model (1.4 §9.2) takes effect. If we want to restrict the interaction, we have to be specific. Does it only work under SHMEM_THREAD_SINGLE? Do the collectives have a behavior akin to SHMEM_THREAD_SERIALIZED? We can't ignore the issue, though we may place restrictions on the behavior (just as we do with private contexts).

Responses to @naveen-rn

In the Teams in Contexts approach, we say that concurrent collective communication is not possible because of the symmetric sync and work array issue.

👍 though I think we need to be specific what "not possible" means. I would suggest that it be equivalent to SHMEM_THREAD_SERIALIZED, provided the context itself was not private.

I think this is the same for the Contexts in Teams approach as well: mapping a shareable context to any team is allowed, and a shareable context can be accessed by multiple threads concurrently, which is the same with any team-based collective operation. It looks like we would still face the pSync and pWrk array issue for concurrent collective communications.

I don't really follow this. If the context is shareable and each thread has its own team -- maybe you weren't specifically saying that -- then I would think that multiple threads should be able to execute concurrent collectives. In that case, the only shared object is the context, which is shareable.

shamisp commented 5 years ago

Responses to @nspark

You are trying to work around the fact that a context is a one-sided object and there is no easy way to associate the initiator-side context with a target context.

I haven't forgotten about that issue. If anything, I'm trying to see whether we can resolve it here with teams. If we're providing collective creation semantics over a group of PEs, maybe this is the place to address this issue.

It is actually quite a complicated issue. I don't think we have enough time to resolve it. I would suggest not rushing this and taking our time.

Teams do not aim to solve this particular issue. All a team does is (a) provide a translation function and (b) potentially hold collective context (pWrk/pSync) or another HW offload mechanism (a reference to a switch offload queue or multicast group).

Except that team-based collectives don't operate in a vacuum. Unless we take what I'd think was an odd approach and say, "Team-based collectives operate on the magic team context," then we have to address some interaction of teams and contexts. Otherwise, we might leave it undefined whether shmem_barrier_all() can ensure global completion of a collective operation.

Teams already operate in their own context (pSync/pWrk). shmem_barrier_all() can be defined as a flush on the default context. For all other contexts, the user has to call quiet followed by barrier/sync.

The other alternative (IMHO preferable) is to pass teams as a separate object to all functions. This will avoid the confusion between team and context association.

Except this doesn't solve @RaymondMichael's problem at all. If anything, it makes it worse, since every collective call could have a distinct team-context pairing. I'm not saying it's ideal, but at least with teams-in-contexts, the library is told (via shmem_ctx_set_team) when the team-context pairing is changed.

It does not solve it, but it prevents developers from making false assumptions, which would lead to broken and incorrect implementations of SHMEM.

The other way around: I was about to suggest creating the team object first. Then you can create the context object as a collective call using the defined team object (new API). This way you will have a clear way to link the local and remote contexts.

At first, I thought you were proposing what I've called "Contexts in Teams" above. I think "Contexts in Teams" could resolve the concerns that @RaymondMichael listed, but it seems like you're proposing that the context be created from the team. It's not clear whether you'd favor X-in-Y or Y-in-X, but I think you'd need one to ensure the pairing, since these would now be mutually bound resources. A sketch might look like:
shmem_team_t myteam = shmem_team_split(...);
shmem_ctx_t myctx = shmem_team_create_ctx(myteam, ...);
shmem_broadcast(myteam, ...); // or do we use myctx as the argument?
shmem_put(myctx, ...);
shmem_ctx_quiet(myctx);
shmem_barrier(myteam);
shmem_team_destroy_ctx(myctx);
shmem_team_destroy(myteam);
Is this along the lines of what you're thinking?

Exactly. Let's create the team first and then use the team to create/allocate the context as a collective (or not; it may depend on the underlying HW). I also suggest dropping (deprecating) the context from the argument list in p2p operations and replacing it with teams. I know it is a bit aggressive, but it will result in a much cleaner and more consistent interface.

For the upcoming version of the spec, I suggest not supporting threads + collectives, i.e., defining their combination as undefined behavior.

I don't think that undefined is an option. We have a threading support API, and a library that provides SHMEM_THREAD_MULTIPLE support says that "any thread may invoke the OpenSHMEM interfaces". If we leave their interaction unspecified, then the guidance of the threading model (1.4 §9.2) takes effect. If we want to restrict the interaction, we have to be specific. Does it only work under SHMEM_THREAD_SINGLE? Do the collectives have a behavior akin to SHMEM_THREAD_SERIALIZED? We can't ignore the issue, though we may place restrictions on the behavior (just as we do with private contexts).

I didn't express myself well. I suggest following MPI semantics: no two threads can call a collective on the same team concurrently; otherwise, the behavior is undefined.
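As an aside, a library could cheaply detect violations of such a rule in a debug build. The following is a minimal sketch using C11 atomics; the `guarded_*` names are hypothetical and nothing here is part of OpenSHMEM or MPI. It only flags the concurrent use, it does not make it safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-team guard: true while some thread is inside a collective. */
typedef struct {
    atomic_bool busy;
} guarded_team_t;

void guarded_team_init(guarded_team_t *team) {
    assert(team != NULL);
    atomic_init(&team->busy, false);
}

/* Returns true if the caller may proceed with a collective on this team,
 * or false if another thread is already inside one (the undefined case). */
bool guarded_collective_enter(guarded_team_t *team) {
    assert(team != NULL);
    /* atomic_exchange returns the previous value: false means we took the guard. */
    return !atomic_exchange_explicit(&team->busy, true, memory_order_acquire);
}

/* Called by the thread that successfully entered, once its collective is done. */
void guarded_collective_exit(guarded_team_t *team) {
    assert(team != NULL);
    atomic_store_explicit(&team->busy, false, memory_order_release);
}
```

A collective entry point would call `guarded_collective_enter` first and abort or report an error when it returns false.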

nspark commented 5 years ago

It is actually quite a complicated issue. I don't think we have enough time to resolve it. I would suggest not rushing this and taking our time.

I agree that it's complicated. I'm not trying to rush it. I think it's worth discussing now. Teams is looking to be the big feature of 1.5, and I'd rather delay 1.5 a bit to get a full teams API (including the teams-contexts interaction) than push through a simpler teams API in 1.5 and work through an entire extra cycle just to iron out that interaction. It's not rushing it; it's just not wanting to push it to the backburner.

Exactly. Let's create the team first and then use the team to create/allocate the context as a collective (or not; it may depend on the underlying HW). I also suggest dropping (deprecating) the context from the argument list in p2p operations and replacing it with teams. I know it is a bit aggressive, but it will result in a much cleaner and more consistent interface.

We could skip the API churn by embedding the team in the context -- which is created collectively across the team -- and not allowing the team-context connection to be modified.

shamisp commented 5 years ago

@nspark

Plan "B", which @gmegan and I discussed internally, is to leave the p2p interface for v1.5 as it is, without creating any team <-> context dependency. For now, users will be able to use the translate function to achieve exactly the same effect. Since the majority of users wrap it with another library that hides the translation, it shouldn't be a big deal?

This will buy us a bit more time to think about these issues. I think that in the past, when we discussed this "reverse" combination (team->context), it led to some issues as well.

Anyways, this is something to consider...

manjugv commented 5 years ago

Potential advantages

Context-based point-to-point operations can use team-relative PE numbering. Updating the team associated to a context could be as simple as a pointer assignment. Contexts represent semi-virtualized hardware resources. Separating context and team creation and allowing the reassignment of teams to context, could minimize underutilized resources.

This would be a collective operation and probably more expensive than a simple collective operation, since contexts are internally a set of network endpoints that are typically connected to the PEs they communicate with. When you associate a context with a team, the PE that attaches the context needs to connect to all the PEs required by the collective algorithms. Another option is to have contexts connect to all PEs, but you don’t want to do that for obvious resource-scaling reasons.

Changing the contexts associated with teams is also a very expensive operation, IMO. If we change the association (change teams), we first have to quiesce all the operations on the contexts, and then, for the newly attached context, we need new connections. Am I missing something here?

manjugv commented 5 years ago

I agree with Nick on the timing. Also, if we intend to add this functionality, and since teams and contexts are so intertwined, why delay it for the future? The issues will not go away. It might be more challenging than the current situation because you might have to design something that is compatible with both contexts and teams, rather than just contexts.

nspark commented 5 years ago

Changing the contexts associated with teams is also a very expensive operation, IMO. If we change the association (change teams), we first have to quiesce all the operations on the contexts, and then, for the newly attached context, we need new connections. Am I missing something here?

Under my notion of the collectively-create-context-from-team idea, the context-team association is immutable, so changing the context associated with a team doesn't happen.

manjugv commented 5 years ago

Ok.

“Contexts represent semi-virtualized hardware resources. Separating context and team creation and allowing the reassignment of teams to context, could minimize underutilized resources.”

This led me to think that you are proposing to change context-to-team association after creation.

nspark commented 5 years ago

This led me to think that you are proposing to change context-to-team association after creation.

Sorry. You're mixing up the potential models. The excerpt you quoted came from the "Teams in Contexts" model. The idea that I think @shamisp and I are trending toward is the "Contexts from Teams" model, not to be confused with "Contexts in Teams".

Edit: I've updated the main issue description to better separate these ideas.

shamisp commented 5 years ago

@nspark Let's say we introduce this collective context. How does it co-exist with regular contexts? Shall we add a special API to query the type of a context? How will users know which context-creation API they have to use to create a context?

nspark commented 5 years ago

This is all just spitballing, but...

How does it co-exist with regular contexts?

I think a user application has access to both types of contexts (i.e., local-only contexts and team-based contexts).

The local-only contexts that were added in 1.4 are defined to operate on the default team (SHMEM_TEAM_WORLD). This means that all the point-to-point operations (e.g., RMA, AMO) use the global PE numbering.
The team-based contexts provide the same capabilities for communication pipelining and context-based synchronization (e.g., fence, quiet) as local-only contexts, but point-to-point operations are performed with the team-relative PE numbering.

In both cases, concurrent collectives are not allowed on the same team, which stays consistent (in my understanding) with what we've drafted so far. So, an application can only use concurrent collectives through the new API by first creating multiple teams (potentially of the same logical team) and using those team handles (or context handles, whichever we decide) as the "context" for a collective operation.

Shall we add a special API to query the type of a context? How will users know which context-creation API they have to use to create a context?

Whether an application author wants local-only or team-based contexts is really an implementation decision, likely derived from the constraints identified above. We've also discussed how, on some architectures, team-based contexts might allocate additional receiver/target side resources that may reduce the impedance mismatch of N sender/source-side resources driving RMA on 1 receiver/target-side resource. So, there may be non-semantic performance effects as well. In either case, it's an application-driven choice and the resultant context handle is used the same regardless of the creation mechanism, except that...

Assuming they both use the same user-facing type (i.e., shmem_ctx_t), the main (only?) case (that I've thought of so far) for which there is API ambiguity is destruction. For example, a user could create an array of context handles and populate all the even-indexed handles with local-only contexts and all the odd-indexed handles with team-based contexts. This would be rather bizarre. But, the user would need to call the appropriate destruction routine (i.e., shmem_ctx_destroy or shmem_team_destroy_ctx), which would not be explicit in the type of the object.

The context handle type (shmem_ctx_t) is opaque to the user, but it might be possible for the implementation side to have a tag field on the structure indicating whether it is a local-only or team-based context. It might be the case that this tag would only need to be inspected during context destruction. If so, then only one destruction API (shmem_ctx_destroy) would be necessary, and it would have collective semantics when destroying a team-based context and non-collective semantics (e.g., no barrier) when destroying a local-only context.
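A minimal sketch of that tag-dispatch idea, in plain C with hypothetical names (`tagged_ctx_t`, `tagged_ctx_destroy`, and the stand-in barrier are illustrative only, not how any real implementation lays out `shmem_ctx_t`):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical implementation-side tag hidden behind the opaque handle. */
typedef enum { CTX_LOCAL_ONLY, CTX_TEAM_BASED } ctx_kind_t;

typedef struct {
    ctx_kind_t kind;
    int barriers_run;  /* counts stand-in barrier calls, for illustration only */
    int released;      /* nonzero once the context's resources are released */
} tagged_ctx_t;

/* Stand-in for a barrier across the context's team; a real one would synchronize PEs. */
void model_team_barrier(tagged_ctx_t *ctx) {
    ctx->barriers_run++;
}

/* Single destruction entry point: collective for team-based contexts,
 * purely local (no barrier) for local-only contexts. */
void tagged_ctx_destroy(tagged_ctx_t *ctx) {
    assert(ctx != NULL && !ctx->released);
    if (ctx->kind == CTX_TEAM_BASED) {
        model_team_barrier(ctx);
    }
    ctx->released = 1;
}
```

The tag is inspected only at destruction here, so the common RMA/AMO paths would not pay for the distinction.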

While one might bristle at this idea at first, recall that contexts already have an options field and an associated option value (SHMEM_CTX_PRIVATE) that determines whether a context must be destroyed by the thread that created it. This is an attribute of the context that is not exposed in the type, but affects whether the destruction of the context is well-defined.

nspark commented 5 years ago

After talking this over with @BryantLam and some others, we thought it could be helpful to identify a couple use-cases for teams + contexts that are motivating problems for us.

Use-Case 1:

Given a current team, which may be SHMEM_TEAM_WORLD, each PE wants to create an N-deep pipeline of collective operations (e.g., nonblocking all-to-all). Each pipeline stage requires one team (for pSync state) and one context (for nonblocking completion). In the absence of a nonblocking all-to-all, each pipeline stage may execute a blocking all-to-all in a unique thread.

Use-Case 2:

Consider a problem space of 1..N, each of which requires M PEs for processing (e.g., due to memory constraints). Divide the full PE space (M * L, typically with L << N) into teams of M PEs; i.e., the x-axis team resulting from shmem_team_split_2D(M, ...). Let the "leader" team be the y-axis team for which x = 0.
For each x-axis team, let all M PEs create T threads, each of which wants its own private context for maximum communication concurrency. Within the x-axis team, primary operations are RMA and AMOs performed by the threads. Collectives (e.g., barrier) may be performed occasionally (i.e., they are not performance critical).
For the leader team, primary operations are AMOs for work division of the problem space, as well as occasional collectives.
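For concreteness, here is a small sketch of the PE layout Use-Case 2 has in mind. It assumes a row-major mapping (x = pe mod M, y = pe div M), which is my assumption about how the 2D split would number PEs; the `grid_*` names are hypothetical helpers, not library calls.

```c
#include <assert.h>

/* Hypothetical grid coordinates of a world PE in an M-wide layout. */
typedef struct {
    int x;  /* index within the PE's x-axis team (0..M-1) */
    int y;  /* which x-axis team the PE belongs to (0..L-1) */
} grid_pos_t;

/* Assumed row-major mapping: consecutive world PEs fill one x-axis team. */
grid_pos_t grid_pos(int world_pe, int m) {
    assert(m > 0 && world_pe >= 0);
    grid_pos_t p = { world_pe % m, world_pe / m };
    return p;
}

/* The "leader" team is the y-axis team whose members all have x == 0. */
int grid_in_leader_team(int world_pe, int m) {
    return grid_pos(world_pe, m).x == 0;
}

/* Two PEs share an x-axis team exactly when their y coordinates match. */
int grid_same_x_team(int pe_a, int pe_b, int m) {
    return grid_pos(pe_a, m).y == grid_pos(pe_b, m).y;
}
```

With M = 4, world PEs 4 through 7 form one x-axis team, and PEs 0, 4, 8, ... form the leader team.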

My take on how the notional proposals align with the use-cases is:

| Proposal | Use Case 1 | Use Case 2 |
|---|---|---|
| Teams in Contexts | +++ | +++ |
| Contexts in Teams | +++ | - |
| Teams use `SHMEM_CTX_DEFAULT` | --- | - |
| Contexts from Teams | +++ | +++ (revised) |

where:

Use Case 1 is primarily affected by whether the collective of each pipeline stage operates on a unique context. The Teams use SHMEM_CTX_DEFAULT proposal only supports one context for collectives; the other proposals all have a 1-to-1 linkage of the team and context due to the need for concurrent collectives.
Use Case 2 suffers from the resource expense issue of replicating the same logical team to expose multiple contexts to the communicating threads. Revised: Both Teams in Contexts and Contexts from Teams do fine; the other proposals can use 1.4's local contexts for the threads, but would require the shmem_team_translate function for all point-to-point operations.

gmegan commented 5 years ago

I am trying to see how Contexts from Teams is a problem for Case 2... If each PE needs multiple contexts for a single team, then the team doesn't need to be replicated. We could just call context create multiple times, each time passing the same team argument? As I understand it, the issue with teams in contexts is more about the fact that a context cannot switch its team, since the team is going to define how the context resources are connected for point to point.

I do see an issue when each PE has N teams and T threads, where each thread has a private context that it wants to use for all N teams. In that case, using "Contexts from Teams", each thread would actually need to make N contexts, one for each of the N teams. This is overallocation of resources in some cases. For example if there is a team of PEs 2, 4, 6, 8... and another team of PEs 4, 8, 12... and there is no need to separate communication between the teams, then making a context for each team seems unnecessary. As you say, it is better to just use one context and then shmem_team_translate for the PEs in the second team. Maybe this is your Case 2 and I misread it.
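To illustrate the translation step for the two example teams, here is a self-contained sketch in plain C. It assumes strided teams and bounds them at PE 8 and PE 12 just for the example; the `strided_*` names are hypothetical and only mimic what a shmem_team_translate-style routine would do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strided team: world PEs start, start+stride, ... (size members). */
typedef struct {
    int start;
    int stride;
    int size;
} strided_team_t;

/* Team-relative PE -> world PE, or -1 if out of range. */
int strided_to_world(const strided_team_t *t, int team_pe) {
    assert(t != NULL && t->stride > 0);
    if (team_pe < 0 || team_pe >= t->size) return -1;
    return t->start + team_pe * t->stride;
}

/* World PE -> team-relative PE, or -1 if the PE is not a member. */
int strided_from_world(const strided_team_t *t, int world_pe) {
    assert(t != NULL && t->stride > 0);
    int offset = world_pe - t->start;
    if (offset < 0 || offset % t->stride != 0) return -1;
    int idx = offset / t->stride;
    return (idx < t->size) ? idx : -1;
}

/* Translate a PE index from src team numbering to dst team numbering, via world PEs. */
int strided_translate(const strided_team_t *src, int src_pe,
                      const strided_team_t *dst) {
    int world_pe = strided_to_world(src, src_pe);
    return (world_pe < 0) ? -1 : strided_from_world(dst, world_pe);
}
```

With teams `{2, 4, 6, 8}` and `{4, 8, 12}`, member 1 of the second team (world PE 8) translates to member 3 of the first, while member 2 (world PE 12) has no counterpart and yields -1. A thread holding one context on the first team could use this to address PEs of the second team without a second context.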

So, if the context API itself is still a bit flexible, which sounds like it is the case, is it possible to add the ability to duplicate contexts so that the new context shares resources with the old one? That is, it does not isolate ordering and completion (quiet on one will quiet the other, etc.), but it does provide a different point-to-point connection mapping.

Jim was proposing on the call to add explicit domain management to separate out resources from contexts and then allow multiple contexts to have the same domain. I worry that this is too many subtle differences to deal with in a programming model. I don't know that I could explain domain vs. context very clearly. I think I can explain communication isolation (context resources) and point to point connection mapping of communication resources (teams), and then say that a context encapsulates all of these features, and may share resources with other contexts.

shamisp commented 5 years ago

@nspark - maybe I'm missing something, but I struggle to understand why case 2 cannot be efficiently handled with context creation from a team object. Multiple contexts may point to the same team. The team can potentially manage a pool of pSync/pWrk arrays.
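A rough sketch of what such a pool could look like on one PE, with hypothetical names and an arbitrary slot count. It assumes every PE in the team acquires and releases slots in the same order, so that the lowest-free-slot rule picks the same (symmetric) pSync/pWrk slot everywhere; it is not thread-safe as written.

```c
#include <assert.h>
#include <stddef.h>

#define SYNC_POOL_SLOTS 4  /* arbitrary illustrative pool size */

/* Hypothetical per-team pool: bit i set means pSync/pWrk slot i is in use. */
typedef struct {
    unsigned in_use;
} sync_pool_t;

void sync_pool_init(sync_pool_t *pool) {
    assert(pool != NULL);
    pool->in_use = 0u;
}

/* Take the lowest free slot, or return -1 if all slots are busy. */
int sync_pool_acquire(sync_pool_t *pool) {
    assert(pool != NULL);
    for (int i = 0; i < SYNC_POOL_SLOTS; i++) {
        if (!(pool->in_use & (1u << i))) {
            pool->in_use |= (1u << i);
            return i;
        }
    }
    return -1;
}

/* Return a previously acquired slot to the pool. */
void sync_pool_release(sync_pool_t *pool, int slot) {
    assert(pool != NULL && slot >= 0 && slot < SYNC_POOL_SLOTS);
    assert(pool->in_use & (1u << slot));
    pool->in_use &= ~(1u << slot);
}
```

Each context created from the team would acquire a slot for the duration of a collective, which bounds the number of concurrent collectives per team by the pool size.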

RaymondMichael commented 5 years ago

I think what we're saying is that:

- Teams are analogous to MPI_Groups
- We create Contexts from Teams, which are analogous to MPI_Comms
- Only one collective can be in flight for a Context at a time
- pSync / pWrk is embedded in the Context
- (In a later release, Contexts derived from Teams may optionally have their own heaps)

I'm a fan of Domains and this makes them more difficult, but this otherwise works for me.

shamisp commented 5 years ago

> - Teams are analogous to MPI_Groups
> - We create Contexts from Teams, which are analogous to MPI_Comms
> - Only one collective can be in flight for a Context at a time
> - pSync / pWrk is embedded in the Context

psync/pwrk belong in the team, not the context. team == communicator. A context is similar to the MPI endpoints proposal, which was rejected.

RaymondMichael commented 5 years ago

@shamisp does your concept of Teams have an implicit initial Context? I.e. if I create a Team and pass the Team to a collective, the collective would use the Context that was created by the library as part of setting up the Team?

nspark commented 5 years ago

> @gmegan: I am trying to see how Contexts from Teams is a problem for Case 2... If each PE needs multiple contexts for a single team, then the team doesn't need to be replicated. We could just call context create multiple times, each time passing the same team argument? [...]

> @shamisp: maybe I'm missing something, but I struggle to understand why Case 2 cannot be efficiently handled with context creation from a team object. Multiple contexts may point to the same team. The team can potentially manage a pool of psync/pwork arrays.

I think that my notion of Contexts from Teams just wasn't completely set on whether the context-to-team association was N-to-1 or 1-to-1. I agree that allowing the N-to-1 (sub)model improves usability. From an API perspective, I think the distinction primarily matters in determining whether it is the team handle or the context handle that is provided to P2P and collective operations.

To be clear, I think allowing the N-to-1 model means that the context handle is the primary argument (even for collectives). For example, if two threads are using distinct contexts created from the same team, a team barrier called by one thread should not necessarily affect a context used by the other thread (e.g., SHMEM_CTX_DEFAULT), which I think arises in @RaymondMichael's question about an implicit initial context.

> @RaymondMichael: does your concept of Teams have an implicit initial Context? I.e. if I create a Team and pass the Team to a collective, the collective would use the Context that was created by the library as part of setting up the Team?

I think we can avoid this confusion by making the context handle the primary argument for collectives. In this case, a team handle only really encapsulates the pSync/pWrk state and the PE mapping. It can't be used for collectives or P2P operations because it has no associated network resource.

RaymondMichael commented 5 years ago

@nspark I like the idea of the Context being the primary handle. I think that means the pSync / pWrk needs to be in the Context though. If I'm doing collectives with two different Contexts and they each need pSync / pWrk state, then having them in the Team seems like a problem.

nspark commented 5 years ago

@RaymondMichael I would be concerned about resource allocation and consumption if every team-derived context had pSync/pWrk state. If they did, then Use-Case 2 would consume far more pSync/pWrk state than it needs. (Of course, the argument for hiding the pSync/pWrk state comes from the fact we're far past the memory restrictions of the T3E days.) It feels like overkill to ensure that contexts can always support concurrent collectives.

I understand if your concern is that two contexts created from the same team have SHMEM_THREAD_SERIALIZED behavior with respect to each other for collective operations. Conceptually, this doesn't seem terrible to me, but it does require application authors to account for the context-to-team mapping in their use of contexts. I haven't thought deeply about whether it presents issues for library authors using SHMEM, but I think that for those cases the SHMEM objects need to be managed entirely within the library (e.g., there are no outward-facing foolib_fn(shmem_ctx_t, ...) functions).
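The serialization constraint described here can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `Team`, `Context`, and the `collective_in_flight` flag are hypothetical stand-ins for a team that owns a single set of pSync/pWrk state, and raising an error on overlap is a choice of this model, not something the proposal specifies.

```python
class Team:
    """Hypothetical team that owns one set of collective state (stand-in for pSync/pWrk)."""

    def __init__(self, name):
        self.name = name
        self.collective_in_flight = False


class _Collective:
    """Context manager marking the team's collective state as busy for its duration."""

    def __init__(self, team):
        self.team = team

    def __enter__(self):
        if self.team.collective_in_flight:
            # Assumption of this sketch: overlapping collectives are reported as an error.
            raise RuntimeError(
                f"team {self.team.name!r} already has a collective in flight")
        self.team.collective_in_flight = True
        return self

    def __exit__(self, exc_type, exc, tb):
        self.team.collective_in_flight = False
        return False  # never swallow exceptions


class Context:
    """Hypothetical context; many contexts may refer to the same team (N-to-1)."""

    def __init__(self, team):
        self.team = team

    def barrier(self):
        # The context is the handle, but the collective state lives in the team.
        return _Collective(self.team)
```

Two contexts created from the same team then behave as serialized with respect to each other for collectives: entering `ctx_b.barrier()` while `ctx_a.barrier()` is still active fails, whereas back-to-back use succeeds.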

naveen-rn commented 5 years ago

@nspark, If I understand the proposal correctly, the order of usage will be as follows - correct me if I'm wrong:

1. Create a team.
2. Create a context.
   a. Create the context with a team argument passed as input. This would come from a new routine with two-sided semantics, and it would differ from the existing routine with one-sided local semantics.
   b. We could also create a context with one-sided semantics using the existing routine.
3. All P2P operations on contexts will have implicit internal translation (using either the explicitly mapped team or the default SHMEM_TEAM_WORLD).
4. All collectives will take a context as argument instead of a team object.
   a. If the context doesn't have a team mapping (from creation), it will use SHMEM_TEAM_WORLD.
   b. Otherwise, it will use the explicitly mapped team.
   c. It is the user's responsibility to handle/avoid concurrent collectives from contexts sharing the same team.
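The flow above can be sketched as a small Python model. Everything here is hypothetical (`Team`, `Context`, `put_target`, `collective_members`, and the 8-PE world size); it only illustrates a context falling back to the world team when no team is given, and PE numbers being translated through whichever team the context carries.

```python
WORLD_SIZE = 8  # assumed job size for this sketch


class Team:
    """Hypothetical team: an ordered list of world PE numbers."""

    def __init__(self, world_pes):
        self.world_pes = list(world_pes)

    def to_world(self, team_pe):
        return self.world_pes[team_pe]


# Stand-in for SHMEM_TEAM_WORLD: the identity mapping over all PEs.
TEAM_WORLD = Team(range(WORLD_SIZE))


class Context:
    """Hypothetical context that carries a team, defaulting to the world team."""

    def __init__(self, team=None):
        self.team = team if team is not None else TEAM_WORLD

    def put_target(self, pe):
        """World PE that a put addressed to `pe` on this context would reach."""
        return self.team.to_world(pe)

    def collective_members(self):
        """World PEs that would take part in a collective issued on this context."""
        return list(self.team.world_pes)


evens = Team(range(0, WORLD_SIZE, 2))   # world PEs 0, 2, 4, 6
team_ctx = Context(evens)               # step 2a: context created with a team
world_ctx = Context()                   # step 2b: existing-style context
```

In this model the world team is just the identity mapping, so both kinds of context go through the same translation path.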

Questions:

1. In the future, when we have support for multiple symmetric heaps, will they be a property of the team object or the context object?
2. Should team creation still need to be a collective operation? Are psync and pwrk the only reason to keep team creation collective?

RaymondMichael commented 5 years ago

@naveen-rn Team or Context creation needs to be collective because we need to exchange hardware connection information. If my PE is using ten different network endpoints, I want the other PEs to know which one to send their packets to in order to land nearest the right CPU and memory. This information is important for wiring up a few different hardware types.

naveen-rn commented 5 years ago

> @naveen-rn Team or Context creation needs to be collective because we need to exchange hardware connection information.

I can see the reason for context creation being a collective operation. As per @nspark's N-to-1 context-to-team mapping proposal, we need to create a team first and get the team object before creating a context. In that case, if we are not worried about psync/pwrk, team creation doesn't need to be collective; only the context creation needs to be collective.

shamisp commented 5 years ago

@naveen-rn I would put the symmetric heap on the team. The same is true for psync/pwork. The team (with the collective flag set) is the only object that is aware of collective knowledge. A context might be allocated collectively, but it has no idea whether it will be used in a collective.

nspark commented 5 years ago

I updated Context from Teams with a more detailed description, including a use-flow similar to what @naveen-rn shared.

> In future, when we have support for multiple symmetric heap - will it be part of the property of team object or context object?

I agree with @shamisp and would put it as part of the team.

> Should the team creation argument still need to be a collective operation? is psync and pwrk the only reason to keep the team creation operation collective?

Team creation, assuming the NOCOLLECTIVE flag is not provided, is collective. Since team-based context creation is collective, one needs to be able to call collectives on the team itself. Thus, one could not create a context from a NOCOLLECTIVE team.

This may not be just for the pSync or pWrk state, though I think those live there, too. Some networks with hardware collective offload engines may be better suited to have this resource associated with the team; some others, with the context. It just depends, and I think this API provides enough flexibility for either case.

shamisp commented 5 years ago

@nspark - and this makes NOCOLLECTIVE useless, since there is no way to pass such a team to a p2p call.

nspark commented 5 years ago

> and this makes NOCOLLECTIVE useless since there is no way to pass such team to p2p call.

I think its use is somewhat analogous to ORNL's sets/groups proposal. If one needed to divide up the PE space in one way (e.g., split_strided) first, then another way (e.g., split_2D), but only cared about actually using the latter team for communication operations, one could do so without the overhead of any wasted resources in the former team.

That said, if it is enough of a simplification, we could drop NOCOLLECTIVE entirely.

shamisp commented 5 years ago

@nspark - thinking about this. It is useful if I use the translate function in combination with a one-sided context. Essentially, it is the only way to decouple the context from the team.

naveen-rn commented 5 years ago

@nspark @shamisp Do we destroy a context (1-to-1 association) or all the associated contexts (N-to-1 association) when a team is destroyed?

naveen-rn commented 5 years ago

> Program destroys team-based context(s): shmem_team_destroy_ctx (collective operation)
> Program destroys the team: shmem_team_destroy (collective operation)

I missed this - IIUC, it looks like there is no implicit ctx destroy when the team is destroyed, even for an associated shared context. Users have to explicitly destroy the context (all properties) before destroying a team.
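The ordering requirement can be illustrated with a small Python model. This is a sketch only: `Team`, `Context`, and `live_contexts` are hypothetical, and raising an error when a team is destroyed while contexts are still alive is an assumption of this model (the thread does not say what an implementation would do in that case).

```python
class Context:
    """Hypothetical team-based context that must be destroyed explicitly."""

    def __init__(self, team):
        self.team = team

    def destroy(self):
        # Explicit destroy removes the context from its team's bookkeeping.
        self.team.live_contexts.discard(self)


class Team:
    """Hypothetical team that tracks the contexts created from it."""

    def __init__(self):
        self.live_contexts = set()
        self.destroyed = False

    def create_ctx(self):
        ctx = Context(self)
        self.live_contexts.add(ctx)
        return ctx

    def destroy(self):
        # No implicit context destroy: the caller must clean up contexts first.
        if self.live_contexts:
            raise RuntimeError(
                f"{len(self.live_contexts)} context(s) still alive on this team")
        self.destroyed = True
```

A correct program in this model destroys every context created from the team and only then destroys the team itself.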

shamisp commented 5 years ago

@nspark - just to clarify: my understanding is that we need to pass a context to shmem_barrier only due to the explicit quiet semantics within barrier? Technically, each thread can call quiet followed by sync.

shamisp commented 5 years ago

@naveen-rn IMHO we would have to introduce collective destroy.

nspark commented 5 years ago

@shamisp I'm fine with shmem_barrier(shmem_ctx_t ctx), but are you okay with that and shmem_sync(shmem_team_t team)? Is that too pedantic?

nspark commented 5 years ago

On today's call, @jdinan suggested an alternate model, the highlights of which I'm trying to capture here:

Team creation is collective and includes parameters specifying whether the team will use collectives and how many contexts will be created on this team.
Team-based contexts are created locally (not collectively).
Teams are used for collective operations.
Contexts are used for point-to-point and memory-ordering operations.
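A rough Python model of this alternative may help. It is a sketch under stated assumptions, not the proposed API: `Team`, `Context`, `create_ctx`, and the resource labels are hypothetical, and returning `None` when the pool is exhausted is a choice made for this illustration.

```python
class Context:
    """Hypothetical context bound to one pre-allocated resource from its team."""

    def __init__(self, team, resource):
        self.team = team
        self.resource = resource


class Team:
    """Hypothetical team whose (collective) creation reserves a pool of context resources."""

    def __init__(self, members, num_contexts, collectives=True):
        self.members = list(members)
        self.collectives = collectives  # False models a NOCOLLECTIVE team
        # Resources are set up once, at team creation time.
        self._pool = [f"resource-{i}" for i in range(num_contexts)]

    def create_ctx(self):
        """Local (non-collective) operation: hand out a resource from the pool."""
        if not self._pool:
            return None  # assumption: exhaustion is reported, not silently shared
        return Context(self, self._pool.pop(0))

    def barrier_members(self):
        """Collectives are issued on the team, not on a context."""
        if not self.collectives:
            raise RuntimeError("team was created without collective support")
        return list(self.members)
```

In this model, context creation touches only local state, which is what allows it to be non-collective, while anything that needs agreement among PEs happens in the `Team` constructor.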

naveen-rn commented 5 years ago

> Team creation is collective and includes parameters specifying whether the team will use collectives and how many contexts will be created on this team.
> Team-based contexts are created locally (not collectively).

I'm not sure if my understanding is correct, but these two semantics look like the implementation will be allowed to create a pool of network resources when a team is created, which are then mapped to contexts when those are locally created. Doesn't this have performance implications? I thought creating multiple QPs statically would have a larger overall performance impact than dynamically creating them when needed.

> Teams are used for collective operations.

Depending on the implementation, it looks like every team will have an implicit context or some form of resource for hardware collective engine created with it.

RaymondMichael commented 5 years ago

While it does mean creating a few IB resources, it's a lot less than you might think. Today I only need to create one QP at Context creation time; the rest can be lazily allocated.

khamidouche commented 5 years ago

I actually like the fact that we allocate the resources in advance during team creation, and then context creation just does the assignment from the pool. For instance, on GPUs I should avoid dynamic resource allocation, so I would create all the resources before I start the GPU kernel.

wfaderhold21 commented 5 years ago

@nspark @jdinan I missed the call, but am I correct in understanding that this proposed solution will restrict teams to only collective operations? Is the NOCOLLECTIVE option going away?

jdinan commented 5 years ago

@wfaderhold21 Nope, NOCOLLECTIVE stays. Users can create a team that's used to create contexts, but won't be used for collectives.

jdinan commented 5 years ago

Slides presented at Aug. 2 working group meeting: 08-02-2018 -- Teams and Contexts.pdf

khamidouche commented 5 years ago

I had a conflict and missed the call. But do I understand correctly that the team will pre-allocate "all the contexts" and then ctx_create_team will just take from the pool?

jdinan commented 5 years ago

Yes, the idea is to enable implementations that want collective context allocation (or collective context-domain allocation) to accomplish this during team creation. From the discussion, this appears to be a good path for InfiniBand when using RC. With unconnected/datagram models, the resource pre-allocation may not be necessary (e.g., it's not necessary on Omni-Path when using reliable datagrams).

wfaderhold21 commented 5 years ago

So, just to make sure I understand this:

Team creation is always collective now, including team_create_strided
NOCOLLECTIVE only indicates to the library that the team should not have a psync/pwork allocation or any collective offload initialization
Team creation also creates a pool of contexts for the team defined by SHMEM_TEAM_NUM_CTX
Users can retrieve a context with shmem_ctx_create_team
shmem_put parameters are "extended" to support contexts from the Teams and these are used as a PE mapping (e.g., even numbered PEs in a Team, then shmem_put with a context from that Team on PE 1 is actually on PE 2 from SHMEM_TEAM_WORLD)

Now some simple questions:

1. Why should team creation be collective (i.e., team_create_strided()) if NOCOLLECTIVE is used?
2. Why expose contexts allocated during team creation to the user? Why not allow the library to manage this?
3. Following 2, why overload the concept of contexts for shmem_put instead of creating a shmem_team_put? Doesn't this require the implementation to always perform a PE translation on shmem_put even if using SHMEM_TEAM_WORLD? This translation would likely involve overhead in either space or time, which is not necessary.
4. Is there a default number of contexts for a team to create?

naveen-rn commented 5 years ago

@jdinan In this new proposal, we haven't figured out how to pass the number of contexts as an argument to the team creation operation. So, let us consider we pass only the number of private contexts for simplicity.

What is considered as a successful team creation? On return from the team creation call, does it guarantee that the asked number of private contexts are available to the users during subsequent context creation operation?
Are we planning to move the return value from the shmem_ctx_create to shmem_team_create operation?

Say, if the shmem_team_create operation doesn't guarantee anything about the resource availability - in this case it seems we are increasing the complexity of the resource management in over-allocation scenarios.

jdinan commented 5 years ago

@wfaderhold21 Your understanding is correct. With regard to your questions:

1. The implementation may still want to perform collective actions to set up context resources. This is the "wire-up" for connected transports that we have been discussing.
2. and 4. Implementations should support automatic resource management and users can provide additional information to help tune resource management. The default behavior is implementation-defined, but should ideally support any number of contexts. The approach taken so far by the specification involves hints and tuning knobs. If we prefer users to manage resources more explicitly, we should discuss domains.
3. I don't think this is as bad as overloading or introducing an aspect-like concept. Associating a context to a team is a generalization of the new API we introduced in 1.4. The PE index translation does add a small amount of overhead (time); we would need to accept this as a standardization body in order to move forward with this change. Bear in mind that this overhead is always there when you use a team, regardless of how we integrate them with the API. It sounds like you are suggesting that we have a separate API for SHMEM_TEAM_WORLD? Duplicating the point-to-point API involves a lot of work for implementations, especially to double the set of API routines that are being tested for correctness and performance regressions. It's much easier to have all APIs use one implementation, which is what I expect most implementors would do.

@naveen-rn You raise good points. I think we would want team creation options to be requirements, not hints. It's better to inform the user of a resource allocation issue as early as possible.

shamisp commented 5 years ago

@gmegan and I had the same concern as @naveen-rn. We also think that the option must be a requirement, and if for some reason the team cannot be allocated, we should return an error (TEAM_NULL?).
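The "requirement, not hint" behavior could look roughly like the following Python model. It is a sketch only: `Node`, `team_create`, the fixed hardware budget, and using `None` as a stand-in for a TEAM_NULL-style error value are all assumptions made for illustration.

```python
TEAM_NULL = None  # stand-in for the error value floated in the thread


class Node:
    """Hypothetical per-PE view of a fixed budget of hardware contexts."""

    def __init__(self, hw_contexts):
        self.free = hw_contexts

    def team_create(self, members, num_contexts):
        """All-or-nothing: reserve every requested context or fail immediately."""
        if num_contexts > self.free:
            # The request is a requirement, so report the shortfall now
            # instead of failing later at context creation.
            return TEAM_NULL
        self.free -= num_contexts
        return {"members": list(members), "num_contexts": num_contexts}
```

Under this model, a successful return from team creation guarantees that the requested number of contexts can later be created, and a failed request leaves the budget untouched.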
