Resolve context-team interaction issues

nspark commented 5 years ago

The current draft of the teams proposal puts the PE team inside the communication context and allows for the reassignment of teams to contexts. This issue is meant to detail the proposed model and potential alternatives and capture feedback or issues with any of these models as well as other interactions between teams and contexts.

Teams in Contexts

Each context is associated with a team.
Each team may be assigned to zero or more contexts.
By default (i.e., at context creation), each context is associated with SHMEM_TEAM_WORLD.
A context may have its associated team updated (e.g., via shmem_ctx_set_team).

Potential advantages

Context-based point-to-point operations can use team-relative PE numbering.
Updating the team associated with a context could be as simple as a pointer assignment.
Contexts represent semi-virtualized hardware resources. Separating context and team creation, and allowing teams to be reassigned to contexts, could minimize underutilized resources.
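
To make the first two advantages concrete, here is a minimal sketch of the "Teams in Contexts" bookkeeping. Everything in it is hypothetical: the `toy_*` types and functions are illustrative stand-ins rather than any implementation's internals, the team is assumed to be a simple strided set of PEs, and the 8-PE world size is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical internal types; not part of any OpenSHMEM API. */
typedef struct {
    int start;   /* world PE of team member 0 */
    int stride;  /* world-PE distance between consecutive members */
    int size;    /* number of PEs in the team */
} toy_team_t;

typedef struct {
    const toy_team_t *team;  /* team whose PE numbering this context uses */
} toy_ctx_t;

/* Stand-in for SHMEM_TEAM_WORLD in an assumed 8-PE job. */
static const toy_team_t TOY_TEAM_WORLD = { 0, 1, 8 };

/* At creation, a context is associated with the world team. */
static void toy_ctx_init(toy_ctx_t *ctx)
{
    ctx->team = &TOY_TEAM_WORLD;
}

/* Reassignment in the spirit of shmem_ctx_set_team: a single pointer store. */
static void toy_ctx_set_team(toy_ctx_t *ctx, const toy_team_t *team)
{
    assert(team != NULL);
    ctx->team = team;
}

/* Resolve a team-relative PE used on this context to a world PE. */
static int toy_ctx_target_pe(const toy_ctx_t *ctx, int team_pe)
{
    assert(team_pe >= 0 && team_pe < ctx->team->size);
    return ctx->team->start + team_pe * ctx->team->stride;
}
```

In this sketch, a point-to-point call on a context would pass its PE argument through `toy_ctx_target_pe` before touching the network, so the caller only ever sees team-relative numbering.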

Disadvantages or concerns

From https://github.com/gmegan/specification/issues/28#issuecomment-401832971:
- Networks may need coordinated source- and target-side resources to enable full messaging throughput (e.g., in multi-HCA/HFI setups).
- OpenSHMEM implementations that share lower implementation layers with MPI or ones that seek to provide MPI interoperability may want to share resources and implementation paths, potentially backing a SHMEM team with an MPI communicator.
- Networks with hardware collective acceleration may need these resources to be allocated collectively.
Concurrent collective calls from multiple threads may not be possible (e.g., any internal pSync or pWrk state is per-PE, not per-thread).

Contexts in Teams

Each team is associated with a context.
Each context may be assigned to zero or more teams.
At team creation, the associated context is provided and fixed.

Potential advantages

Intended to address the concerns of "Teams in Contexts"
Could support concurrent collective calls (e.g., each thread has a team, so internal state is thread/team-specific).

Disadvantages or concerns

Generally a more-restrictive model.
May lead to creation of more teams (i.e., consumption of more resources) than with Teams-in-Contexts. For example, a software communication pipeline that wants to use N contexts on one logical team will need to create N team instances of the same logical team.
Team-relative point-to-point operations would require a new shmem_team_*-prefixed API or manual PE index translation.

Teams use `SHMEM_CTX_DEFAULT` ("Plan B")

Teams and contexts are created and managed separately.
All teams are implicitly associated with the default context (e.g., SHMEM_CTX_DEFAULT).
Team-based collective operations added.
Context-based point-to-point operations exist as-is (i.e., use "world" PE numbering); no team-based point-to-point operations are added.

Disadvantages or concerns

No opportunity for communication concurrency in collectives.
Team-relative point-to-point operations require manual PE index translation.
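
For reference, the manual PE index translation mentioned above might look like the following for a team described by a (PE_start, PE_stride, PE_size) triple. This is only a sketch: the helper names are hypothetical, and it assumes the team is a plain strided subset of the world PEs.

```c
#include <assert.h>

/* Team-relative PE -> world PE for a strided team. */
static int team_to_world_pe(int pe_start, int pe_stride, int pe_size,
                            int team_pe)
{
    assert(pe_size > 0 && team_pe >= 0 && team_pe < pe_size);
    return pe_start + team_pe * pe_stride;
}

/* World PE -> team-relative PE, or -1 if the PE is not a team member. */
static int world_to_team_pe(int pe_start, int pe_stride, int pe_size,
                            int world_pe)
{
    int offset = world_pe - pe_start;

    /* A zero stride can only describe a single-member team. */
    if (pe_stride == 0)
        return (offset == 0 && pe_size > 0) ? 0 : -1;

    /* Members sit at exact multiples of the stride from pe_start. */
    if (offset % pe_stride != 0)
        return -1;

    int idx = offset / pe_stride;
    return (idx >= 0 && idx < pe_size) ? idx : -1;
}
```

With this model, every team-relative put or get needs a `team_to_world_pe` call at the call site, which is the burden this disadvantage refers to.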

Context from Teams (updated 7/23)

Given a team, a team-specific context can be created (e.g., shmem_team_create_ctx).
Concurrent collectives on the same logical team require distinct team objects.
Operations on contexts are performed relative to the associated team. The default context and local semantic contexts (shmem_ctx_create) are associated with SHMEM_TEAM_WORLD.

Use Model

Program creates a new team: shmem_team_split_* (collective operation)
- Internal pSync and pWrk state lives with the team.
- Depending on the network, hardware offload engine state may live with the team.
Program creates team-based context(s) from that team: shmem_team_create_ctx (collective operation)
- Depending on the network, hardware offload engine state may live with the context.
Program uses team-based context(s), local-semantic contexts (shmem_ctx_create), or the default context for RMA, AMO, collective, and synchronization operations
- P2P operations are performed with team-based PE numbering. (SHMEM_TEAM_WORLD is the associated team for all local-semantic contexts and the default context.)
- Collective operations may be performed on any context; however, the team associated with the context is the point of serialization.
Program destroys team-based context(s): shmem_team_destroy_ctx (collective operation)
Program destroys the team: shmem_team_destroy (collective operation)

Potential advantages

Allows for collective wire-up of any required resources.
Allows for communication concurrency with concurrent collectives (using distinct team objects)

Disadvantages or concerns

???

Open questions + design points

What is the "final" handle for team-relative operations (including RMA, AMO, and collectives)?
- ~~If it is the team, then...~~
- If it is the context, then...
- Using distinct context handles -- even with each created from shmem_team_create_ctx -- may not be safe for concurrent collectives. They would need to be distinct contexts created from distinct teams. Not a concern; this is consistent with the requirement for global serialization of collectives using the same active set in 1.4.

nspark commented 5 years ago

On yesterday's Teams WG call, we discussed adding a configuration structure and associated argument to specify the use of the team. This post is intended to continue the discussion about the fields in such a configuration structure; please pick the following apart and share your thoughts.

typedef struct {
  int disable_collectives;  // zero for default behavior (collectives supported);
                            // nonzero to disable collective support

  int return_local_limit;   // zero indicates library should return (to all PEs) the
                            // constraining value across all PEs (e.g., MIN-reduce);
                            // nonzero indicates library should return the constraining
                            // value of the calling PE

  int num_threads;          // # of threads that may create contexts
} shmem_team_config_t;

// The `config` argument is an input and output, and the function returns a
// status code indicating whether team creation was successful. Specifically:
//
//   - On input, `config` specifies the resource and behavioral requirements of
//     the team that is to be created.
//
//   - If the team is created successfully, the function returns zero and `config`
//     is not modified.
//
//   - If the team is not created successfully, the function returns a nonzero value
//     and `config` is modified to return the locally or globally constraining
//     values, as determined by `config->return_local_limit`.

int shmem_team_split_strided(shmem_team_t parent_team,
                             int PE_start, int PE_stride, int PE_size,
                             shmem_team_config_t *config, shmem_team_t *new_team);

Toy example:

int max_threads = omp_get_max_threads();
shmem_team_config_t config = { .num_threads = max_threads };
shmem_team_t team;
while (shmem_team_split_strided(SHMEM_TEAM_WORLD, /* ... */,
                                &config, &team)) {
  // Requested too many threads; loop until that is not a constraint.
  if (config.num_threads == 0)
    shmem_global_exit(1);
}
#pragma omp parallel num_threads(config.num_threads)
{
  shmem_ctx_t ctx;
  shmem_team_create_ctx(team, 0, &ctx);
  // ...
}
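
To illustrate the in/out behavior of `config` described above, here is a small single-process simulation. It is a sketch under assumptions, not the proposed implementation: `toy_team_split` and `toy_team_config_t` are hypothetical, the per-PE limits are passed in as an array instead of being discovered by the library, and every PE is assumed to request the same `num_threads`.

```c
#include <assert.h>

/* Mirrors the fields of the proposed shmem_team_config_t. */
typedef struct {
    int disable_collectives;
    int return_local_limit;
    int num_threads;
} toy_team_config_t;

/* MIN-reduce over the per-PE limits (stands in for a real reduction). */
static int toy_min_limit(const int *limits, int npes)
{
    int min = limits[0];
    for (int i = 1; i < npes; i++)
        if (limits[i] < min)
            min = limits[i];
    return min;
}

/*
 * Simulated split as seen from PE `my_pe`. Returns zero and leaves `config`
 * untouched on success. On failure, returns nonzero and overwrites
 * config->num_threads with either the calling PE's limit or the global
 * minimum, as selected by config->return_local_limit.
 */
static int toy_team_split(toy_team_config_t *config,
                          const int *limits, int npes, int my_pe)
{
    assert(npes > 0 && my_pe >= 0 && my_pe < npes);

    int global = toy_min_limit(limits, npes);
    if (config->num_threads <= global)
        return 0;  /* every PE can satisfy the request */

    config->num_threads =
        config->return_local_limit ? limits[my_pe] : global;
    return 1;
}
```

Note that with `return_local_limit` set, a PE that is not itself the constraint gets back a value at or above its request, so a retry loop like the toy example above is only guaranteed to converge when the global limit is returned.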

shamisp commented 5 years ago

For backward/forward compatibility reasons you want to add something like:

typedef enum {
    SHMEM_COLLECTIVE_MODE = 1 << 0,
    SHMEM_LOCAL_LIMIT     = 1 << 1,
    SHMEM_NUM_THREADS     = 1 << 2
} shmem_team_config_field_t;

int shmem_team_split_strided(shmem_team_t parent_team,
                             int PE_start, int PE_stride, int PE_size,
                             shmem_team_config_t *config, shmem_team_config_field_t field, shmem_team_t *new_team);

nspark commented 5 years ago

@shamisp I think that's a fair feature to consider. I'd generally like to make sure that a static-initialized shmem_team_config_t structure provides the "default" settings for team creation. That way, an initializer like:

shmem_team_config_t config = { .num_threads = max_threads };

would set num_threads as specified and imply the default for everything else. This would imply that a field mask as part of the structure itself would need to be interpreted with zero as a special value meaning "use all the fields".
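
A rough sketch of that interpretation, assuming the mask lives in the structure as a hypothetical `field_mask` member (neither the member nor the helper names below are part of the proposal):

```c
#include <assert.h>

/* Bit values follow the enum suggested earlier in this thread. */
enum {
    FIELD_COLLECTIVE_MODE = 1 << 0,
    FIELD_LOCAL_LIMIT     = 1 << 1,
    FIELD_NUM_THREADS     = 1 << 2
};

typedef struct {
    unsigned field_mask;       /* hypothetical; 0 means "use all the fields" */
    int disable_collectives;
    int return_local_limit;
    int num_threads;
} masked_team_config_t;

/* A field is honored if the mask is zero or has that field's bit set. */
static int field_is_set(const masked_team_config_t *c, unsigned field)
{
    return c->field_mask == 0 || (c->field_mask & field) != 0;
}

/* Use num_threads when the field is honored, else a library fallback. */
static int effective_num_threads(const masked_team_config_t *c, int fallback)
{
    return field_is_set(c, FIELD_NUM_THREADS) ? c->num_threads : fallback;
}
```

Because a designated initializer zeroes every member it does not name, `{ .num_threads = max_threads }` leaves `field_mask` at zero and all fields are honored, while code written against a later revision of the structure could set explicit bits.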

I would expect that such a default setting would include (1) collectives are enabled (i.e., disable_collectives == 0) and (2) the globally-constraining limit is returned on team-creation failure (i.e., return_local_limit == 0), though I'm not sure what the interpretation of num_threads == 0 should be. Ideally, I think the implementation would inspect the CPU mask and interpret num_threads as the current number of threads in the CPU mask. Unfortunately, functions like sched_getaffinity and CPU_COUNT are glibc extensions for Linux; not exactly portable, though BSD and OS X have similar capabilities.

nspark commented 5 years ago

Draft PDF as of 62feca0 to close this issue (PR #35)

nspark commented 5 years ago

Draft PDF as of a91ed96

RaymondMichael commented 5 years ago

When it comes to whether delete is collective or not, I have no strong opinion. That said, libpsm2's endpoint shutdown code runs a lot faster when everyone is shutting down their endpoints at the same time. This may be true of other interconnects as well.

nspark commented 5 years ago

Some TODO items (or, at least, suggestions) from today's discussion:

Require destruction of team-based contexts before team destruction
Rename config.num_threads to config.max_contexts
Add SHMEM_MAX_CONTEXTS env-var as helper for shmem_init and shmem_ctx_create
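
One way an implementation could enforce the first item is a per-team count of live contexts. The sketch below is only an illustration of that idea; the `rc_*` names are hypothetical and the real routines would also be collective.

```c
#include <assert.h>

typedef struct {
    int live_contexts;  /* team-based contexts created and not yet destroyed */
    int destroyed;      /* nonzero once the team has been torn down */
} rc_team_t;

/* Stand-in for shmem_team_create_ctx: record one more live context. */
static void rc_team_create_ctx(rc_team_t *team)
{
    assert(!team->destroyed);
    team->live_contexts++;
}

/* Stand-in for shmem_team_destroy_ctx: release one live context. */
static void rc_team_destroy_ctx(rc_team_t *team)
{
    assert(team->live_contexts > 0);
    team->live_contexts--;
}

/* Stand-in for shmem_team_destroy: refuse while contexts are still live.
 * Returns zero on success and nonzero otherwise. */
static int rc_team_destroy(rc_team_t *team)
{
    if (team->live_contexts > 0)
        return 1;
    team->destroyed = 1;
    return 0;
}
```

Whether a violation should be reported through a return code, as here, or treated as undefined behavior is a design choice this sketch does not settle.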

shamisp commented 5 years ago

TODO, from my list: shmem_team_get_config -> change the return type from void to int. The user could pass an invalid value, in which case the function needs a way to report failure.

split functions - provide an option to use "parent" resources instead of allocating new resources

gmegan commented 5 years ago

Current up-to-date PDF for this issue: main_spec.pdf

gmegan commented 5 years ago

Closing this since the solution has stood for a while. We can reopen this or create a new issue if this solution is rejected.
