Clarify the collective functionality of the Team creation and Team destroy routines

naveen-rn commented 6 years ago

Let me know if my understanding is correct.

On Teams created with SHMEM_TEAM_NOCOLLECTIVE:

parent team doesn't need to be collective
team creation call is not collective

On Teams created without SHMEM_TEAM_NOCOLLECTIVE:

parent team doesn't need to be collective
team creation call is collective only across the participating PEs; every other (non-participating) PE will return SHMEM_TEAM_NULL
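To make the participating versus non-participating distinction concrete, here is a small plain-C model (no OpenSHMEM library needed) of how a strided split, in the style of shmem_team_split_strided, selects members from the parent team. The helper names and the use of -1 as a stand-in for SHMEM_TEAM_NULL are hypothetical illustrations, and a positive stride is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not the OpenSHMEM API: returns the rank that parent PE
 * `pe` would have in a team made of parent PEs start, start + stride, ...
 * (size members), or -1 standing in for SHMEM_TEAM_NULL for non-members. */
int strided_team_rank(int pe, int start, int stride, int size)
{
    assert(stride > 0 && size > 0);   /* model assumes a positive stride */
    if (pe < start)
        return -1;                    /* before the first member */
    int offset = pe - start;
    if (offset % stride != 0)
        return -1;                    /* falls between members */
    int rank = offset / stride;
    return rank < size ? rank : -1;   /* -1 if past the last member */
}

/* True if `pe` participates in the strided team. */
bool is_team_member(int pe, int start, int stride, int size)
{
    return strided_team_rank(pe, start, stride, size) >= 0;
}
```

For example, with start 0, stride 3, and size 3, PEs 0, 3, and 6 get team ranks 0, 1, and 2, and every other PE gets -1.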

On the team destroy operation:

Irrespective of whether the team was created with or without SHMEM_TEAM_NOCOLLECTIVE, this call is non-collective

nspark commented 6 years ago

On the team destroy operation:

Irrespective of whether the team was created with or without SHMEM_TEAM_NOCOLLECTIVE, this call is non-collective

I think team_destroy is collective by default. It is only non-collective when the team has the SHMEM_TEAM_NOCOLLECTIVE option.

https://github.com/gmegan/specification/blob/7f5aca8723fc23b7cb0bb7cc3c842c78204bf144/content/shmem_team_destroy.tex#L16-L18

gmegan commented 6 years ago

I am trying to find the best place to put this explanation in the spec document. It seems that it belongs in the intro text of the Teams section, since it mostly applies to team create/destroy there, rather than in the collectives section.

naveen-rn commented 6 years ago

team creation call is collective only across the participating PEs; every other (non-participating) PE will return SHMEM_TEAM_NULL

@nspark @shamisp Can you clarify whether these are the right semantics? I'm a bit confused about how to design these semantics efficiently.

Can we create two teams (TEAM:1 and TEAM:2) concurrently from TEAM_WORLD? Assume that the team creation operations are correctly ordered on PE 1.

TEAM_WORLD: PE={1, 2, 3, 4, 5, 6, 7, 8}
TEAM:1 PE={1, 2, 3}
TEAM:2 PE={1, 4, 8}

nspark commented 6 years ago

@naveen-rn In your example, TEAM:1 and TEAM:2 are not disjoint, so they cannot be created concurrently.
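The disjointness condition can be checked mechanically. This sketch is plain C with no OpenSHMEM calls, and teams_disjoint is a hypothetical helper: it tests whether two teams, given as lists of parent-team PE numbers, share any PE. Only disjoint teams would be safe to create concurrently under the rule stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the two PE lists share no PE. A simple O(na * nb) scan
 * is enough for the handful of PEs in this example. */
bool teams_disjoint(const int *a, int na, const int *b, int nb)
{
    assert(na >= 0 && nb >= 0);   /* list lengths must be non-negative */
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j])
                return false;     /* found a PE in both teams */
    return true;
}
```

For the example above, {1, 2, 3} and {1, 4, 8} share PE 1, so teams_disjoint returns false.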

I'll tweak your specific PE assignments to simplify my code logic and so I can use shmem_team_split_strided(), but here are two instances of your example in code:

```c
// Goal: team1 = {0, 1, 2}, team2 = {0, 3, 6}
shmem_team_t team1 = SHMEM_TEAM_NULL;
shmem_team_t team2 = SHMEM_TEAM_NULL;

{ // Method 1: every PE in SHMEM_TEAM_WORLD calls both splits
  shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 1, 3, &team1);  // PEs 0, 1, 2
  shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 3, 3, &team2);  // PEs 0, 3, 6
}

{ // Method 2 (alternative): only the PEs joining each team make the call
  if (shmem_my_pe() < 3)
    shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 1, 3, &team1);
  if ((shmem_my_pe() % 3) == 0 && shmem_my_pe() < 9)  // PEs 0, 3, 6 only
    shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 3, 3, &team2);
}
```

In both cases, PE 0 (the only PE in both teams) creates team1 before team2; however, PEs 1, 2, 3, and 6 are all trying to concurrently create a team with PE 0, which introduces a conflict, or race, between these concurrent operations. The simplest solution would be to call shmem_barrier_all() between the (conditional) creation of the two teams.
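A conservative way to check the rule described here: treat every shmem_barrier_all() as starting a new epoch, and require that any two team creations in the same epoch have no PE in common. The sketch below is a plain-C model with hypothetical names (team_creation, creations_race_free); it does not call OpenSHMEM and assumes strided teams with a positive stride.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of one strided team creation. `epoch` counts the
 * shmem_barrier_all() calls that precede it, so creations separated by a
 * barrier have different epochs. */
typedef struct {
    int epoch;
    int start, stride, size;
} team_creation;

static bool in_strided_team(int pe, const team_creation *t)
{
    assert(t->stride > 0);                 /* model assumes a positive stride */
    if (pe < t->start || (pe - t->start) % t->stride != 0)
        return false;
    return (pe - t->start) / t->stride < t->size;
}

/* Returns true if no PE in [0, npes) belongs to two creations that share an
 * epoch, i.e. no PE is asked to join two concurrent creations. */
bool creations_race_free(const team_creation *ops, int nops, int npes)
{
    for (int i = 0; i < nops; i++)
        for (int j = i + 1; j < nops; j++) {
            if (ops[i].epoch != ops[j].epoch)
                continue;                  /* a barrier separates them */
            for (int pe = 0; pe < npes; pe++)
                if (in_strided_team(pe, &ops[i]) && in_strided_team(pe, &ops[j]))
                    return false;          /* shared PE in the same epoch */
        }
    return true;
}
```

Modeling the example with nine PEs, the two creations in epoch 0 share PE 0 and the check fails; placing team2 in epoch 1 (a barrier in between) makes it pass.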

spophale commented 6 years ago

Is the above a problem only when both team1 and team2 are intended for collectives?

nspark commented 6 years ago

@spophale No, I think it's still a problem. Originally, I think the intent was that shmem_team_split(NOCOLLECTIVE) would be a local operation, but with the team-context interaction, I think team creation has a collective nature even when the team will not itself be used for collectives. The NOCOLLECTIVE option then serves to reduce the use of internal resources (e.g., internal pSync and pWork structures).

shamisp commented 6 years ago

Even with respect to the team-context interaction, we may be able to avoid global synchronization if the system is symmetric or a datagram transport is used. @spophale, there is no way around it with any solution unless you allocate contexts as a collective operation, which leads to all kinds of other constraints.

Technically, the mandate of the teams proposal was extended to address issues related to context creation with point-to-point transports. So it is a teams + contexts proposal :)

gmegan commented 6 years ago

Just posted PR #41 for this issue.

PDF: main_spec.pdf

There is a lot of text to deal with the special case of teams that don't communicate. This happens when one creates a team with SHMEM_TEAM_NOCOLLECTIVE and 0 contexts. The main case where this will happen is in doing some number of 2D splits where there are intermediate teams that do nothing but get split again.
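The staged splitting described here can be sketched as index arithmetic. Assuming a parent team of nx * ny * nz PEs numbered with x varying fastest, a first strided split (in the style of shmem_team_split_strided) yields one team per xy-plane, and a second split of each plane yields its rows. The plane teams exist only to be split again, so they never communicate. The struct and helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical parameters of one strided split. */
typedef struct { int start, stride, size; } strided_params;

/* Stage 1: split the parent (x fastest) into xy-planes of nx * ny PEs.
 * Parameters are relative to the parent team. */
strided_params plane_params(int parent_rank, int nx, int ny)
{
    assert(nx > 0 && ny > 0);
    int plane = nx * ny;
    strided_params p = { (parent_rank / plane) * plane, 1, plane };
    return p;
}

/* Stage 2: split a plane team into rows of nx PEs.
 * `plane_rank` is the PE's rank within its plane team
 * (parent_rank % (nx * ny) under the numbering above). */
strided_params row_params(int plane_rank, int nx)
{
    assert(nx > 0);
    strided_params p = { (plane_rank / nx) * nx, 1, nx };
    return p;
}
```

With nx = 2 and ny = 3, parent PE 7 lands in the plane team starting at parent rank 6 (size 6), has plane rank 1, and joins the row starting at plane rank 0 (size 2).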

Would it make more sense to define an option like SHMEM_TEAM_STUB to encapsulate this behavior? As it is, the user has to track down all of the conditions that trigger the need to synchronize. Currently those are just contexts and collective support, but once teams have heaps or other shared structures, this approach will not be backward compatible.
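The bookkeeping burden described here can be made concrete with a hypothetical configuration model (team_config_model, needs_sync, and is_stub are illustrative names, not spec API). The need to synchronize is the OR of every communicating feature, so each new shared resource adds another term the user must check, whereas a single SHMEM_TEAM_STUB-style option would state the intent directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical team configuration; field names are illustrative only. */
typedef struct {
    bool collectives;   /* false models SHMEM_TEAM_NOCOLLECTIVE */
    int  num_contexts;  /* contexts the team will create */
} team_config_model;

/* What a user must reason about today: a team needs synchronized
 * creation/destruction if it supports collectives or owns any contexts.
 * A future shared resource (e.g. a team heap) would add a term here. */
bool needs_sync(const team_config_model *c)
{
    assert(c->num_contexts >= 0);
    return c->collectives || c->num_contexts > 0;
}

/* A "stub" team in the sense proposed above never communicates. */
bool is_stub(const team_config_model *c)
{
    return !needs_sync(c);
}
```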

gmegan commented 6 years ago

Fixed typo in PR #41

main_spec.pdf

gmegan commented 6 years ago

Merged PR, closing this issue

gmegan / specification

Clarify the collective functionality of the Team creation and Team destroy routines #33