Team creation without intervening synchronization

gmegan commented 5 years ago

Currently, it is the case that:

Team split operations are collective w.r.t. all resulting child teams. In the case where there are multiple resulting child teams from a single split operation, the operation is collective w.r.t. the union of all PEs in all child teams.
A PE that is not part of the parent team but will not be a member of any resulting child team may call the collective routine. It will have all resulting child team handled set to SHMEM_TEAM_NULL in a local operation.
The application must ensure that no two split operations of a given parent team resulting in child teams containing some shared PE n are called simultaneously. In other words if team T0 is being split by two subsequent split operations into two new teams T1 and T2, both containing PE n, then PE n cannot receive any traffic from the split operation resulting in T2 before completing the split operation resulting in T1.

An example of a violation of the final condition is:

int my_pe = shmem_my_pe();
if ((my_pe % 2) == 0)
   shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 2, shmem_n_pes() / 2, NULL, 0, t2);
if ((my_pe % 3) == 0)
   shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 3, shmem_n_pes() / 3, NULL, 0, t3);

This code creates a case where PE 6 could still be involved in the split resulting t1 and simultaneously receive traffic from PE 3 related to the split resulting in t2

It would be preferable to remove the third split condition such that the above code would be legal.

gmegan commented 5 years ago

A possible solution to the above problem is to have the library implementation generate a unique tag for split operations such that PE 6 could receive traffic for both split operations simultaneously and be able to differentiate the traffic by tag to determine if the traffic relates the the split into t1 or the split into t2.

A way to conceptualize this would be to think of implementing the split operations in terms of

MPI_Comm_create_group(MPI_Comm comm, MPI_Group group, int tag, MPI_Comm* newcomm))

where comm is the parent team, group describes the new set of PEs to be added to the child team, the tag is internally generated by the implementation, and newcomm is the child team

The question is: Can we generate a unique tag from the existing information passed through the split API?

The split operation provides the following information that could be used for tagging:

start
stride
size
parent_team handle

Is there a case where all of this information is identical between subsequent calls to split, such that any tag created from this information would be identical between subsequent calls to split?

If we assume team split is a synchronizing operation, then in any case where PE n is involved in a split into a child team with some (parent, start, stride, size), we know that all other PEs involved in that split must enter the split before any can exit. It seems like this would take care of the situation of subsequent splits into the exact same set of PEs.

For example:

#pragma omp parallel
#pragma omp for
for (i=0; i<omp_num_threads(); i++) {
  if (omp_get_thread_num() == i) {
    shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 2, shmem_n_pes() / 2, NULL, 0, &td_team);
  }
#pragma omp barrier
}

This code creates teams inside of threads that contain all of the threads with the same thread ID across SHMEM_TEAM_WORLD. There are subsequent splits into overlapping child teams from SHMEM_TEAM_WORLD, but the splits are synchronizing across all PEs in the child teams. So, for example, the team for thread ID 1 will not be entered by PE 0 until all PEs have entered the split to create the team for thread 0. With the synchronizing guarantee it seems like the implementation can ensure that no even PE exits the split until all all even PEs are ready for the next split.

If this is the case, then the synchronization problem only comes up for subsequent splits into overlapping teams with different membership, as in the example above about splitting into teams based on multiples of 2s and 3s.

What is the minimal amount of information needed to generate a unique tag? Does it fit into an int?

naveen-rn commented 5 years ago

If we keep the option to design a unique tag generation out of discussion for while, the main issue is that the tags are communicated via the same buffer. So, how do we avoid messages overlapping each other from different split operations? Aren't there a possibility of missing a message and resulting in deadlock? Are we supposed to perform an internal barrier with the participating child team PEs before performing a split operation? If so, what is the safe way to perform this barrier operation?

To me tag generation is a secondary issue. I don't follow the main flow of this algorithm.

naveen-rn commented 5 years ago

On completely overlapping teams, may be my concern doesn't matter. But, with partially overlapping child teams - my previous concern seems to be an issue.

gmegan commented 5 years ago

Will reopen new issue with better problem statement

gmegan / specification

Team creation without intervening synchronization #111