Only return a pointer to an ompi_communicator_t struct from
ompi_comm_lookup
ompi_comm_lookup_cid
when the communicator has a PML associated with it.
The ompi_comm_lookup function continues to return NULL if the entry designated by the c_index argument in the ompi_mpi_communicators table is OMPI_COMM_SENTINEL.
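
The lookup rule described above can be illustrated with a small standalone sketch. It does not use the real Open MPI types: mock_comm_t, mock_table, MOCK_COMM_SENTINEL, pml_data, and mock_comm_lookup are hypothetical stand-ins chosen for illustration, and the actual field names and table access in the patch may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ompi_communicator_t: only the pieces
 * needed to illustrate the check are modeled here. */
typedef struct mock_comm {
    uint32_t index;   /* slot in the communicator table */
    void *pml_data;   /* assumed: NULL until a PML has been attached */
} mock_comm_t;

#define MOCK_TABLE_SIZE 8

/* Hypothetical stand-in for OMPI_COMM_SENTINEL: marks a slot that has
 * been reserved during CID allocation but holds no communicator yet. */
static mock_comm_t mock_sentinel;
#define MOCK_COMM_SENTINEL (&mock_sentinel)

/* Hypothetical stand-in for the ompi_mpi_communicators table. */
static mock_comm_t *mock_table[MOCK_TABLE_SIZE];

/* Return the communicator stored at `index` only if it is usable for
 * message matching: the slot is in range, is neither empty nor the
 * sentinel, and the communicator already has a PML attached. */
static mock_comm_t *mock_comm_lookup(uint32_t index)
{
    if (index >= MOCK_TABLE_SIZE) {
        return NULL;
    }

    mock_comm_t *comm = mock_table[index];

    if (NULL == comm || MOCK_COMM_SENTINEL == comm) {
        return NULL;   /* empty or reserved-only slot */
    }
    if (NULL == comm->pml_data) {
        return NULL;   /* slot filled, but no PML associated yet */
    }
    return comm;
}
```

In this sketch a caller sees NULL both for a sentinel slot and for a communicator that has been placed in the table without a PML, so it has to treat the two cases the same way.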
OLD COMMIT MESSAGE BEFORE REFACTOR
This patch addresses a race condition in OB1. One way to trigger the race is to use MPI_Comm_spawn under oversubscribed conditions.
The fundamental reason for this race condition is that the CID allocation procedure for intercommunicators does not have a barrier in the ompi_comm_activate_nb procedure. As a result, a process can receive a fragment (message) from another process participating in the spawn procedure while it is still in the CID allocation procedure (within ompi_comm_next_cid_nb). The process may have allocated a suitable slot in the ompi_mpi_communicators table but not yet associated it with a PML.
So in this code path it is necessary to check both that a valid CID for the incoming message header's context is present in ompi_mpi_communicators and that a PML is associated with this communicator.
This problem is specific to intercommunicators at the time of this PR, as intracommunicators have barrier-like behavior in ompi_comm_activate_nb.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 651ef79e713b7977933a447dba7ed1ff61ec3c6a)