open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OMPI/COMM: be more conservative about when a comm is ready #12612

Closed hppritcha closed 3 months ago

hppritcha commented 3 months ago

Only return a pointer to an ompi_communicator_t struct now from

when the communicator has a PML associated with it.

The ompi_comm_lookup function will continue to return NULL if the entry designated by the c_index argument in the ompi_mpi_communicators table is OMPI_COMM_SENTINEL.
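
The lookup rule described above can be illustrated with a minimal sketch. It is not the Open MPI source: every `sketch_` name, the fixed-size table, and the `pml_data` field are hypothetical stand-ins chosen for illustration, with `SKETCH_COMM_SENTINEL` playing the role of OMPI_COMM_SENTINEL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a communicator; not the real Open MPI struct. */
typedef struct sketch_comm {
    void *pml_data;   /* non-NULL once a PML has been associated */
} sketch_comm_t;

#define SKETCH_TABLE_SIZE 8

/* Placeholder object marking a reserved slot (assumed sentinel scheme). */
sketch_comm_t sketch_sentinel_storage;
#define SKETCH_COMM_SENTINEL (&sketch_sentinel_storage)

/* Hypothetical communicator table, zero-initialized (all slots empty). */
sketch_comm_t *sketch_comm_table[SKETCH_TABLE_SIZE];

/* Associate PML state with a communicator. */
void sketch_comm_attach_pml(sketch_comm_t *comm, void *pml_data)
{
    assert(comm != NULL && pml_data != NULL);
    comm->pml_data = pml_data;
}

/* Return the communicator only when it is fully usable, otherwise NULL. */
sketch_comm_t *sketch_comm_lookup(size_t c_index)
{
    if (c_index >= SKETCH_TABLE_SIZE) {
        return NULL;                      /* index out of range */
    }
    sketch_comm_t *comm = sketch_comm_table[c_index];
    if (comm == NULL || comm == SKETCH_COMM_SENTINEL) {
        return NULL;                      /* empty or merely reserved slot */
    }
    if (comm->pml_data == NULL) {
        return NULL;                      /* slot filled, but no PML yet */
    }
    return comm;
}
```

In this sketch a caller cannot distinguish "no such communicator" from "communicator not ready yet"; both come back as NULL, which is the conservative behavior the PR title describes.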

OLD COMMIT MESSAGE BEFORE REFACTOR

This patch addresses a race condition in OB1. One way this race condition is encountered is when using MPI_Comm_spawn under oversubscribed conditions.

The fundamental reason for this race condition is that the CID allocation procedure for intercommunicators does not have a barrier in the ompi_comm_activate_nb procedure. As a result, it is possible for a process to receive a fragment (message) from another process participating in the spawn procedure while it is still in the CID allocation procedure (within ompi_comm_next_cid_nb). The process may have allocated a suitable slot in the ompi_mpi_communicators table but not yet associated it with a PML.

So in this code path it is necessary to check both that a valid CID for the incoming message header's context is present in ompi_mpi_communicators and that a PML is associated with this communicator. This problem is specific to intercommunicators at the time of this PR, as intracommunicators have barrier-like behavior in ompi_comm_activate_nb.
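
As a rough illustration of the two-part check on the receive path, the sketch below models a fragment arriving before its communicator is fully set up. It is an assumption-laden toy, not OB1 code: all `demo_` names are hypothetical, and parking early fragments in a pending array to replay them later is one plausible way to handle the "not ready" case, which the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_CID     8
#define DEMO_MAX_PENDING 16

/* Hypothetical per-CID state; not the real Open MPI communicator. */
typedef struct demo_comm {
    bool slot_reserved;  /* a table slot was allocated for this CID */
    bool pml_attached;   /* PML state has been associated with it */
    int  delivered;      /* fragments handed to the matching logic */
} demo_comm_t;

/* Hypothetical incoming fragment: header context id plus a payload. */
typedef struct demo_frag {
    size_t ctx;
    int    payload;
} demo_frag_t;

demo_comm_t demo_comms[DEMO_MAX_CID];
demo_frag_t demo_pending[DEMO_MAX_PENDING];
size_t      demo_pending_count = 0;

/* Both conditions must hold: a valid slot AND an attached PML. */
bool demo_comm_ready(size_t ctx)
{
    return ctx < DEMO_MAX_CID
        && demo_comms[ctx].slot_reserved
        && demo_comms[ctx].pml_attached;
}

/* Deliver the fragment if the communicator is ready; otherwise park it.
 * Returns true when delivered. A full pending array drops the fragment,
 * a simplification a real implementation would not make. */
bool demo_recv_frag(demo_frag_t frag)
{
    if (demo_comm_ready(frag.ctx)) {
        demo_comms[frag.ctx].delivered++;
        return true;
    }
    if (demo_pending_count < DEMO_MAX_PENDING) {
        demo_pending[demo_pending_count++] = frag;
    }
    return false;
}

/* Attach the PML, then replay parked fragments for this CID.
 * Returns the number of fragments replayed. */
size_t demo_comm_activate(size_t ctx)
{
    if (ctx >= DEMO_MAX_CID || !demo_comms[ctx].slot_reserved) {
        return 0;
    }
    demo_comms[ctx].pml_attached = true;

    size_t kept = 0, replayed = 0;
    for (size_t i = 0; i < demo_pending_count; i++) {
        if (demo_pending[i].ctx == ctx) {
            demo_comms[ctx].delivered++;          /* now safe to deliver */
            replayed++;
        } else {
            demo_pending[kept++] = demo_pending[i]; /* keep for other CIDs */
        }
    }
    assert(kept + replayed == demo_pending_count);
    demo_pending_count = kept;
    return replayed;
}
```

Checking only `slot_reserved` in `demo_comm_ready` would reproduce the race in miniature: the fragment would be delivered to a communicator whose PML state does not exist yet.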

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 651ef79e713b7977933a447dba7ed1ff61ec3c6a)