ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Comm_spawn deadlock w/faults #26

Open abouteiller opened 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Multiple cases of deadlock in MPI_Comm_spawn with faults.

  1. MPI_Comm_spawn on a communicator containing failed processes: the initiator group fails as expected (MPIX_ERR_PROC_FAILED is returned) and the intercomm is NULL, yet the new processes are still spawned. They then deadlock, of course, because the initiators do not participate in the second part of the spawn.
  2. MPI_Comm_spawn fails with "not enough slots available" after a process failure, when recovery is so fast that the spawn request arrives before PMIx has cleaned up the slots of the dead processes. Once this condition arises, the rest of the code deadlocks in spawn.
abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Condition 2 is an upstream bug:

  1. The spawn root detects the "no slot available" condition and leaves the spawn partway through.
  2. The non-root spawners then deadlock in the connect-accept bcast for the group exchange.
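
The two steps above can be illustrated with a toy model in plain C. This is only a sketch of the control flow described in this comment, not Open MPI code: `model_spawn`, `count_state`, the `rank_state` enum, and the `share_status` flag are all hypothetical names, and the "share the launch status before the group exchange" variant is an assumed direction for a fix, not something this thread confirms.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in Open MPI. */
enum rank_state {
    RETURNED_OK,       /* spawn completed on this rank */
    RETURNED_ERROR,    /* spawn returned an error on this rank */
    BLOCKED_IN_BCAST   /* rank waits forever in the group-exchange bcast */
};

/*
 * Model one spawn call over `nranks` spawners, with rank 0 as the root.
 *   launch_ok    - false models the root seeing "no slot available".
 *   share_status - hypothetical fix: the root tells every spawner the
 *                  launch outcome before anyone enters the group exchange.
 * The outcome for each rank is written to out[0..nranks-1].
 */
void model_spawn(int nranks, bool launch_ok, bool share_status,
                 enum rank_state *out)
{
    /* As reported: on a failed launch the root leaves the spawn and
     * never enters the group-exchange bcast. */
    bool root_enters_exchange = launch_ok;

    for (int r = 0; r < nranks; r++) {
        if (r == 0) {
            out[r] = launch_ok ? RETURNED_OK : RETURNED_ERROR;
        } else if (share_status && !launch_ok) {
            /* Non-root learned about the failure and skips the exchange. */
            out[r] = RETURNED_ERROR;
        } else {
            /* A broadcast completes only if its root takes part. */
            out[r] = root_enters_exchange ? RETURNED_OK : BLOCKED_IN_BCAST;
        }
    }
}

/* Count how many of the n ranks ended in state `want`. */
int count_state(const enum rank_state *states, int n, enum rank_state want)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (states[i] == want) {
            count++;
        }
    }
    return count;
}
```

Calling `model_spawn(4, false, false, states)` leaves the three non-root ranks in `BLOCKED_IN_BCAST`, which matches the deadlock described here. With `share_status` set to `true`, every rank ends in `RETURNED_ERROR` instead.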
abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


This is problematic, but we can release without fixing it. The problem can be avoided by using appropriate mpirun flags; meanwhile, the issue in spawn is a rare scenario.

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Bug 2 is still present.