ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Intercomms collective not really fault tolerant #35

Open abouteiller opened 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Collective operations implemented in the coll:inter component are not really fault tolerant. In many places we find the following pattern:

if(MPI_SUCCESS != MPI_Gather(local_comm)) goto exit;
if(MPI_SUCCESS != MPI_Sendrecv(inter_comm)) goto exit;
if(MPI_SUCCESS != MPI_Bcast(local_comm)) goto exit;

Such a code pattern can deadlock after a failure: a process that detects the failure during the gather jumps to exit and may never participate in its part of the sendrecv on a different communicator, so the processes still waiting on that sendrecv may block indefinitely.
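
To make the control-flow problem concrete, here is a minimal sketch in plain C that contrasts the early-exit pattern with one possible alternative: entering every phase regardless of earlier errors and returning the first error seen. Everything prefixed with `sketch_` is hypothetical and only stands in for the real MPI calls; this is not the coll:inter code. Whether this alone is sufficient for coll:inter is an assumption and is not established by this report.

```c
/* Hypothetical return codes standing in for MPI_SUCCESS and an MPI error class. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PROC_FAILED = 1 };

/* Hypothetical per-process record: the result each phase will report,
 * plus a count of the phases this process actually entered. */
typedef struct {
    int gather_rc;
    int sendrecv_rc;
    int bcast_rc;
    int phases_entered;
} sketch_proc_t;

/* Stand-ins for MPI_Gather(local_comm), MPI_Sendrecv(inter_comm) and
 * MPI_Bcast(local_comm). They only record that the phase was entered. */
int sketch_gather(sketch_proc_t *p)   { p->phases_entered++; return p->gather_rc; }
int sketch_sendrecv(sketch_proc_t *p) { p->phases_entered++; return p->sendrecv_rc; }
int sketch_bcast(sketch_proc_t *p)    { p->phases_entered++; return p->bcast_rc; }

/* The pattern from the report: the first error skips all later phases,
 * so peers that did not see the error wait for a partner that never comes. */
int sketch_early_exit(sketch_proc_t *p)
{
    int rc;
    if (SKETCH_SUCCESS != (rc = sketch_gather(p))) goto exit;
    if (SKETCH_SUCCESS != (rc = sketch_sendrecv(p))) goto exit;
    rc = sketch_bcast(p);
exit:
    return rc;
}

/* One possible alternative (an assumption, not the project's fix):
 * enter every phase and report the first error at the end. */
int sketch_all_phases(sketch_proc_t *p)
{
    int rc;
    int first_err = SKETCH_SUCCESS;

    rc = sketch_gather(p);
    if (SKETCH_SUCCESS == first_err) first_err = rc;

    rc = sketch_sendrecv(p);
    if (SKETCH_SUCCESS == first_err) first_err = rc;

    rc = sketch_bcast(p);
    if (SKETCH_SUCCESS == first_err) first_err = rc;

    return first_err;
}
```

With a process whose gather reports a failure, `sketch_early_exit` enters only one phase, while `sketch_all_phases` enters all three and still returns the gather error to the caller.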