Open abouteiller opened 6 years ago
Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Collective operations implemented in the coll:inter component are not really fault tolerant. In many places we find the following pattern:
if(MPI_SUCCESS != MPI_Gather(local_comm)) goto exit;
if(MPI_SUCCESS != MPI_Sendrecv(inter_comm)) goto exit;
if(MPI_SUCCESS != MPI_Bcast(local_comm)) goto exit;
This pattern can deadlock after a failure: processes that detect the failure during the gather jump to exit and may never take part in their side of the sendrecv on the other communicator, while their peers keep waiting in it.
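One possible way to avoid the mismatch, sketched here as an assumption and not as the fix adopted in the component, is to let every process go through all three phases and only report the first error at the end. The sketch below uses no MPI at all: `phase_fn`, `run_all_phases`, `struct call_log` and the `fake_*` functions are hypothetical stand-ins for the gather on the local communicator, the sendrecv on the inter-communicator, and the bcast on the local communicator.

```c
#include <assert.h>
#include <stddef.h>

#define PHASE_OK 0  /* stand-in for MPI_SUCCESS */

/* Hypothetical phase signature: returns PHASE_OK or an error code. */
typedef int (*phase_fn)(void *ctx);

/* Hypothetical bookkeeping so the stand-in phases can record that they ran. */
struct call_log { int calls; };

/* Run every phase in order, even after one of them fails, and return the
 * first error seen. No phase is skipped, so a peer waiting in a later phase
 * is not left without a partner by an early "goto exit". */
static int run_all_phases(const phase_fn *phases, size_t n, void *ctx)
{
    int first_err = PHASE_OK;

    assert(phases != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int rc = phases[i](ctx);
        if (rc != PHASE_OK && first_err == PHASE_OK) {
            first_err = rc;  /* remember the error, keep participating */
        }
    }
    return first_err;
}

/* Stand-in for the gather on the local communicator; always fails with 7. */
static int fake_gather_fails(void *ctx)
{
    ((struct call_log *)ctx)->calls++;
    return 7;
}

/* Stand-in for the sendrecv on the inter-communicator; always succeeds. */
static int fake_sendrecv(void *ctx)
{
    ((struct call_log *)ctx)->calls++;
    return PHASE_OK;
}

/* Stand-in for the bcast on the local communicator; always succeeds. */
static int fake_bcast(void *ctx)
{
    ((struct call_log *)ctx)->calls++;
    return PHASE_OK;
}
```

Calling `run_all_phases` with the three stand-ins in order returns the gather's error code while still invoking the sendrecv and bcast stand-ins. Whether continuing with possibly incomplete data after a failed gather is acceptable depends on the collective; this only illustrates the control flow that keeps participation symmetric.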