Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Running
~/DEVEL/ompi/ulfm2.debug/bin/mpirun -n 10 -oversubscribe -mca btl tcp,self 12.buddycr -v
On rare occasions, this results in rank 1 blocking in `Send(..., to=5,...)`, where 5 is the failed process, inside `MPI_Bcast` for iteration 4. The communicator is marked as revoked and the peer has failed, but the Send remains in `ompi_request_wait_completion...opal_progress` forever.