ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

OSX: Send to failed process on BTL TCP may deadlock #43

Closed abouteiller closed 4 years ago

abouteiller commented 5 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Running ~/DEVEL/ompi/ulfm2.debug/bin/mpirun -n 10 -oversubscribe -mca btl tcp,self 12.buddycr -v

In rare cases, this run leaves rank 1 blocked in Send(..., to=5, ...), where rank 5 is the failed process, inside MPI_Bcast during iteration 4. The communicator is marked as revoked and the peer as failed, but the Send never returns from ompi_request_wait_completion...opal_progress.
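
The hang can be pictured with a toy model of a blocking wait loop. This is only a sketch, not Open MPI code: `toy_request`, `toy_progress`, `toy_wait`, and the `TOY_*` codes are hypothetical names, and the iteration bound exists solely so the example terminates. It illustrates that a wait loop returns only if the progress step completes the pending request with an error once the peer is known to have failed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not Open MPI internals. */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1, TOY_ERR_STUCK = 2 };

typedef struct {
    int  peer;      /* destination rank of the pending send */
    bool complete;  /* set once the request has finished */
    int  status;    /* TOY_SUCCESS or an error code */
} toy_request;

/* One progress step. If error propagation is enabled and the peer is
 * marked failed, the pending request is completed with an error. */
static void toy_progress(toy_request *req, const bool *peer_failed,
                         bool propagate_errors)
{
    if (!req->complete && propagate_errors && peer_failed[req->peer]) {
        req->status = TOY_ERR_PROC_FAILED;
        req->complete = true;
    }
}

/* Blocking wait, bounded by max_iters so the example cannot hang. */
int toy_wait(toy_request *req, const bool *peer_failed,
             bool propagate_errors, int max_iters)
{
    for (int i = 0; i < max_iters && !req->complete; i++) {
        toy_progress(req, peer_failed, propagate_errors);
    }
    return req->complete ? req->status : TOY_ERR_STUCK;
}
```

With `propagate_errors` set to false, a send to a failed peer never completes and `toy_wait` gives up with `TOY_ERR_STUCK`, which mirrors the symptom reported here. With it set to true, the same wait returns `TOY_ERR_PROC_FAILED`.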

abouteiller commented 4 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Resolved by one of the changes to TCP error handling (EPIPE, event_del, thread safety, etc.).
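
For readers unfamiliar with EPIPE, the following is a minimal POSIX sketch of how that error surfaces when writing to a socket whose peer has gone away. It is not taken from the BTL TCP code. The function name `write_to_closed_peer` is hypothetical, and a local `socketpair` stands in for a TCP connection so the example is self-contained. With a real TCP connection the error may appear only on a later write, and may be reported as ECONNRESET instead.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Writes to a stream socket whose peer end has been closed.
 * Returns the resulting errno, 0 if the write unexpectedly succeeds,
 * or -1 if the socket pair could not be created. */
int write_to_closed_peer(void)
{
    int fds[2];
    const char byte = 'x';
    int err = 0;

    /* Without this, the failed write raises SIGPIPE, whose default
     * action terminates the process. */
    signal(SIGPIPE, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    close(fds[1]);  /* simulate the peer process dying */

    if (write(fds[0], &byte, 1) < 0) {
        err = errno;  /* EPIPE is expected here */
    }
    close(fds[0]);
    return err;
}
```

A caller that sees EPIPE from this function knows the peer is gone and can fail any pending operation toward it rather than continuing to wait.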