ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

TCP BTL triggers false detection #21

Closed by abouteiller 6 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


[c00.cauchy:29481] [[9733,1],1] ompi_rbcast_bml_send_complete_cb: status -12
[c00.cauchy:29481] PML:OB1: the error handler was invoked by the tcp BTL for proc [[9733,1],0] with info Socket closed
[c00.cauchy:29481] [[9733,1],1] ompi: Process [[9733,1],0] failed (state = -57).

The modifications that make the TCP BTL resilient appear to be too aggressive: in some instances the BTL triggers failure events for operations involving processes that have not failed.

abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Possibly a bug inherited from upstream. George is working on a fix for Open MPI.

abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Problem solved in patches 0237a707 and 61c5954f.