ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Open IB post-fault credit release is slow #31

Closed abouteiller closed 3 years ago

abouteiller commented 6 years ago

Original report by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


The credits spent toward a failed node are returned only when the pending frags eventually time out. If the application is communicating intensively with the process at the moment it fails, it may run out of credits altogether.

The problem is that the default timeout is fairly long (10 to 15 s or so). The credits do eventually get released, but in the meantime the application may stall completely for lack of send credits.

Simply decreasing the IB timeouts and retry counts is too dangerous: moving them down far enough to bring the post-failure stall under the pain threshold leads to live processes being randomly reported as failed.
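To make the tradeoff concrete, here is a rough sketch of how the stall window scales with these settings. It relies on the InfiniBand definition of the Local ACK Timeout as 4.096 µs × 2^exponent (an exponent of 0 is special and means no timeout). The helper names are hypothetical, and the retry model (one full timeout for the initial attempt and for each retry) is a simplifying assumption; real HCAs may wait longer per attempt. In Open MPI these settings are presumably the `btl_openib_ib_timeout` and `btl_openib_ib_retry_count` MCA parameters.

```c
/* Hypothetical helpers, for illustration only. */

/* Local ACK Timeout in seconds for a given exponent (valid for 1..31;
 * 0 means "no timeout" in InfiniBand and is not modeled here). */
double ack_timeout_seconds(unsigned exponent)
{
    return 4.096e-6 * (double)(1ULL << exponent);
}

/* Rough estimate of how long a fragment to a dead peer stays pending.
 * Assumption: the initial attempt and each of the retry_count retries
 * wait exactly one ACK timeout before the send is reported as failed. */
double stall_estimate_seconds(unsigned exponent, unsigned retry_count)
{
    return ack_timeout_seconds(exponent) * (double)(retry_count + 1);
}
```

With illustrative values of exponent 18 and 7 retries, this model gives about 8.6 s. Because the timeout is a power of two, each step down in the exponent halves the window, so the knob is coarse: the only way to shrink the stall is to also shrink the margin a slow but live peer has to answer.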

We need to figure out a better way to release the credits earlier (i.e. as soon as we know the target process has failed, not when the associated fragments time out).
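A minimal sketch of the idea, as a toy model rather than the actual openib BTL code. It assumes a single send-credit pool shared across peers; `credit_pool_t` and all function names are hypothetical. On a failure notification, every credit held by fragments in flight to the dead peer is reclaimed at once, and completions (or timeouts) that arrive later for those fragments are ignored so that no credit is returned twice.

```c
#include <stdbool.h>

#define MAX_PEERS 4   /* arbitrary size for the toy model */

/* Hypothetical send-credit pool shared by all peers. */
typedef struct {
    int  available;             /* credits free for new sends */
    int  in_flight[MAX_PEERS];  /* credits held by pending frags, per peer */
    bool failed[MAX_PEERS];     /* peers already known to have failed */
} credit_pool_t;

static bool valid_peer(int peer) { return peer >= 0 && peer < MAX_PEERS; }

void pool_init(credit_pool_t *p, int total)
{
    p->available = total;
    for (int i = 0; i < MAX_PEERS; i++) {
        p->in_flight[i] = 0;
        p->failed[i] = false;
    }
}

/* Spend one credit to send a fragment; false if the peer is invalid,
 * known dead, or no credit is left. */
bool pool_send(credit_pool_t *p, int peer)
{
    if (!valid_peer(peer) || p->failed[peer] || p->available == 0)
        return false;
    p->available--;
    p->in_flight[peer]++;
    return true;
}

/* Normal path: a fragment completed or timed out, so its credit returns.
 * Does nothing if the credit was already reclaimed by pool_peer_failed. */
void pool_complete(credit_pool_t *p, int peer)
{
    if (!valid_peer(peer) || p->in_flight[peer] == 0)
        return;
    p->in_flight[peer]--;
    p->available++;
}

/* Early release: reclaim every credit held toward a peer as soon as it is
 * reported failed. Returns the number of credits reclaimed. */
int pool_peer_failed(credit_pool_t *p, int peer)
{
    if (!valid_peer(peer))
        return 0;
    int reclaimed = p->in_flight[peer];
    p->failed[peer] = true;
    p->in_flight[peer] = 0;
    p->available += reclaimed;
    return reclaimed;
}
```

In this model, a process that has spent all of its credits toward a peer can resume sending to the other peers immediately after calling `pool_peer_failed`, instead of waiting for each fragment to time out.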

abouteiller commented 3 years ago

Rendered obsolete by removal of OpenIB from Open MPI