ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

Dealing with node-level failures #34

Closed by abouteiller 5 years ago

abouteiller commented 6 years ago

Original report by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


As reported on the mailing list, when a node goes down completely, the faults are not detected correctly, or at least not within a reasonable amount of time.

abouteiller commented 6 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Retuned the default detector parameters and made them visible in ompi_info in commit 228c12a. Waiting on users' feedback.

abouteiller commented 6 years ago

Original comment by George Bosilca (Bitbucket: bosilca, GitHub: bosilca).


I have added comments to the commit regarding the timeout value.

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


This has been addressed.