mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

ULFM: fault model is not completely defined #816

Open abouteiller opened 1 year ago

abouteiller commented 1 year ago

Problem

The fault model in the Fault Tolerance chapter is only alluded to. We intentionally kept it somewhat blurry to give implementors freedom in which fault types would manifest as MPI errors, but as a result the fault model is insufficiently defined. Bill Gropp proposed that we define the fault model, as experienced by the user, very firmly, and add advice to implementors clarifying that they have freedom in which fault types they tolerate, but not in how faults are exposed to the user.

Proposal

Specify the fault model strictly in terms of user-visible behavior. Add advice to implementors explaining what to do if they want to tolerate fault types other than process failures.

Changes to the Text

Impact on Implementations

No impact on implementations (beyond making it clearer what to do).

Impact on Users

Clarifies expectations for both users and implementors.

References and Pull Requests

https://github.com/mpi-forum/mpi-standard/pull/947