Problem
The fault model in the Fault Tolerance chapter is only alluded to. We intentionally kept it somewhat blurry to give implementors freedom in which fault types would manifest as MPI errors, but as a result the fault model is insufficiently defined. Bill Gropp proposed that we define the fault model very firmly as experienced by the user, and add an advice to implementors clarifying that they have freedom in which fault types they tolerate, but not in how those faults are exposed to the user.
Proposal
Specify the fault model strictly in terms of user-visible behavior. Add an advice to implementors explaining what to do if they want to tolerate fault types other than process failures.
Changes to the Text
Impact on Implementations
No impact on implementations (beyond making it clearer what to do).
Impact on Users
Clarifies expectations for both users and implementors.
References and Pull Requests
https://github.com/mpi-forum/mpi-standard/pull/947