Problem
The monolithic ULFM proposal has been split into morsels so that the MPI Forum can focus on individual topics.
Main topic issue: https://github.com/mpi-forum/mpi-issues/issues/20
Proposal
The second topic slice contains the following concepts for communicators:
Changes to the Text
Addition of an FT chapter containing the proposed constructs
Read text (Sept'23): https://github.com/mpi-forum/mpi-standard/pull/715/commits/9e81233953a280f867eb48fbe890f5108a5ed9af
No-no reading (diff from Sept'23): https://github.com/mpi-forum/mpi-standard/pull/715/commits/58283a760a35934c0331744f8245e552644d252a
Impact on Implementations
Implementing fault tolerance is optional. Implementations need to add the MPI_COMM_AGREE procedure; implementations that do not support FT can provide a stub based on MPI_ALLREDUCE that is not fault tolerant.
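To make the stub idea concrete, here is a minimal sketch in plain C of what a non-fault-tolerant MPI_COMM_AGREE stub would compute. It assumes, following the ULFM proposal, that the agreed value is the bitwise AND of the integer flags contributed by the participating ranks; in a real library this would presumably be a single MPI_Allreduce over one MPI_INT with the MPI_BAND operation. The function name stub_agree_model and the array-of-flags interface are hypothetical and exist only so the reduction can be shown without an MPI runtime.

```c
#include <stddef.h>

/*
 * Hypothetical model, not MPI code: flags[i] stands for the value that
 * rank i passes to the agreement call. The return value is what every
 * rank would observe after a stub that reduces the flags with a
 * bitwise AND (the assumed MPI_COMM_AGREE semantics).
 *
 * Being a model of the non-fault-tolerant stub, it has no notion of
 * failed ranks: every entry in flags[] is assumed to be contributed.
 */
int stub_agree_model(const int *flags, size_t nranks)
{
    int result = ~0; /* identity element for bitwise AND */

    for (size_t i = 0; i < nranks; i++) {
        result &= flags[i];
    }
    return result;
}
```

With flags used as booleans, the result is 1 only if every rank contributed 1, so a single rank reporting a problem is visible to all ranks. The stub gives no guarantee once a process has actually failed, which is the part a fault-tolerant implementation has to add.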
Impact on Users
Users can react to fault events, validate progress in collective phases, and synchronize knowledge of failures across ranks. Slice 3 will add features for repairing communicators, as needed to use collectives again and to respawn processes after a fault.
References and Pull Requests
https://github.com/mpi-forum/mpi-standard/pull/715