mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

ULFM Fault Tolerance (slice 2: agree) #582

Open abouteiller opened 2 years ago

abouteiller commented 2 years ago

Problem

The monolithic ULFM proposal has been split into morsels so that the MPI Forum can focus on individual topics.

Main topic issue https://github.com/mpi-forum/mpi-issues/issues/20

Proposal

The second topic slice contains the following concepts for communicators:

Changes to the Text

Addition of an FT chapter containing the proposed constructs

Read text (Sept'23): https://github.com/mpi-forum/mpi-standard/pull/715/commits/9e81233953a280f867eb48fbe890f5108a5ed9af

No-no reading (diff from Sept'23): https://github.com/mpi-forum/mpi-standard/pull/715/commits/58283a760a35934c0331744f8245e552644d252a

Impact on Implementations

Implementations may optionally implement fault tolerance. All implementations need to add the procedure MPI_COMM_AGREE; implementations that do not support FT can provide stubs that are not fault tolerant, based on MPI_ALLREDUCE.
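
As a rough illustration of what such a stub would compute, the sketch below models the result in plain C without any MPI dependency. It assumes the agreement value is the bitwise AND of the integer flags contributed by all ranks, which a non-fault-tolerant stub could obtain with a single MPI_ALLREDUCE using MPI_BAND over one MPI_INT. The helper name `model_agree` and the array-of-flags representation are hypothetical and are not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the proposal): models the value that
 * every rank would receive from a non-fault-tolerant agree stub.
 * flags[i] stands in for the flag contributed by rank i; the result is
 * the bitwise AND of all contributions (assumed agreement semantics). */
static int model_agree(const int *flags, int nranks)
{
    assert(flags != NULL && nranks > 0);

    int result = ~0;              /* all bits set: identity for bitwise AND */
    for (int i = 0; i < nranks; i++) {
        result &= flags[i];       /* a bit survives only if every rank set it */
    }
    return result;
}
```

With this model, a phase in which every rank contributes 1 yields 1, and a single rank contributing 0 makes the agreed value 0 everywhere. Unlike a fault-tolerant agreement, a stub built this way gives no guarantee if a process fails during the call.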

Impact on Users

Users can react to fault events, validate progress in collective phases, and synchronize knowledge of failures across ranks. Slice 3 will add features for repairing communicators, as needed to use collective operations and process respawning after a fault.

References and Pull Requests

https://github.com/mpi-forum/mpi-standard/pull/715

wesbland commented 1 month ago

This passed a no-no vote on 2023-12-04.

Yes No Abstain
27 0 3

wesbland commented 1 month ago

This failed ballot quorum on a 1st vote on 2023-12-05.

Yes No Abstain
17 4 9