mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

ULFM Fault Tolerance (slice 1: ack_failed, get_failed, revoke) #581

Open abouteiller opened 1 year ago

abouteiller commented 1 year ago

Problem

The monolithic ULFM proposal has been split into smaller slices so that the MPI Forum can focus on individual topics.

Main topic issue https://github.com/mpi-forum/mpi-issues/issues/20

Proposal

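The first topic slice contains the following concepts for communicators: acknowledging failed processes (MPI_COMM_ACK_FAILED), obtaining the group of failed processes (MPI_COMM_GET_FAILED), and revoking a communicator (MPI_COMM_REVOKE).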

Changes to the Text

Addition of an FT chapter containing the proposed constructs

Impact on Implementations

Implementations may optionally support fault tolerance. All implementations must add the procedures MPI_COMM_REVOKE, MPI_COMM_GET_FAILED, and MPI_COMM_ACK_FAILED; implementations that do not support FT can provide stubs that are not fault tolerant.

Impact on Users

Users can receive fault events and write manager-worker applications and run-through workloads (e.g., stencil, ABFT) that use only point-to-point (P2P) operations. Slices 2 and 3 will add the features for repairing communicators, which are needed to use collectives and to respawn processes after a fault.

References and Pull Requests

https://github.com/mpi-forum/mpi-standard/pull/665

wesbland commented 1 year ago

I believe this is the set of changes for the no-no vote on 2022-09-30:

https://github.com/mpi-forum/mpi-standard/pull/665/files/92f7596e8958dfc3a71bbc83514dec3d3b7dcc07..43864f6499e0e8fdb25534625397761e256c7479

wesbland commented 1 year ago

This issue had a "no-no" vote on 2022-09-30, which passed:

Yes  No  Abstain
28   0   1

wesbland commented 1 year ago

This passed a first vote on 2022-12-07.

Yes  No  Abstain
28   0   5

wesbland commented 1 year ago

This passed a second vote on 2023-02-08.

Yes  No  Abstain
25   0   6