mpi-forum / mpi-forum-historic

Migration of old MPI Forum Trac Tickets to GitHub. New issues belong on mpi-forum/mpi-issues.
http://www.mpi-forum.org

User-Level Failure Mitigation #323

Open mpiforumbot opened 8 years ago

mpiforumbot commented 8 years ago

Originally by bosilca on 2012-02-24 16:36:42 -0600


323-E: User-Level Failure Mitigation

Votes

no votes yet.

Description

This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.

Some rationale and details are in the Wiki: https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/User_Level_Failure_Mitigation

The working repository for the text of the document is available on Bitbucket (read access given on request): https://bitbucket.org/bosilca/mpi3ft

Note: the versions attached to the ticket are updated only loosely and are not always current. Please request access to the repository to obtain the latest revision, especially between forum meetings.

More information on the prototype implementation in Open MPI can be found here: http://fault-tolerance.org/

Proposed Solution

Process errors denote the impossibility of providing the normal MPI semantics during an operation (as observed by a particular process). The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.
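
For illustration only, a minimal sketch of the intended usage pattern, assuming the communicator's error handler has been set to MPI_ERRORS_RETURN; the function and error-class names follow the ULFM proposal as exposed (with an MPIX_ prefix) by the Open MPI prototype and are not final:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ extensions provided by the Open MPI ULFM prototype */

/* Broadcast that reacts to a process failure by rebuilding the communicator.
 * Requires MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN) beforehand. */
static int resilient_bcast(void *buf, int count, MPI_Datatype type,
                           int root, MPI_Comm *comm)
{
    int rc = MPI_Bcast(buf, count, type, root, *comm);
    if (MPIX_ERR_PROC_FAILED == rc || MPIX_ERR_REVOKED == rc) {
        MPI_Comm shrunk;
        MPIX_Comm_revoke(*comm);           /* interrupt operations pending on the old comm */
        MPIX_Comm_shrink(*comm, &shrunk);  /* replacement comm without the failed processes */
        MPI_Comm_free(comm);
        *comm = shrunk;                    /* caller can retry on the repaired communicator */
    }
    return rc;
}
```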

Impact on Implementations

Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance still have to provide all the proposed functions, with the correct semantics when no failures occur. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.

Impact on Applications / Users

Provides fault tolerance to interested users. Users/implementations that do not care about fault tolerance are not impacted. Performance and code are unchanged.

Alternative Solutions

Stronger consistency models are more convenient for users, but much more expensive. They can be implemented on top of this proposal as user libraries (or as potential future candidates for standardization, without conflict).
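
For example, a user-level library could layer a "uniform" broadcast on top of the proposed agreement operation so that all surviving processes learn whether the collective succeeded everywhere. A non-normative sketch, again using the MPIX_-prefixed names from the Open MPI prototype:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_agree from the Open MPI ULFM prototype */

/* Uniform broadcast: reports success only if the broadcast succeeded at every
 * surviving process of comm. */
static int uniform_bcast(void *buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm)
{
    int rc   = MPI_Bcast(buf, count, type, root, comm);
    int flag = (MPI_SUCCESS == rc);
    MPIX_Comm_agree(comm, &flag);          /* fault-tolerant agreement (bitwise AND) */
    return flag ? MPI_SUCCESS : MPIX_ERR_PROC_FAILED;
}
```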

History

This submission merges the three separate tickets that previously covered different topics (RMA, I/O, dynamic processes). Everything is now contained in this single document and ticket.

[wiki:"ft/run_through_stabilization" run-through stabilization] proposal was a complete different effort. This current ticket represents a ground-up restart, accounting for the issues raised during this previous work.

Entry for the Change Log

mpiforumbot commented 8 years ago

Originally by bosilca on 2012-02-24 16:37:00 -0600


Attachment added: mpi3ft.pdf (155.7 KiB)

mpiforumbot commented 8 years ago

Originally by davesolt on 2012-02-27 11:12:11 -0600


Reviewed by Dave Solt.

mpiforumbot commented 8 years ago

Originally by bosilca on 2012-02-27 11:43:39 -0600


Attachment added: ticket323.pdf (2319.1 KiB)

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-02-27 11:59:09 -0600


Reviewed by Josh Hursey

mpiforumbot commented 8 years ago

Originally by bgoglin on 2012-02-27 13:35:01 -0600


Reviewed by Brice Goglin

mpiforumbot commented 8 years ago

Originally by davesolt on 2012-02-27 14:43:44 -0600


Reviewed after latest update. Some minor ticket 0 items handed back for future consideration.

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-03-05 18:32:17 -0600


Attachment added: FT Chapter.pdf (931.5 KiB) Reading Text with Ticket 0 text changes applied (no markings) - Just FT Chapter

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-03-05 18:35:18 -0600


The full document with the ticket 0 applied - as read during this meeting is available at the link below (too big to attach directly to this ticket): http://osl.iu.edu/~jjhursey/public/mpi-forum/ticket323/mpi-report-ticket-323-ticket-0-applied.pdf

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-03-06 21:33:24 -0600


Attachment added: mpi-report-323-revised-2012-03-06.pdf (2324.7 KiB) New 323 ticket for reading on March 7, 2012

mpiforumbot commented 8 years ago

Originally by bouteill on 2012-03-30 16:22:22 -0500


Attachment added: mpi3forum.pdf (2881.5 KiB) The slides of the first reading, used during the Chicago meeting (3-5-12)

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-05-01 14:48:36 -0500


Attachment added: 2012-05-01-mpi-ft-draft.pdf (2324.7 KiB) An updated draft of the document to be used for teleconf discussions

mpiforumbot commented 8 years ago

Originally by jjhursey on 2012-05-21 12:05:57 -0500


A link to the Open MPI prototype Beta 1 release:

mpiforumbot commented 8 years ago

Originally by bouteill on 2012-05-21 19:07:34 -0500


Attachment added: mpi-report-17clean1.pdf (2325.5 KiB) Ticket 0 changes from previous reading, without renamings - for review by the WG

mpiforumbot commented 8 years ago

Originally by bouteill on 2012-05-23 13:57:15 -0500


Attachment added: mpi-report-17complete1.pdf (2327.7 KiB) The complete ticket 323 with all ticket 0 changes from the previous reading. Some are automatic renamings and add clutter; readers are encouraged to use the "clean" version instead.

mpiforumbot commented 8 years ago

Originally by bouteill on 2012-05-23 14:14:07 -0500


Documents for Japan Reading

The following two attachments are to be used during the reading in Japan next week.

mpiforumbot commented 8 years ago

Originally by bosilca on 2012-05-28 22:45:00 -0500


Attachment added: MPI3ft-paragraph-change.pdf (656.3 KiB) The explanation of the paragraph change in the FT chapter

mpiforumbot commented 8 years ago

Originally by jsquyres on 2012-06-20 09:54:12 -0500


First vote failed at the Japan Forum meeting, May 2012. Moved to "author rework".

mpiforumbot commented 8 years ago

Originally by bouteill on 2012-07-20 18:26:53 -0500


Attachment added: mpi-report.pdf (2323.6 KiB) Same as Japan read, without any ticket0

mpiforumbot commented 8 years ago

Originally by bouteill on 2014-02-15 01:27:16 -0600


Attachment added: mpi-report-ticket323-r179-20140215.pdf (2683.6 KiB) Final draft for March 2014 reading

mpiforumbot commented 8 years ago

Originally by @wbland on 2014-05-14 13:44:47 -0500


Attachment added: mpi-report-ticket323-r242-20140514.pdf (2681.6 KiB) Draft to be read at June 2014 meeting

mpiforumbot commented 8 years ago

Originally by bouteill on 2014-05-18 15:19:41 -0500


Attachment added: mpi31-ticket323-r252-20140518.pdf (2684.2 KiB) Document to be read in Chicago in June 2014

mpiforumbot commented 8 years ago

Originally by bouteill on 2014-06-04 10:40:09 -0500


Attachment added: ft.r179-252.pdf (229.0 KiB) Differences between March document and June document

mpiforumbot commented 8 years ago

Originally by bouteill on 2014-09-15 02:44:04 -0500


Attachment added: ft.pdf (216.5 KiB) RMA changes for sept. JP meeting

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-19 17:50:29 -0600


What does "associated communication object" mean in the following text?

"The operation is collective, and the process appears in one of the groups of the associated communication object."

For example, MPI_COMM_CREATE_GROUP takes both a communicator and group argument. Processes in the comm but not the group do not call this function and I'd argue are not "involved", to use the term in the previous sentence.
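
For concreteness, a hypothetical fragment: only the even ranks ever call MPI_COMM_CREATE_GROUP below, even though every rank of MPI_COMM_WORLD holds the communicator argument.

```c
#include <mpi.h>
#include <stdlib.h>

/* Only processes listed in even_group enter MPI_Comm_create_group; ranks of
 * MPI_COMM_WORLD that are not in the group never make the call and are
 * therefore not "involved" in it. */
int main(int argc, char **argv)
{
    MPI_Comm  even_comm = MPI_COMM_NULL;
    MPI_Group world_group, even_group;
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int  nranks = (size + 1) / 2;
    int *ranks  = malloc(nranks * sizeof(int));
    for (i = 0; i < nranks; i++) ranks[i] = 2 * i;        /* even ranks only */

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, nranks, ranks, &even_group);

    if (rank % 2 == 0)   /* odd ranks skip the call entirely */
        MPI_Comm_create_group(MPI_COMM_WORLD, even_group, 0, &even_comm);

    if (even_comm != MPI_COMM_NULL) MPI_Comm_free(&even_comm);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    free(ranks);
    MPI_Finalize();
    return 0;
}
```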

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-19 18:06:57 -0600


Here is my proposed amendment to the RMA FT semantics to make it useful in the context of non-Byzantine fault-tolerance:

OLD

When an operation on a window raises an exception related to process failure, the state of all data held in memory exposed by that window becomes undefined.

NEW

When an operation on a window raises an exception related to process failure, the state of any memory exposed by that window becomes undefined if (1) The memory could have been updated by an RMA operation during the most recent phase; (2) The window is a shared-memory window.

If the user knows that a window could not have been updated by an RMA operation, either because of the structure of the communication pattern or because the phase does not update the window, the data is well-defined, at least from an MPI perspective.

The changes are:

  • Remove "data held in memory" because this has no useful meaning in the context of RMA.
  • Allow window memory the application knows is untouched to remain well-defined.
  • Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.

This is consistent with the statement by the FT WG that Byzantine fault-tolerance is out-of-scope.

I do not know how to define phase clearly yet because I haven't figured out how to delineate an FT epoch sufficiently.
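
As a non-normative sketch of how the NEW wording would play out, assume that fence synchronizations delimit a "phase" and that the windows' error handlers are set to MPI_ERRORS_RETURN (win_a, win_b, src, count, target_rank, rc_a and rc_b are set up elsewhere):

```c
/* Open a phase on both windows. */
MPI_Win_fence(0, win_a);
MPI_Win_fence(0, win_b);

/* Only win_a is the target of an RMA operation in this phase. */
MPI_Put(src, count, MPI_DOUBLE, target_rank, 0, count, MPI_DOUBLE, win_a);

/* If a process failure is reported while closing the phase, the NEW text
 * leaves win_a's data undefined (it could have been updated), while win_b's
 * data stays well-defined because no RMA operation could have touched it. */
rc_a = MPI_Win_fence(0, win_a);
rc_b = MPI_Win_fence(0, win_b);
```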

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-19 18:14:40 -0600


This means that page 5 line 11 of the latest FT proposal must be amended somehow, as it pertains to the use of the phrase "epoch closing" (which should be "epoch-closing", no?), unless you deliberately mean to exclude MPI_WIN_FLUSH(_LOCAL)(_ALL) and MPI_WIN_SYNC from the list of functions that must raise a process failure exception. And if they are excluded, then their relationship to FT is ambiguous, since they are neither communication operations nor epoch-closing synchronization.

I suppose that we should treat MPI_WIN_FLUSH_LOCAL(_ALL) differently from MPI_WIN_FLUSH(_ALL), since the former is a local operation and the latter is a nonlocal one. Given that MPI_WIN_FLUSH(_ALL) induce remote completion, they will detect remote process failures and thus can be required to raise these without introducing unreasonable overhead.
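
A passive-target fragment illustrating the distinction (non-normative; assumes the window's error handler is MPI_ERRORS_RETURN and that buf, n, target, win and rc are declared elsewhere):

```c
MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
MPI_Put(buf, n, MPI_INT, target, 0, n, MPI_INT, win);

MPI_Win_flush_local(target, win);  /* local completion only: buf is reusable        */
rc = MPI_Win_flush(target, win);   /* remote completion: can observe a target failure */

MPI_Win_unlock(target, win);
```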

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-01-20 07:57:23 -0600


Replying to jhammond:

What does "associated communication object" mean in the following text?

"The operation is collective, and the process appears in one of the groups of the associated communication object."

For example, MPI_COMM_CREATE_GROUP takes both a communicator and group argument. Processes in the comm but not the group do not call this function and I'd argue are not "involved", to use the term in the previous sentence.

Jeff, you are correct. What about "The operation is collective, and the process appears in one of the groups over which the communication operation spans." ?

mpiforumbot commented 8 years ago

Originally by @wbland on 2015-01-20 13:10:58 -0600


The reason we ended up with the "data held in memory" text is because we used to have what you proposed and it was pointed out by someone (I don't remember who) that this could imply that the memory itself may be unusable for future usage. Thus we changed it to clearly state that the data is what was undefined. The memory itself could be reused.

The other issue that was raised when we had text very similar to this is that by the definitions in the RMA chapter, it's legal for an implementation to do something nasty during an MPI_PUT like overwrite an entire window with garbage, then go back and put in the correct data afterward. I think this was for cases where the network hardware might need to write larger chunks of memory than the user was asking for. In that case, it would be possible for memory that the user thinks wasn't touched to actually get trashed because of implementation details. I don't have an exact page/line citation for this in the RMA chapter, but I can look for it.

The shared memory exception is good since that was another one of the problems that was raised. We would need to make it clear that if either 1 or 2 is true, then things are undefined.

Replying to jhammond:

The changes are:

  • Remove "data held in memory" because this has no useful meaning in the context of RMA.
  • Allow window memory the application knows is untouched to remain well-defined.
  • Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-21 00:41:58 -0600


Then how about we use "data" instead of "memory" since that makes it clear that the window data is the issue but the DRAM cells might be okay.

The implementation of RMA that you described should be banned (I'll create the ticket as soon as you find the text that allows it), as should any discussion based upon it :-) That's basically Byzantine fault-tolerance except without the fault-tolerance. Let's just call that a Byzantine implementation and excise it from our minds.

Replying to wbland:

The reason we ended up with the "data held in memory" text is because we used to have what you proposed and it was pointed out by someone (I don't remember who) that this could imply that the memory itself may be unusable for future usage. Thus we changed it to clearly state that the data is what was undefined. The memory itself could be reused.

The other issue that was raised when we had text very similar to this is that by the definitions in the RMA chapter, it's legal for an implementation to do something nasty during an MPI_PUT like overwrite an entire window with garbage, then go back and put in the correct data afterward. I think this was for cases where the network hardware might need to write larger chunks of memory than the user was asking for. In that case, it would be possible for memory that the user thinks wasn't touched to actually get trashed because of implementation details. I don't have an exact page/line citation for this in the RMA chapter, but I can look for it.

The shared memory exception is good since that was another one of the problems that was raised. We would need to make it clear that if either 1 or 2 is true, then things are undefined.

Replying to jhammond:

The changes are:

  • Remove "data held in memory" because this has no useful meaning in the context of RMA.
  • Allow window memory the application knows is untouched to remain well-defined.
  • Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-01-21 00:44:43 -0600


Replying to bouteill:

Replying to jhammond:

What does "associated communication object" mean in the following text?

"The operation is collective, and the process appears in one of the groups of the associated communication object."

For example, MPI_COMM_CREATE_GROUP takes both a communicator and group argument. Processes in the comm but not the group do not call this function and I'd argue are not "involved", to use the term in the previous sentence.

Jeff, you are correct. What about "The operation is collective, and the process appears in one of the groups over which the communication operation spans." ?

I don't have strong feelings about the wording, but we might try to reuse the language from e.g. MPI_COMM_CREATE_GROUP to make it easy to understand. I'm not sure "span" has a clear meaning in MPI.

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-02-03 14:50:24 -0600


Replying to jhammond:

This means that page 5 line 11 of the latest FT proposal must be amended somehow, as it pertains to the use of the phrase "epoch closing" (which should be "epoch-closing", no?), unless you deliberately mean to exclude MPI_WIN_FLUSH(_LOCAL)(_ALL) and MPI_WIN_SYNC from the list of functions that must raise a process failure exception. And if they are excluded, then their relationship to FT is ambiguous, since they are neither communication operations nor epoch-closing synchronization.

I suppose that we should treat MPI_WIN_FLUSH_LOCAL(_ALL) differently from MPI_WIN_FLUSH(_ALL), since the former is a local operation and the latter is a nonlocal one. Given that MPI_WIN_FLUSH(_ALL) induce remote completion, they will detect remote process failures and thus can be required to raise these without introducing unreasonable overhead.

Ok, thinking more about this I came to the conclusion that the current text is correct: WIN_FLUSH is not local, but it is ordering more than remote completion, so it may not always detect errors. If it does (when the particular implementation does guarantee remote completion), it will raise an exception (as is possible in any cases), when the implementation is not synchronizing, it will not. Mandating the raising of the exception may make it more expensive.

We may want to add some rationale/advices to remember why we came to that conclusion (if you agree with me here) ?

mpiforumbot commented 8 years ago

Originally by jhammond on 2015-02-03 15:32:34 -0600


Replying to bouteill:

Replying to jhammond:

This means that page 5 line 11 of the latest FT proposal must be amended somehow, as it pertains to the use of the phrase "epoch closing" (which should be "epoch-closing", no?), unless you deliberately mean to exclude MPI_WIN_FLUSH(_LOCAL)(_ALL) and MPI_WIN_SYNC from the list of functions that must raise a process failure exception. And if they are excluded, then their relationship to FT is ambiguous, since they are neither communication operations nor epoch-closing synchronization.

I suppose that we should treat MPI_WIN_FLUSH_LOCAL(_ALL) differently from MPI_WIN_FLUSH(_ALL), since the former is a local operation and the latter is a nonlocal one. Given that MPI_WIN_FLUSH(_ALL) induce remote completion, they will detect remote process failures and thus can be required to raise these without introducing unreasonable overhead.

Ok, thinking more about this I came to the conclusion that the current text is correct: WIN_FLUSH is not local, but it is ordering more than remote completion, so it may not always detect errors. If it does (when the particular implementation does guarantee remote completion), it will raise an exception (as is possible in any cases), when the implementation is not synchronizing, it will not. Mandating the raising of the exception may make it more expensive.

It is not "ordering more than remote completion". Are you confusing MPI-3 RMA semantics with OpenSHMEM (e.g. shmem_fence and shmem_quiet)?

From MPI-3 11.5.4:

MPI_WIN_FLUSH completes all outstanding RMA operations initiated by the calling process to the target rank on the specified window. The operations are completed both at the origin and at the target.

Remote completion implies global visibility, which is unconditionally a remote synchronization operation.

The ordering issue is only relevant to accumulate operations, not MPI_Put and MPI_Get, so we should ignore that here. Whatever is true of FT-RMA needs to be true irrespective of ordering.

We may want to add some rationale/advices to remember why we came to that conclusion (if you agree with me here) ?

In the latest MPI FT draft, section 15.2.4 line 38 "the operation closing the containing epoch" needs to be changed to something like "operations that synchronize RMA operations remotely (e.g. MPI_WIN_FENCE, MPI_WIN_(TEST,WAIT) and MPI_WIN_FLUSH(_ALL) but not MPI_WIN_COMPLETE, MPI_WIN_FLUSH_LOCAL(_ALL) or MPI_WIN_SYNC)". The parenthetical enumeration is pedantic but useful for the reader's benefit.

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-02-04 07:22:23 -0600


Replying to jhammond:

Replying to bouteill:

Replying to jhammond:

This means that page 5 line 11 of the latest FT proposal must be amended somehow, as it pertains to the use of the phrase "epoch closing" (which should be "epoch-closing", no?), unless you deliberately mean to exclude MPI_WIN_FLUSH(_LOCAL)(_ALL) and MPI_WIN_SYNC from the list of functions that must raise a process failure exception. And if they are excluded, then their relationship to FT is ambiguous, since they are neither communication operations nor epoch-closing synchronization.

I suppose that we should treat MPI_WIN_FLUSH_LOCAL(_ALL) differently from MPI_WIN_FLUSH(_ALL), since the former is a local operation and the latter is a nonlocal one. Given that MPI_WIN_FLUSH(_ALL) induce remote completion, they will detect remote process failures and thus can be required to raise these without introducing unreasonable overhead.

Ok, thinking more about this I came to the conclusion that the current text is correct: WIN_FLUSH is not local, but it is ordering more than remote completion, so it may not always detect errors. If it does (when the particular implementation does guarantee remote completion), it will raise an exception (as is possible in any cases), when the implementation is not synchronizing, it will not. Mandating the raising of the exception may make it more expensive.

It is not "ordering more than remote completion". Are you confusing MPI-3 RMA semantics with OpenSHMEM (e.g. shmem_fence and shmem_quiet)?

From MPI-3 11.5.4:

MPI_WIN_FLUSH completes all outstanding RMA operations initiated by the calling process to the target rank on the specified window. The operations are completed both at the origin and at the target.

Remote completion implies global visibility, which is unconditionally a remote synchronization operation.

The ordering issue is only relevant to accumulate operations, not MPI_Put and MPI_Get, so we should ignore that here. Whatever is true of FT-RMA needs to be true irrespective of ordering.

We may want to add some rationale/advices to remember why we came to that conclusion (if you agree with me here) ?

In the latest MPI FT draft, section 15.2.4 line 38 "the operation closing the containing epoch" needs to be changed to something like "operations that synchronize RMA operations remotely (e.g. MPI_WIN_FENCE, MPI_WIN_(TEST,WAIT) and MPI_WIN_FLUSH(_ALL) but not MPI_WIN_COMPLETE, MPI_WIN_FLUSH_LOCAL(_ALL) or MPI_WIN_SYNC)". The parenthetical enumeration is pedantic but useful for the reader's benefit.

I created the following issue on the ULFM repo to track progress on this item: https://bitbucket.org/bosilca/mpi3ft/issue/16/mpi_win_flush-error-reporting

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-02-04 07:32:31 -0600


Replying to jhammond:

Here is my proposed amendment to the RMA FT semantics to make it useful in the context of non-Byzantine fault-tolerance:

OLD

When an operation on a window raises an exception related to process failure, the state of all data held in memory exposed by that window becomes undefined.

NEW

When an operation on a window raises an exception related to process failure, the state of any memory exposed by that window becomes undefined if (1) The memory could have been updated by an RMA operation during the most recent phase; (2) The window is a shared-memory window.

If the user knows that a window could not have been updated by an RMA operation, either because of the structure of the communication pattern or because the phase does not update the window, the data is well-defined, at least from an MPI perspective.

The changes are:

  • Remove "data held in memory" because this has no useful meaning in the context of RMA.
  • Allow window memory the application knows is untouched to remain well-defined.
  • Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.

This is consistent with the statement by the FT WG that Byzantine fault-tolerance is out-of-scope.

I do not know how to define phase clearly yet because I haven't figured out how to delineate an FT epoch sufficiently.

This issue on the ULFM repo tracks progress toward resolution of this proposal https://bitbucket.org/bosilca/mpi3ft/issue/17/exposed-memory-damaged-when-failures

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-02-04 07:36:30 -0600


Replying to jhammond:

Replying to bouteill:

Replying to jhammond:

What does "associated communication object" mean in the following text?

"The operation is collective, and the process appears in one of the groups of the associated communication object."

For example, MPI_COMM_CREATE_GROUP takes both a communicator and group argument. Processes in the comm but not the group do not call this function and I'd argue are not "involved", to use the term in the previous sentence.

Jeff, you are correct. What about "The operation is collective, and the process appears in one of the groups over which the communication operation spans." ?

I don't have strong feelings about the wording, but we might try to reuse the language from e.g. MPI_COMM_CREATE_GROUP to make it easy to understand. I'm not sure "span" has a clear meaning in MPI.

Progress on this bug is tracked on the ULFM repo: https://bitbucket.org/bosilca/mpi3ft/issue/18/involved-and-groups-of-the-associated-comm

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-02-13 14:32:03 -0600


Replying to bouteill:

Replying to jhammond:

Replying to bouteill:

Replying to jhammond:

What does "associated communication object" mean in the following text?

"The operation is collective, and the process appears in one of the groups of the associated communication object."

For example, MPI_COMM_CREATE_GROUP takes both a communicator and group argument. Processes in the comm but not the group do not call this function and I'd argue are not "involved", to use the term in the previous sentence.

Jeff, you are correct. What about "The operation is collective, and the process appears in one of the groups over which the communication operation spans." ?

I don't have strong feelings about the wording, but we might try to reuse the language from e.g. MPI_COMM_CREATE_GROUP to make it easy to understand. I'm not sure "span" has a clear meaning in MPI.

Progress on this bug is tracked on the ULFM repo: https://bitbucket.org/bosilca/mpi3ft/issue/18/involved-and-groups-of-the-associated-comm

proposed alternative text: https://bitbucket.org/bosilca/mpi3ft/pull-request/55/minor-tuning-of-involved-definitions/

```diff
-    \item The operation is collective, and the process appears in one of the
-        groups {of the associated communication object}.
+    \item The process is in the group over which the operation is collective.

-    \item The process is a specified or matched destination or source in a
+    \item The process is a destination or a specified or matched source in a
         point-to-point communication.

     \item The operation is an \const{MPI\_ANY\_SOURCE} receive operation and the
-        failed process belongs to the source group.
+        process belongs to the source group.

     \item The process is a specified target in a remote memory operation.
```
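
Regarding the MPI_ANY_SOURCE item above, a hedged sketch of the intended behavior, using the MPIX_-prefixed names from the Open MPI ULFM prototype (illustration only, not proposal text; buf, n, tag and comm are declared elsewhere):

```c
MPI_Request req;
MPI_Status  st;
int rc;

/* Wildcard receive: the failure of any process in the source group interrupts it. */
MPI_Irecv(buf, n, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);
rc = MPI_Wait(&req, &st);
if (MPIX_ERR_PROC_FAILED_PENDING == rc) {
    /* The request is still pending.  Acknowledging the currently known
     * failures lets subsequent completion calls ignore them. */
    MPIX_Comm_failure_ack(comm);
    rc = MPI_Wait(&req, &st);   /* can now complete from a surviving sender */
}
```
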
mpiforumbot commented 8 years ago

Originally by bouteill on 2015-03-02 00:12:46 -0600


Attachment added: mpi31-t323-r419-20150301.pdf (2686.8 KiB)

mpiforumbot commented 8 years ago

Originally by @wbland on 2015-04-09 15:30:51 -0500


Tickets #325 (RMA) and #326 (I/O) have been reopened as homes for the RMA and I/O portions of ULFM. The goal is to make reading the ticket simpler and to allow the less contentious portions of ULFM (communicators and files) to make progress independently of the more contentious one (RMA). Obviously, the intention is to get all portions of FT into the same version of the standard; splitting them into multiple tickets is simply a way to keep each piece manageable.