mpiforumbot opened 8 years ago
Originally by davesolt on 2012-02-27 11:12:11 -0600
Reviewed by Dave Solt.
Originally by jjhursey on 2012-02-27 11:59:09 -0600
Reviewed by Josh Hursey
Originally by bgoglin on 2012-02-27 13:35:01 -0600
Reviewed by Brice Goglin
Originally by davesolt on 2012-02-27 14:43:44 -0600
Reviewed after latest update. Some minor ticket 0 items handed back for future consideration.
Originally by jjhursey on 2012-03-05 18:32:17 -0600
Attachment added: FT Chapter.pdf
(931.5 KiB)
Reading Text with Ticket 0 text changes applied (no markings) - Just FT Chapter
Originally by jjhursey on 2012-03-05 18:35:18 -0600
The full document with the ticket 0 applied - as read during this meeting is available at the link below (too big to attach directly to this ticket): http://osl.iu.edu/~jjhursey/public/mpi-forum/ticket323/mpi-report-ticket-323-ticket-0-applied.pdf
Originally by jjhursey on 2012-03-06 21:33:24 -0600
Attachment added: mpi-report-323-revised-2012-03-06.pdf
(2324.7 KiB)
New 323 ticket for reading on March 7, 2012
Originally by bouteill on 2012-03-30 16:22:22 -0500
Attachment added: mpi3forum.pdf
(2881.5 KiB)
The slides of the first reading, used during the Chicago meeting (3-5-12)
Originally by jjhursey on 2012-05-01 14:48:36 -0500
Attachment added: 2012-05-01-mpi-ft-draft.pdf
(2324.7 KiB)
An updated draft of the document to be used for teleconf discussions
Originally by jjhursey on 2012-05-21 12:05:57 -0500
A link to the Open MPI prototype Beta 1 release:
Originally by bouteill on 2012-05-21 19:07:34 -0500
Attachment added: mpi-report-17clean1.pdf
(2325.5 KiB)
Ticket 0 changes from previous reading, without renamings - for review by the WG
Originally by bouteill on 2012-05-23 13:57:15 -0500
Attachment added: mpi-report-17complete1.pdf
(2327.7 KiB)
The complete ticket 323 with all ticket 0 changes from the previous reading. Some are automatic renamings and clutter the diff; for reference, readers are encouraged to use the "clean" version.
Originally by bouteill on 2012-05-23 14:14:07 -0500
The following two attachments are to be used during the reading in Japan next week.
Originally by bosilca on 2012-05-28 22:45:00 -0500
Attachment added: MPI3ft-paragraph-change.pdf
(656.3 KiB)
The explanation of the paragraph change in the FT chapter
Originally by jsquyres on 2012-06-20 09:54:12 -0500
First vote failed at the Japan Forum meeting, May 2012. Moved to "author rework".
Originally by bouteill on 2012-07-20 18:26:53 -0500
Attachment added: mpi-report.pdf
(2323.6 KiB)
Same as the Japan reading, without any ticket 0 changes.
Originally by bouteill on 2014-02-15 01:27:16 -0600
Attachment added: mpi-report-ticket323-r179-20140215.pdf
(2683.6 KiB)
Final draft for March 2014 reading
Originally by @wbland on 2014-05-14 13:44:47 -0500
Attachment added: mpi-report-ticket323-r242-20140514.pdf
(2681.6 KiB)
Draft to be read at June 2014 meeting
Originally by bouteill on 2014-05-18 15:19:41 -0500
Attachment added: mpi31-ticket323-r252-20140518.pdf
(2684.2 KiB)
Document to be read in Chicago in June 2014
Originally by bouteill on 2014-06-04 10:40:09 -0500
Attachment added: ft.r179-252.pdf
(229.0 KiB)
Differences between March document and June document
Originally by bouteill on 2014-09-15 02:44:04 -0500
Attachment added: ft.pdf
(216.5 KiB)
RMA changes for the September 2014 Japan meeting.
Originally by jhammond on 2015-01-19 17:50:29 -0600
What does "associated communication object" mean in the following text?
"The operation is collective, and the process appears in one of the groups of the associated communication object."
For example, MPI_COMM_CREATE_GROUP takes both a communicator and a group argument. Processes in the comm but not the group do not call this function, and I'd argue are not "involved", to use the term in the previous sentence.
Originally by jhammond on 2015-01-19 18:06:57 -0600
Here is my proposed amendment to the RMA FT semantics to make it useful in the context of non-Byzantine fault-tolerance:
OLD
When an operation on a window raises an exception related to process failure, the state of all data held in memory exposed by that window becomes undefined.
NEW
When an operation on a window raises an exception related to process failure, the state of any memory exposed by that window becomes undefined if (1) The memory could have been updated by an RMA operation during the most recent phase; (2) The window is a shared-memory window.
If the user knows that a window could not have been updated by an RMA operation, either because of the structure of the communication pattern or because the phase does not update the window, the data is well-defined, at least from an MPI perspective.
The changes are:
- Remove "data held in memory" because this has no useful meaning in the context of RMA.
- Allow window memory the application knows is untouched to remain well-defined.
- Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.
This is consistent with the statement by the FT WG that Byzantine fault-tolerance is out-of-scope.
I do not know how to define "phase" clearly yet because I haven't figured out how to delineate an FT epoch sufficiently.
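To illustrate the intent of the proposed semantics, here is a minimal sketch of how an application might react under the NEW text. It assumes the error classes of the Open MPI ULFM prototype (MPIX_ERR_PROC_FAILED, via mpi-ext.h), which are not part of the MPI standard, an MPI_ERRORS_RETURN error handler attached to the window, and a hypothetical application-level checkpoint of the RMA-reachable region; it requires an MPI implementation with the FT prototype and an MPI launcher, so it is not a standalone unit test.

```c
/* Sketch only: MPIX_ERR_PROC_FAILED is the Open MPI ULFM prototype name
 * (from <mpi-ext.h>); a standardized version may differ. Assumes the
 * window's error handler is MPI_ERRORS_RETURN so rc is observable. */
#include <mpi.h>
#include <mpi-ext.h>
#include <string.h>

/* checkpoint[] is a hypothetical user-level copy of the part of the
 * window that remote MPI_Put operations may update during a phase. */
void run_phase(MPI_Win win, double *winbuf, const double *checkpoint,
               size_t rma_region_len, int target)
{
    double origin = 1.0;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(&origin, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    int rc = MPI_Win_unlock(target, win);  /* closes the access epoch */

    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED) {
        /* Under the proposed text, only memory that could have been
         * updated by RMA in this phase is undefined; restore just that
         * region from the checkpoint. The rest of the window, and the
         * window object itself, remain usable. */
        memcpy(winbuf, checkpoint, rma_region_len * sizeof(double));
    }
}
```

The point of the amendment is visible in the scope of the memcpy: under the OLD text the entire window contents would have to be considered lost after any process-failure exception, while under the NEW text the recovery cost is proportional to the memory actually reachable by RMA in the failed phase (with shared-memory windows excluded from this guarantee).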
Originally by jhammond on 2015-01-19 18:14:40 -0600
This means that page 5 line 11 of the latest FT proposal must be amended somehow, as it pertains to the use of the phrase "epoch closing" (which should be "epoch-closing", no?), unless you deliberately mean to exclude MPI_WIN_FLUSH(_LOCAL)(_ALL) and MPI_WIN_SYNC from the list of functions that must raise a process failure exception. And if they are excluded, then their relationship to FT is ambiguous, since they are neither communication operations nor epoch-closing synchronization.
I suppose that we should treat MPI_WIN_FLUSH_LOCAL(_ALL) differently from MPI_WIN_FLUSH(_ALL), since the former is a local operation and the latter is a nonlocal one. Given that MPI_WIN_FLUSH(_ALL) induces remote completion, it will detect remote process failures and thus can be required to raise these exceptions without introducing unreasonable overhead.
Originally by bouteill on 2015-01-20 07:57:23 -0600
Replying to jhammond:
Jeff, you are correct. What about "The operation is collective, and the process appears in one of the groups over which the communication operation spans." ?
Originally by @wbland on 2015-01-20 13:10:58 -0600
The reason we ended up with the "data held in memory" text is because we used to have what you proposed and it was pointed out by someone (I don't remember who) that this could imply that the memory itself may be unusable for future usage. Thus we changed it to clearly state that the data is what was undefined. The memory itself could be reused.
The other issue that was raised when we had text very similar to this is that, by the definitions in the RMA chapter, it's legal for an implementation to do something nasty during an MPI_PUT, like overwriting an entire window with garbage and then going back and putting in the correct data afterward. I think this was for cases where the network hardware might need to write larger chunks of memory than the user was asking for. In that case, it would be possible for memory that the user thinks wasn't touched to actually get trashed because of implementation details. I don't have an exact page/line citation for this in the RMA chapter, but I can look for it.
The shared memory exception is good since that was another one of the problems that was raised. We would need to make it clear that if either 1 or 2 is true, then things are undefined.
Replying to jhammond:
The changes are:
- Remove "data held in memory" because this has no useful meaning in the context of RMA.
- Allow window memory the application knows is untouched to remain well-defined.
- Call out shared-memory windows as undefined in any case, because of how this allocation may have to occur.
Originally by jhammond on 2015-01-21 00:41:58 -0600
Then how about we use "data" instead of "memory" since that makes it clear that the window data is the issue but the DRAM cells might be okay.
The implementation of RMA that you described should be banned (I'll create the ticket as soon as you find the text that allows it), as should any discussion based upon it :-) That's basically Byzantine fault-tolerance except without the fault-tolerance. Let's just call that a Byzantine implementation and excise it from our minds.
Originally by jhammond on 2015-01-21 00:44:43 -0600
Replying to bouteill:
I don't have strong feelings about the wording, but we might try to reuse the language from e.g. MPI_COMM_CREATE_GROUP to make it easy to understand. I'm not sure "span" has a clear meaning in MPI.
Originally by bouteill on 2015-02-03 14:50:24 -0600
Replying to jhammond:
Ok, thinking more about this, I came to the conclusion that the current text is correct: MPI_WIN_FLUSH is not local, but it provides ordering more than remote completion, so it may not always detect errors. When a particular implementation does guarantee remote completion, it will raise an exception (as is permitted in any case); when the implementation is not synchronizing, it will not. Mandating the raising of the exception may make it more expensive.
We may want to add some rationale/advice text to record why we came to that conclusion (if you agree with me here)?
Originally by jhammond on 2015-02-03 15:32:34 -0600
Replying to bouteill:
It is not "ordering more than remote completion". Are you confusing MPI-3 RMA semantics with OpenSHMEM (e.g. shmem_fence and shmem_quiet)?
From MPI-3 11.5.4:
MPI_WIN_FLUSH completes all outstanding RMA operations initiated by the calling process to the target rank on the specified window. The operations are completed both at the origin and at the target.
Remote completion implies global visibility, which is unconditionally a remote synchronization operation.
The ordering issue is only relevant to accumulate operations, not MPI_Put and MPI_Get, so we should ignore that here. Whatever is true of FT-RMA needs to be true irrespective of ordering.
We may want to add some rationale/advices to remember why we came to that conclusion (if you agree with me here) ?
In the latest MPI FT draft, section 15.2.4 line 38, "the operation closing the containing epoch" needs to be changed to something like "operations that synchronize RMA operations remotely (e.g. MPI_WIN_FENCE, MPI_WIN_(TEST,WAIT) and MPI_WIN_FLUSH(_ALL), but not MPI_WIN_COMPLETE, MPI_WIN_FLUSH_LOCAL(_ALL) or MPI_WIN_SYNC)". The parenthetical enumeration is pedantic but useful for the reader's benefit.
Originally by bouteill on 2015-02-04 07:22:23 -0600
Replying to jhammond:
I created the following issue on the ULFM repo to track progress on this item: https://bitbucket.org/bosilca/mpi3ft/issue/16/mpi_win_flush-error-reporting
Originally by bouteill on 2015-02-04 07:32:31 -0600
Replying to jhammond:
This issue on the ULFM repo tracks progress toward resolution of this proposal: https://bitbucket.org/bosilca/mpi3ft/issue/17/exposed-memory-damaged-when-failures
Originally by bouteill on 2015-02-04 07:36:30 -0600
Replying to jhammond:
Progress on this bug is tracked on the ULFM repo: https://bitbucket.org/bosilca/mpi3ft/issue/18/involved-and-groups-of-the-associated-comm
Originally by bouteill on 2015-02-13 14:32:03 -0600
proposed alternative text: https://bitbucket.org/bosilca/mpi3ft/pull-request/55/minor-tuning-of-involved-definitions/
```diff
- \item The operation is collective, and the process appears in one of the
-   groups {of the associated communication object}.
+ \item The process is in the group over which the operation is collective.
- \item The process is a specified or matched destination or source in a
+ \item The process is a destination or a specified or matched source in a
    point-to-point communication.
  \item The operation is an \const{MPI\_ANY\_SOURCE} receive operation and the
-   failed process belongs to the source group.
+   process belongs to the source group.
  \item The process is a specified target in a remote memory operation.
```
Originally by bouteill on 2015-03-02 00:12:46 -0600
Attachment added: mpi31-t323-r419-20150301.pdf
(2686.8 KiB)
Originally by @wbland on 2015-04-09 15:30:51 -0500
Tickets #325 (RMA) and #326 (I/O) have been reopened as a home for the RMA and I/O portions of ULFM. The goal here is to make reading the ticket simpler and to allow the less contentious portions of ULFM (communicators and files) to make progress independently of the more contentious ones (RMA). Obviously, the intention is to get all portions of FT in to the same version of the standard. Splitting them into multiple tickets is simply a way to make things simpler.
Originally by bosilca on 2012-02-24 16:36:42 -0600
323-E: User-Level Failure Mitigation
Votes
No votes yet.
Description
This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.
Some rationale and details are in the Wiki: https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/User_Level_Failure_Mitigation
The working repository for the text of the document is available on Bitbucket (read access given on request): https://bitbucket.org/bosilca/mpi3ft
Note: the versions attached to the ticket are updated loosely and are not always current. Please request access to the repository to obtain the latest revision, especially between forum meetings.
More information on the prototype implementation in Open MPI can be found here: http://fault-tolerance.org/
Proposed Solution
Process errors denote the impossibility of providing normal MPI semantics during an operation (as observed by a particular process). The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.
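As a sketch of how these pieces fit together, an application might detect a failure during a collective, revoke the damaged communicator, and build a replacement from the survivors. The MPIX_-prefixed names below are those of the Open MPI ULFM prototype (via mpi-ext.h) and are assumptions here; any standardized names may differ. Running this requires an MPI implementation with the FT prototype, so it is an illustration rather than a testable unit.

```c
/* Sketch only: MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED
 * and MPIX_ERR_REVOKED are Open MPI ULFM prototype names. The error
 * handler on *comm must be MPI_ERRORS_RETURN for rc to be observable. */
#include <mpi.h>
#include <mpi-ext.h>

/* Attempt a collective; on process failure, revoke the communicator so
 * every rank observes the failure, then shrink it to the survivors.
 * Returns 1 if the communicator was replaced, 0 otherwise. */
int robust_barrier(MPI_Comm *comm)
{
    int rc = MPI_Barrier(*comm);
    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        MPI_Comm shrunk;
        MPIX_Comm_revoke(*comm);          /* interrupt pending operations everywhere */
        MPIX_Comm_shrink(*comm, &shrunk); /* collective over the surviving processes */
        MPI_Comm_free(comm);
        *comm = shrunk;                   /* replacement communication object */
        return 1;                         /* caller may retry, or respawn ranks */
    }
    return 0;
}
```

This mirrors the structure of the proposal: the error class reports the failure, the revoke gives all processes a consistent view that the object is damaged, and the shrink creates the replacement object; an implementation that never raises these error classes simply never enters the recovery branch.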
Impact on Implementations
Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance still have to provide all the proposed functions, with the correct semantics when no failure occurs. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.
Impact on Applications / Users
Provides fault tolerance to interested users. Users and implementations that do not care about fault tolerance are not impacted; performance and code are unchanged.
Alternative Solutions
Stronger consistency models are more convenient for users, but much more expensive. They can be implemented on top of this proposal as user libraries (or as potential future candidates for standardization, without conflict).
History
This submission gathers the three separate tickets for different topics (RMA, I/O, dynamic processes) that were present in the past. Everything is now inside this single document and ticket.
The [wiki:"ft/run_through_stabilization" run-through stabilization] proposal was a completely different effort. The current ticket represents a ground-up restart that accounts for the issues raised during that previous work.
Entry for the Change Log