mpi-forum / mpi-issues

Tickets for the MPI Forum
http://www.mpi-forum.org/

User-Level Failure Mitigation #20

Open wesbland opened 8 years ago

wesbland commented 8 years ago

Description

This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.

Note: the versions attached to the ticket are updated loosely and are not always current. Please access the repository to see the latest revision, especially between forum meetings.

More information on the prototype implementation in Open MPI can be found here: http://fault-tolerance.org/

Pull Requests: READING MAY 2022: Slice 1: https://github.com/mpi-forum/mpi-standard/pull/665

Current (full) draft proposal (May 2022): 20220509-ulfm-master.pdf rolling diff: https://github.com/mpiwg-ft/mpi-standard/pull/19/files

RFCs: https://github.com/mpiwg-ft/mpi-standard/pull/17 https://github.com/mpiwg-ft/mpi-standard/pull/18 https://github.com/mpiwg-ft/mpi-standard/commit/92f7596e8958dfc3a71bbc83514dec3d3b7dcc07

Proposed Solution

Process errors denote the impossibility of providing the normal MPI semantics during an operation (as observed by a particular process). The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create new communication objects that replace damaged ones.
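
To make the control flow concrete, here is a minimal sketch of the recovery pattern this enables, using the MPIX_-prefixed names of the Open MPI prototype linked above (the names and exact signatures in the draft chapter may differ):

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ extensions of the Open MPI ULFM prototype */

/* Sketch: a collective call with explicit failure handling.  Assumes
 * the communicator's error handler has been set to MPI_ERRORS_RETURN
 * so that process-failure errors are reported instead of aborting. */
static int resilient_allreduce(int *in, int *out, MPI_Comm *comm)
{
    int rc = MPI_Allreduce(in, out, 1, MPI_INT, MPI_SUM, *comm);
    if (rc == MPI_SUCCESS)
        return rc;

    int ec;
    MPI_Error_class(rc, &ec);
    if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
        /* Interrupt pending operations on this communicator at all
         * processes so that everyone eventually reaches the shrink. */
        MPIX_Comm_revoke(*comm);

        /* Build a replacement communicator containing only survivors. */
        MPI_Comm newcomm;
        MPIX_Comm_shrink(*comm, &newcomm);
        MPI_Comm_free(comm);
        *comm = newcomm;
    }
    return rc;
}
```

The application would then retry the operation on the replacement communicator and redistribute its work; the sketch only illustrates the "return control to the application" flow described above.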

Impact on Implementations

Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance have to provide all the proposed functions, with the correct semantics when no failures occur. However, an implementation that never raises an error related to process failures does not have to actually tolerate failures.

Impact on Applications / Users

Provides fault tolerance to interested users. Users/implementations that do not care about fault tolerance are not impacted. They can request that fault tolerance be turned off, or the implementation can inform them that it is not supported.

Connected Issues

This issue only covers the main portion of ULFM (the general fault model, how MPI handles faults generally, and the communicator-based functions). For the sections on RMA, see ticket #21; for the sections on Files, see ticket #22.

Alternative Solutions

Stronger consistency models are more convenient for users, but much more expensive. They can be implemented on top of this proposal as user libraries (or as potential future candidates for standardization, without conflict).

History

The run-through stabilization proposal was a completely different effort. This ticket represents a ground-up restart that accounts for the issues raised during that previous work.

Lots of old conversations and history attached to this issue can be found on the old Trac ticket.

wesbland commented 8 years ago

The most recent copy of the text can be found here.

wesbland commented 8 years ago

PDF: issue-20-21-22.pdf

Pull Request: https://github.com/mpi-forum/mpi-standard/issues/13

abouteiller commented 8 years ago

Same pdf as above, with color annotations to find the changes.

issue-20-21-22-anotated.pdf

abouteiller commented 8 years ago

20160302-Aurelien@mpiforum-ulfmreading.pdf slides used for the reading

abouteiller commented 8 years ago

20160608-Aurelien@mpiforum-ulfmreading.pdf Updated slides

abouteiller commented 8 years ago

Updated PDF with comments from june 2016: issue-20-21-22-a905a5a.pdf

Full diff and changes from the last reading are available at https://github.com/mpi-forum/mpi-standard/pull/13/commits

RolfRabenseifner commented 7 years ago

Please, can you fix two things:

Proposal for that — the index entries should look like:

• fault tolerance --> 601 (BOLD)
• process failure --> 20, <and all the many pages in the other section, like dynamic>
• MPI_FT --> MPI_ERR_PROC_FAILED --> ....
• process fault tolerance, see fault tolerance

abouteiller commented 7 years ago

Latest update for the Dallas meeting. mpi32-report.pdf with colors: dec16-reading-color.pdf (note: the color version includes the post-2-week-deadline changes discussed at https://github.com/mpi-forum/mpi-issues/issues/20#issuecomment-265330446)

Full set of changes vs 3.x https://github.com/mpi-forum/mpi-standard/pull/13/files

Slides highlighting changes since september reading 20161206-mpiforum-ulfmreading.pdf

Changes since the September reading: (removed to reduce the wall of text)

abouteiller commented 7 years ago

The following changes have been applied after the 2-week deadline. Essentially, this adds the MPI qualifier to processes where the term could be ambiguous and makes the text identical to that of COMM_REVOKE.

 \mpitermindex{fault tolerance!revoke}

 \par
-This function notifies all processes in the window \mpiarg{win} that
+This function notifies all \MPI/ processes in the group
+associated with the window \mpiarg{win} that
 this window is revoked. The revocation of a window by any
 process completes RMA operations on \mpiarg{win} at all processes
-and RMA synchronizations raise an error of
+and RMA synchronizations on \mpiarg{win} raise an error of
 class \error{MPI\_ERR\_REVOKED}. This function is not collective and
 therefore does not have a matching call on remote processes. All
 non-failed processes belonging to \mpiarg{win} will be notified of the
@@ -764,18 +765,19 @@ processes detecting failures should use the call \mpifunc{MPI\_WIN\_REVOKE}.
 \mpitermindex{fault tolerance!revoke}

 \par
-This function notifies all processes in the file handle \mpiarg{fh}
+This function notifies all \MPI/ processes in the group
+associated with the file handle \mpiarg{fh}
 that this file handle is revoked. The revocation of a
 file handle by any process completes non-local \MPI/ operations on
 \mpiarg{fh} at all processes by raising an error of class
 \error{MPI\_ERR\_REVOKED}. This function is not collective and
 therefore does not have a matching call on remote processes. All
-non-failed processes belonging to \mpiarg{fh} will be notified of the
+non-failed \MPI/ processes belonging to \mpiarg{fh} will be notified of the
 revocation despite failures.

 \par
 A file handle is revoked at a given process either when

abouteiller commented 7 years ago

DONE:

Long discussion on error propagation between libraries delayed the remainder of the reading.

schulzm commented 7 years ago

I wouldn't say that the long discussion delayed the meeting - instead, I think we had some fundamental and (I think) productive discussions on the model and its alternatives. One of the big takeaways from the discussion for me (and others should chime in as well on this) is that the underlying model in ULFM - the synchronous notification that requires all processes in a communicator to be active in >>that<< communicator - combined with the collective nature of the shrink operator, is a fundamental problem and can lead to deadlocks for certain applications. Even adding an asynchronous shrink, but sticking with the collective model, will not change that. Hacks, like forcing each process to have an active communication request with an unmatched wildcard receive on every communicator, are not a good answer, either.

We also had a lengthy discussion about possible alternatives focusing on asynchronous notifications and non-collective shrinks (at least from a user's perspective) and whether this can and should be hidden in the library, rather than exposed to the user (personally I see no reason why it should be exposed, but that's my personal opinion). While this discussion is, naturally, very early and there is no clear new solution yet, it was clear (at least for me) that such a new solution (which doesn't have the problems stated above) is likely not combinable with ULFM as it is now (were it to go into the standard) without breaking the composability concepts in MPI itself.

dholmes-epcc-ed-ac-uk commented 7 years ago

I think that an investigation of the idea to allow 'sparse' communicators (which came up in this meeting) would be good.

The reasoning for needing shrink is that a new context id is needed for post-failure communication. The reason for needing a new context id is that the ranks will be re-ordered in the new communicator. This is a circular argument.

If the failed rank(s) can be excluded from a communicator without re-ordering the other ranks then the context id does not need to be changed. Shrink could be a local operation - ensure that all future messages from the failed rank(s) are ignored and decrement the size for collective operations.

This seems to circumvent many of the issues with the 'collective shrink and synchronous revoke' approach at the expense of the statement "ranks are always numbered from 0 to (size-1)". There are some implications, including

wesbland commented 7 years ago

@dholmes-epcc-ed-ac-uk I think it's more than just what you mentioned here. The biggest challenge, I believe, that we faced with non-shrinking recovery of communicators is that you'd also have to define how each of the collectives functions in the face of "gaps" in your communicator. For some (MPI_BARRIER), that's relatively easy. For others (MPI_REDUCE with the right/wrong operations), that's really hard. I'm not saying it's impossible, but I think it is probably not worth trying to tackle in the first go-around, especially because it's something that could easily be added later without a bunch of compatibility problems.

dholmes-epcc-ed-ac-uk commented 7 years ago

@wesbland I think that Martin's point was that if ULFM is put in before we know the impact of these other ideas then we could be doing the wrong thing. That could be used as a way to delay ULFM forever and I don't want to advocate or support that. On the other hand, do you have notes/documentation from your previous consideration of "non-shrinking recovery"? I'd like to review that, and the reasoning behind the "could easily be added later" statement. If we could do recovery without needing revoke or shrink, would they be deprecated and removed?

wesbland commented 7 years ago

This turned out to be quite the tome so please forgive the length, but I wanted to capture all of the discussion from yesterday in one place as there were many side discussions that went on.

I want to show the example that @schulzm gave that could potentially cause a deadlock with ULFM as currently specified.

Overlapping Communicator Failures

In this example, there is a green communicator and an orange communicator which overlap. Process 1 is communicating (point-to-point) with process 5 and process 3 is communicating (point-to-point) with process 4. Processes 3 and 5 simultaneously fail and trigger error handlers which revoke both the green and orange communicator. This sends all processes into error handlers where they attempt to repair the two communicators by shrinking. The problem here is that {0,3,4} will be in a call to MPI_COMM_SHRINK(green) while {1,2,5} are in a call to MPI_COMM_SHRINK(orange). This is a deadlock because 1 and 4 are needed in order to shrink either green or orange.
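
A schematic of the per-communicator repair pattern that deadlocks in this scenario (MPIX_-prefixed names from the Open MPI prototype; error handlers assumed to return):

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ names from the Open MPI ULFM prototype */

/* Each process repairs only the communicator on which *it* observed a
 * failure.  With two overlapping communicators revoked at the same
 * time, one group of survivors enters MPIX_Comm_shrink(green) while
 * the other enters MPIX_Comm_shrink(orange); the processes in the
 * overlap are needed by both collective shrinks but sit in only one of
 * them, so neither shrink can complete. */
void naive_repair(MPI_Comm detected_on, MPI_Comm *repaired)
{
    MPIX_Comm_revoke(detected_on);
    MPIX_Comm_shrink(detected_on, repaired);   /* deadlocks in the scenario above */
}
```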

We tried to come up with a variety of scenarios which would let the user escape this situation. I'm probably going to forget some because I didn't take notes quickly enough, so add more if I leave some out.

  1. Aurelien - The user could post an MPI_IRECV(MPI_ANY_SOURCE, some_unused_tag) on MPI_COMM_WORLD (or similar) which would be used to detect failures besides the one that is currently being handled. By doing this before calling MPI_COMM_SHRINK, it could detect that there have been other failures from which it also needs to recover and attempt to do so on a communicator which encompasses all of the failures (see the sketch after this list).
    • Criticisms - This might still be unsafe due to the inherent race conditions present in the asynchronous detection and notification of failures. It is also a really gross way to tell users that they need to do their own failure detection. If this were absolutely required for safety, there should be some sort of API to help instead of a non-matching MPI_IRECV.
  2. Wesley - It might be possible to escape this problem with MPI_ISHRINK (planned for a follow-on issue). You can instead use MPI_TEST to do recovery and check for new failures in the meantime.
    • Criticisms - This might have a similar race condition problem as above, though the non-blocking nature might help. It is still subject to the "ick factor" of having to constantly check for new failures on unrelated communicators while doing recovery.
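
A hedged sketch of the workaround in item 1, assuming MPI_COMM_WORLD as the over-spanning communicator (with its error handler set to return) and a tag (DETECTOR_TAG, chosen arbitrarily here) that the application never matches otherwise:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_ names from the Open MPI ULFM prototype */

#define DETECTOR_TAG 32766   /* hypothetical tag, never matched elsewhere */

/* Returns nonzero if a process failure is known anywhere in
 * MPI_COMM_WORLD, using the pending-wildcard-receive trick. */
static int failure_seen_on_world(void)
{
    static MPI_Request detector = MPI_REQUEST_NULL;
    static char dummy;

    if (detector == MPI_REQUEST_NULL)
        MPI_Irecv(&dummy, 1, MPI_CHAR, MPI_ANY_SOURCE, DETECTOR_TAG,
                  MPI_COMM_WORLD, &detector);

    int flag, ec;
    int rc = MPI_Test(&detector, &flag, MPI_STATUS_IGNORE);
    MPI_Error_class(rc, &ec);
    /* A failure of any potential sender makes the pending wildcard
     * receive raise a process-failure error class (MPIX_ERR_PROC_FAILED,
     * or MPIX_ERR_PROC_FAILED_PENDING while the request stays pending
     * in the prototype). */
    return ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_PROC_FAILED_PENDING;
}
```

Before shrinking the damaged communicator, the application would consult this predicate and, if another failure is known, widen the recovery scope to a communicator that encompasses all failures; distinguishing failures that were already handled would additionally require the failure_ack/get_acked calls.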

In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK). His opinion was that the best way to solve this problem would be to pull the notification and recovery portions of any fault tolerance solution into the MPI library itself and perform automatic recovery. (It's important to note here that automatic recovery in this sense is different than what has been discussed previously, where the intention was to replace failed processes with new processes, which the FTWG believes is not solvable. This is instead advocating for using either an internal shrink or to leave holes in the communicator, as @dholmes-epcc-ed-ac-uk suggested above.)

We then briefly discussed what a solution that works as @schulzm suggested might look like. One solution would be to just bring the revoke and shrink operations (as defined in ULFM) into the MPI library itself and not expose them to the user. Instead of returning MPI_ERR_PROC_FAILED to the user when a failure is detected, the MPI library will automatically begin recovery on that communicator. It would internally "revoke" the communicator and shrink the group of processes to remove failures. Then at some point in the future (probably on the next operation, but it could be delayed without a problem I believe), the user will see a new error, perhaps MPI_COMM_SHRUNK, indicating that the nature of the communicator has changed and the user should find out the new information about the communicator (size, rank, etc.) and then continue using the communicator with its new makeup. This has the benefit of being able to happen simultaneously with failure detection and recovery of as many other communicators as necessary (alleviating the deadlock situation described above), as well as exposing an easy interface for users (automatic, in-place recovery of communicators). However, it also has potential downsides when compared to current ULFM, primarily that it enforces a particular recovery model which the user might not want (the user might be ok with leaving the failures in place and communicating around them in the future, e.g. a simple master-worker style application where failed workers can be simply discarded).
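
Purely for illustration, a hypothetical sketch of what the user-visible side of such automatic, in-place recovery could look like. MPI_COMM_SHRUNK is the speculative error class floated above; it does not exist in MPI or in ULFM and is defined here only as a placeholder:

```c
#include <mpi.h>

/* Placeholder for the speculative error class discussed above;
 * purely illustrative, not an existing MPI or ULFM name. */
#define MPI_COMM_SHRUNK (MPI_ERR_LASTCODE + 1)

/* Hypothetical user-side loop if the library revoked and shrank the
 * communicator internally and merely reported that it did so. */
void iterate(MPI_Comm comm)
{
    int done = 0;
    while (!done) {
        int local = 1, total = 0;
        int rc = MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, comm);
        if (rc != MPI_SUCCESS) {
            int ec;
            MPI_Error_class(rc, &ec);
            if (ec == MPI_COMM_SHRUNK) {
                /* comm was repaired in place: re-query its shape,
                 * redistribute work, and keep using the same handle. */
                int size, rank;
                MPI_Comm_size(comm, &size);
                MPI_Comm_rank(comm, &rank);
                /* ... application-specific redistribution ... */
                continue;
            }
        }
        done = 1;   /* stand-in for the application's convergence test */
    }
}
```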

There are some (probably many) unresolved questions around this:

Whatever we decide to do next, the concerns of @schulzm and @dholmes-epcc-ed-ac-uk (and others) are very valid. Namely, that we have to be very careful with whatever FT solution(s) we pass as a forum. We don't want to pick something that will exclude future solutions and even more importantly, we don't want to pick something that we would later discover has a fundamental flaw.

Hopefully this is a fair representation of everyone's opinions as expressed yesterday. I know that the ULFM authors (myself included) will want to think more about the concerns expressed to make sure we can't come up with a more elegant solution to resolve the problem before trying to re-architect the entire proposal.

wesbland commented 7 years ago

@dholmes-epcc-ed-ac-uk I completely agree with your statement about wanting to make sure we are forward compatible with FT, and my remark about allowing both shrinking and non-shrinking recovery is more hopeful than certain at this point. My quick take is that if we had all the semantics that we discussed yesterday (and that I wrote about above), the particular style of communicator recovery is something that probably could be swappable while still being compatible with the other styles. We'd definitely have to work through that more extensively, though.

As for the issues with leaving holes in the communicators, I don't think that it's an impossible problem. I think it's just a much more difficult problem than using shrinking recovery because it requires many more modifications to the rest of MPI. I don't have specific notes (or if I do, they're from years ago and hard to find), but I think my previous example of the two different collectives demonstrates how tricky the semantics of collectives could be if you didn't have dense communicators. I can expand a bit here:

If you have an MPI_REDUCE on a communicator with holes, what do the empty processes contribute? If you are doing MPI_SUM, it's simple to say that they contribute 0, but if you are doing multiplication or division, that's not a good answer. You could also say that they contribute nothing, but then it makes MPI_ALLTOALL very complicated because you would need to leave holes in your output buffers that you provide to users. None of these are impossible problems, just ones that we preferred to avoid if possible.
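
To make the difficulty concrete, a hedged sketch of the "contribute the identity element" idea (the function name is illustrative, not part of any proposal); the point is that no identity is known for arbitrary user-defined operations, and "contribute nothing" changes the output layout of operations like MPI_ALLTOALL:

```c
#include <limits.h>
#include <mpi.h>

/* Sketch (for MPI_INT data only): substituting a neutral contribution
 * for a failed rank requires knowing the identity element of the
 * reduction operation.  Returns 1 if an identity is known. */
static int identity_for(MPI_Op op, int *ident)
{
    if (op == MPI_SUM)  { *ident = 0;       return 1; }
    if (op == MPI_PROD) { *ident = 1;       return 1; }
    if (op == MPI_MAX)  { *ident = INT_MIN; return 1; }
    if (op == MPI_MIN)  { *ident = INT_MAX; return 1; }
    /* For user-defined operations (or MPI_MINLOC/MAXLOC, other
     * datatypes, ...) no identity is known to the library, so "the gap
     * contributes a neutral value" is not definable in general. */
    return 0;
}
```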

abouteiller commented 7 years ago

@dholmes-epcc-ed-ac-uk You can re-read the working group minutes from the previous ill-fated RTS proposal from 5/6 years ago. It investigated these ideas of replace/blank modes heavily, and it was found that this approach, although seductive from a mile-high perspective, had holes everywhere (pun intended) once you scratch the surface.

To name a few: you have to reconcile communicator names between respawned and surviving processes; hardware matching is difficult (possibly impossible) because distributed race conditions make messages from previous "epochs" in the comm continue to be delivered for some time; collective operations have to be reimplemented and are going to be more expensive; etc.

I obviously strongly disagree with the way @schulzm frames the issue. The issue (as illustrated by @wesbland above) has a known solution that has been deployed in practice in developing the X10 programming language FT support. This solution (the one with iRecv from any source on an over-spanning communicator) is effective and bug-free. I can however concede the point that standardizing a solution to that problem (rather than letting users deal with it on their own in a more ad-hoc fashion) with some new interface can help foster a more interoperable environment for libraries that are not part of the same software package, and I think that at this point we should investigate this direction.

Obviously, feedback and participation in the WG to follow through are most welcome!

schulzm commented 7 years ago

@wesbland First of all, thanks for the extensive summary, I think this hits it fairly well. Only one clarification: The part of "In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK)." is not quite what I had in mind: I suggested that the combination of synchronous notification (as in, one has to actively call a particular set of operations on the failing communicator, instead of being interrupted somehow) and collective recovery (all processes have to participate, as in MPI_COMM_SHRINK) is the main problem.

Regarding "However, it also has potential downsides when compared to current ULFM, primarily that it enforces a particular recovery model which the user might not want" - I agree, this is a concern and we would have to look at it. It seems to me, though (in the short time we had to think about this), that this could be alleviated in a composable way and on a per-communicator basis. For each communicator, the user could decide whether they want automatic shrinking or whether they are OK with leaving gaps, which then would avoid triggering any recovery operation (and should even be lower cost than the current ULFM scheme?). The former would be targeted (probably) at end users, especially those wanting collectives, while the latter could be a reasonable solution for runtimes that manage the process space themselves (like the mentioned X10 runtime).

One additional comment regarding: "This solution is workable for applications that can continue with fewer processes, but it still kludgy for apps that can't recover in that way easily" - yes, that is true. However, ReInit-like solutions (at least the ones we looked at) already rely on the asynchronous notification scheme (and have to). Having a shrinking solution (which, I agree, other user groups can use or even need) based on the same notification concept has a much higher chance of being combinable towards a compatible solution that supports both. On the other hand, having two different notification schemes in MPI (one for shrinking and one for a ReInit style, which we will need as well, since a majority of our current apps are that way) is likely to clash no matter what we do.

wesbland commented 7 years ago

I suggested that the combination of synchronous notification (as in, one has to actively call a particular set of operations on the failing communicator, instead of being interrupted somehow) and collective recovery (all processes have to participate, as in MPI_COMM_SHRINK) is the main problem.

Apologies. That's what I was trying to convey.

It seems to me, though (in the short time we had to think about this), that this could be alleviated in a composable way and on a per-communicator basis.

I completely agree here. That's what I was trying to get at in the conversation with @dholmes-epcc-ed-ac-uk. That seems like something that could interoperate more safely than completely changing the model. I think there are other things that might be able to do something with the non-shrinking model. Perhaps libraries such as Fenix from @marcgamell?

Having a shrinking solution (which, I agree, other user groups can use or even need) based on the same notification concept has a much higher chance of being combinable towards a compatible solution that supports both.

I'm not sure what you mean here. I would think that having an automatically shrinking (or not) solution might be worse because it doesn't give you the clean way of jumping to a clean state the way reinit does. It would still have the problems that reinit initially faced with ULFM, namely that all communicators, windows, files, requests, etc. would need to be tracked by the application and torn down in an error handler. Am I missing something?

dholmes-epcc-ed-ac-uk commented 7 years ago

@wesbland & @abouteiller thanks for the summary of the discussion at this meeting and the references to previous discussions. I'll take a look at RTS soon.

You used the term "non-shrinking" to describe the "leave gaps" approach. My understanding of that portion of the discussion was a little different. The communicator would not shrink, in that it contains the same number of processes (some failed and some non-failed), but would shrink, in that only non-failed processes would be expected to participate. There is no expectation or intent to replace failed processes with non-failed ones in this communicator. Therefore, "non-shrinking" is an incorrect characterisation; the active portion of the communicator does shrink.

@wesbland MPI_REDUCE with failed processes: failed processes do not participate. The comm size for the operation is the number of non-failed processes. The operation may fail with MPI_ERR_PROC_FAILED if a new (non-acked) process failure is detected during the operation (i.e. it happened before the operation completed).

MPI_ALLTOALL with gaps: failed processes do not participate. The input/output buffers would be sized according to the total number of ranks, but the array elements related to failed processes would be unaffected by the operation. Compare with MPI_SEND/MPI_RECV to/from MPI_PROC_NULL, which succeed but do not read/write data from/into the user buffer. Basically, treat failed processes exactly like MPI_PROC_NULL by extending semantics already present elsewhere in the MPI Standard.
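
The MPI_PROC_NULL semantics being referenced are already in the standard; a small reminder sketch (nothing ULFM-specific is assumed here):

```c
#include <mpi.h>

/* Reminder of existing MPI_PROC_NULL semantics: communication with
 * MPI_PROC_NULL succeeds immediately and transfers no data. */
void halo_exchange(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    double sendval = 1.0, recvval = 0.0;
    MPI_Sendrecv(&sendval, 1, MPI_DOUBLE, right, 0,
                 &recvval, 1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);
    /* When left == MPI_PROC_NULL the receive succeeds but does not
     * write into recvval; the suggestion above is to give a failed
     * rank exactly these semantics in the "leave gaps" model. */
}
```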

@abouteiller Reconcile communicator names between re-spawned and surviving processes: no processes are re-spawned, so there is no issue here.

Hardware matching is difficult (possibly impossible) as distributed race conditions make messages from previous "epochs" in the comm continue to be delivered for some time: matching can continue without interruption or change because no context id has changed and no ranks have changed. There is no "epoch" concept.

Collective operations have to be reimplemented and are going to be more expensive: agreed, this is my main concern regarding this approach. A tree-based collective, for example, would need to rebuild the tree (a distributed agreement, in the worst case). Some topology structures are easier to repair than others. In many cases, if process X can calculate its parent/children processes for a particular topology (as a local operation) then it can calculate their parent/children processes too (also as a local operation). So, it could figure out (locally) how to skip over a failed process. Forcing a higher cost, even for the non-failure usage, is going to be a difficult sell. Alternatively, build the initial tree/topology during communicator creation (known to be expensive) and rebuild it during failure recovery (known to be expensive), otherwise assume that the current topology is usable (normal usage has no performance degradation).
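
A hedged sketch of the "skip over a failed process locally" idea for a simple binary tree; is_failed() is an assumed local predicate (e.g. membership in the locally known failed group), not an existing API:

```c
#include <mpi.h>

/* Parent of 'rank' in a binary reduction/broadcast tree rooted at 0;
 * a purely local computation any process can perform for any rank. */
static int tree_parent(int rank)
{
    return (rank <= 0) ? MPI_PROC_NULL : (rank - 1) / 2;
}

/* A process skips a failed parent by climbing to the grandparent --
 * no communication or agreement is needed for this repair step.  The
 * symmetric computation lets an ancestor figure out, also locally,
 * which grandchildren it adopts. */
static int live_parent(int rank, int (*is_failed)(int))
{
    int p = tree_parent(rank);
    while (p != MPI_PROC_NULL && is_failed(p))
        p = tree_parent(p);
    return p;   /* MPI_PROC_NULL if the rank is (or becomes) the root */
}
```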

@schulzm The ReInit recovery could be built on top of the "leave gaps" thing by jumping back to the Init point whenever MPI_ERR_PROC_FAILED was raised. However, in order to guarantee that all processes noticed this and did the same, the scope of the revoke-like notification should be expanded to any MPI function or made into an interrupt. If one process has jumped back to Init (or is about to), that involves destroying all communicators (+windows+files+requests+etc). There is no longer any point in letting other processes use any MPI object, even if it were safe for them to do so.

On the other hand, the current notification mechanism (revoke) notifies that all processes in the affected communicator are failed, which is incorrect. The scope could be narrowed to a notification of particular process failures. A ReInit recovery model would react the same way because even a single process failure would cause a jump back to Init. Finer-grained recovery would be possible, though: continuing with the same communicator (now 'sparse' or 'with gaps', with fewer non-failed processes) would be the default; spawning new processes and creating a new communicator that is the same size as the original before the failure would be the user's responsibility (using MPI_COMM_SPAWN and MPI_INTERCOMM_MERGE, exactly as in the ULFM examples of such recovery).

wesbland commented 7 years ago

@dholmes-epcc-ed-ac-uk I think I'd have to start trying to write text for the collectives before deciding if it were really going to be that hard. In principle, I think we agree what the "right thing" to do here would be as long as we can figure out the right words. You're right that I was conflating the two terms. I did mean "leave gaps" in the same way you did.

wesbland commented 6 years ago

Updated PDF for February/March 2018 Reading:

ulfm.pdf

abouteiller commented 3 years ago

New version: mpi40-ulfm-20200921.pdf

This is a 'parity' update to port the existing ULFM text over to mpi-4.x, adding compatibility with the new error handling, and pythonization.

abouteiller commented 2 years ago

New version:

ft-annotated-with-ackfailed.pdf

This is an update that substitutes the 'ack/get_acked' functions with 'get_failed/ack_failed' functions, as we decided during the last forum meeting.

Experimental version (for comment):

ft-annotated-with-modes.pdf

This adds a number of new error reporting modes that permit loosely synchronized and implicit recovery modes. This has been available as a branch on the wg-ft repository, just duplicating here so it is accessible to a wider audience.