Motivation

The existing text is unclear about which error handler is called when an exception is raised in an operation such as `MPI_WAITALL`, where many requests are processed at once, some associated with the same communicator and some with different communicators. We need to specify more precisely which error handlers are triggered and how many times. The specification of `MPI_ERR_PENDING` is also vague: it does not say whether an `MPI_ERR_PENDING` status causes the error handler to be called.

Proposed Solution

In Section 8.3 “Error Handling”, page 342, lines 18-24, we should specify that each error handler is called once per communicator. The alternative was to call the error handler once per request, which is less useful than it sounds because the request object is not passed to the error handler, so the user has no good way to associate a request with the reason the handler was invoked. I propose adding the highlighted text (the second sentence below) on line 20:

The specified error handling routine will be used for any MPI exception that occurs during a call to MPI for the respective object. For MPI calls which operate on multiple requests, the error handling routine will be called once per communication object. MPI calls that are not related to any objects are considered to be attached to the communicator MPI_COMM_WORLD.

In Section 3.7 “Nonblocking Communication”, page 59, line 35, it is not clear that MPI_ERR_PENDING is only an error code and will not trigger the error handler associated with the request. I therefore propose adding the following text on line 38:

_Advice to users. When returning MPI_ERR_IN_STATUS, the error handler for the communicator associated with a request object will only be called when the error code is not set to MPI_SUCCESS or MPI_ERR_PENDING. (End of advice to users.)_

Impact

This may require implementors to change their existing error handling code to ensure handlers are only called once. In MPICH, the behavior was to call only the first error handler that was triggered; any other requests would not have their error handlers called.

This should not cause a backward-compatibility issue, however, because this behavior was previously undefined. Users had no standard behavior to rely on.

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-01-21 07:34:28 -0600

The proposed text has the side effect of forcing the error handler to be called for requests whose status is set to MPI_SUCCESS or MPI_ERR_PENDING, which is undesired.

mpiforumbot commented 8 years ago

Originally by bouteill on 2015-01-21 07:38:11 -0600

About ERR_PENDING: should it be an advice to implementors, or should it be normative text?

mpiforumbot commented 8 years ago

Originally by dholmes on 2015-01-22 05:04:47 -0600

For MPI calls which operate on multiple requests, the error handling routine will be called once per communication object.

What is the relationship between a request and a communication object here? Are they identical? Could this be interpreted as "once per request"?

Which error code will be given to the error-handler if multiple requests for the same communication object fail with different errors?

The error-handler could provide information about the particular request that caused the error using the varargs parameters. How this information is formatted is currently implementation dependent, e.g. it could be the index in the request array or a pointer to the request itself.

Mandating fewer instantiations of the error-handler than the number of MPI exceptions does not seem like a good software engineering approach unless there is a way that a single instantiation can be given responsibility for multiple MPI exceptions.

In the absence of FT, MPI 3.0 p342 lines 43-48 make it very clear that the only dependable/portable behaviour for an error-handler is graceful abort. In this case, the second instantiation of an error-handler will never actually happen.

This text forces the MPI library to accumulate errors until all the requests have been processed before calling any error-handlers. If the first request fails, all others for the same communication object must be attempted in case one of them fails as well. Alternatively, the statuses for all others with the same communication object could be set to MPI_ERR_PENDING and the one error-handler could be called (once) and the MPI_WAITALL could return MPI_ERR_IN_STATUS. This suggests text like:

_For MPI calls which operate on multiple requests, the error handling routine associated with the first MPI exception will be called exactly once. No other error-handlers will be called. The statuses for all successfully completed requests will be set to MPI_SUCCESS, the statuses for all incomplete requests will be set to MPI_ERR_PENDING and the status for the one request that raised the MPI exception will be set to the appropriate MPI error code._
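One possible reading of this suggested text can be modelled in plain C. This is only a sketch, not MPI code: the `MODEL_*` constants, the `model_request` and `model_result` types, and the function name are all hypothetical, and the model assumes that requests are processed in array order and that nothing after the first exception is attempted.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in codes for this model only; real values come from <mpi.h>.
 * Any outcome >= 100 is treated as an ordinary error code. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_PENDING = 1, MODEL_ERR_IN_STATUS = 2 };

typedef struct {
    int comm_id;  /* communicator the request belongs to (hypothetical id >= 0) */
    int outcome;  /* MODEL_SUCCESS, or an error code >= 100 */
} model_request;

typedef struct {
    int return_code;   /* value the multi-request call would return */
    int handler_comm;  /* communicator whose handler ran, or -1 if none ran */
} model_result;

/* "First exception only" rule: the handler of the first failing request's
 * communicator runs exactly once; every later request is reported as pending. */
static model_result model_first_exception(const model_request *reqs,
                                          int *status_error, size_t n)
{
    model_result r = { MODEL_SUCCESS, -1 };
    for (size_t i = 0; i < n; i++) {
        assert(reqs[i].comm_id >= 0);
        if (r.handler_comm != -1) {
            /* An exception was already raised: assumed not attempted. */
            status_error[i] = MODEL_ERR_PENDING;
        } else if (reqs[i].outcome == MODEL_SUCCESS) {
            status_error[i] = MODEL_SUCCESS;
        } else {
            /* First exception: record its code and the one handler call. */
            status_error[i] = reqs[i].outcome;
            r.handler_comm = reqs[i].comm_id;
            r.return_code = MODEL_ERR_IN_STATUS;
        }
    }
    return r;
}
```

With four requests where the second one fails with code 100 on communicator 1, the model reports `MODEL_ERR_IN_STATUS`, a single handler call for communicator 1, and statuses of success, 100, pending, pending.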

mpiforumbot commented 8 years ago

Originally by @wbland on 2015-02-03 13:14:08 -0600

You're right. The text there should be more specific. Perhaps something like:

... For MPI calls which operate on multiple requests, the error handling routine will be called once per communication object for which an associated request raises an exception other than MPI_ERR_PENDING. ...
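As a rough illustration of that wording, the following plain C model counts how many handler invocations the rule would produce for a set of request outcomes. It is not MPI code: the `MODEL_*` constants, the `model_request` type, and `count_handler_calls` are hypothetical stand-ins, and the model assumes each request belongs to exactly one communicator identified by a small integer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in error codes for this model only; real values come from <mpi.h>. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_PENDING = 1, MODEL_ERR_OTHER = 2 };

typedef struct {
    int comm_id;  /* communicator the request belongs to (hypothetical id) */
    int error;    /* error code placed in the request's status */
} model_request;

/* Returns the number of handler invocations under the "once per
 * communicator" rule and sets called[c] = 1 for every communicator whose
 * handler runs. called must have room for ncomms entries. */
static int count_handler_calls(const model_request *reqs, size_t n,
                               int *called, int ncomms)
{
    int calls = 0;
    for (int c = 0; c < ncomms; c++)
        called[c] = 0;

    for (size_t i = 0; i < n; i++) {
        int c = reqs[i].comm_id;
        assert(c >= 0 && c < ncomms);

        /* Success and pending statuses never trigger a handler. */
        if (reqs[i].error == MODEL_SUCCESS ||
            reqs[i].error == MODEL_ERR_PENDING)
            continue;

        /* Only the first real error on a communicator counts. */
        if (!called[c]) {
            called[c] = 1;
            calls++;
        }
    }
    return calls;
}
```

For five requests where communicator 0 has two failures, communicator 1 has one pending and one failed request, and communicator 2 has only a success, the model yields two handler calls, one each for communicators 0 and 1.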

Replying to bouteill:

The proposed text has the side effect of forcing the error handler to be called for requests whose status is set to MPI_SUCCESS or MPI_ERR_PENDING, which is undesired.

mpiforumbot commented 8 years ago

Originally by @wbland on 2015-02-03 13:31:40 -0600

Replying to dholmes:

For MPI calls which operate on multiple requests, the error handling routine will be called once per communication object.

What is the relationship between a request and a communication object here? Are they identical? Could this be interpreted as "once per request"?

My mistake. I thought "communication object" was well-defined to mean communicators, windows, and files. I can't find what I based that assumption on. So s/communication object/communicator, window, or file/g.

Which error code will be given to the error-handler if multiple requests for the same communication object fail with different errors?

The error-handler could provide information about the particular request that caused the error using the varargs parameters. How this information is formatted is currently implementation dependent, e.g. it could be the index in the request array or a pointer to the request itself.

Mandating fewer instantiations of the error-handler than the number of MPI exceptions does not seem like a good software engineering approach unless there is a way that a single instantiation can be given responsibility for multiple MPI exceptions.

In the absence of FT, MPI 3.0 p342 lines 43-48 make it very clear that the only dependable/portable behaviour for an error-handler is graceful abort. In this case, the second instantiation of an error-handler will never actually happen.

This text forces the MPI library to accumulate errors until all the requests have been processed before calling any error-handlers. If the first request fails, all others for the same communication object must be attempted in case one of them fails as well. Alternatively, the statuses for all others with the same communication object could be set to MPI_ERR_PENDING and the one error-handler could be called (once) and the MPI_WAITALL could return MPI_ERR_IN_STATUS. This suggests text like:

For MPI calls which operate on multiple requests, the error handling routine associated with the first MPI exception will be called exactly once. No other error-handlers will be called. The statuses for all successfully completed requests will be set to MPI_SUCCESS, the statuses for all incomplete requests will be set to MPI_ERR_PENDING and the status for the one request that raised the MPI exception will be set to the appropriate MPI error code.

I don't have a problem with that as the solution either. The goal here is to create something that makes the behavior slightly more well-defined. It's hard to reason about the state of the requests if some arbitrary subset of them will trigger an error handler when others may not.

However, the version you propose (allowing only a single error per call) would also force an implementation to mask errors if more than one is detected in the same pass through the progress engine. The internal request object would need to keep track of the error information and return it the next time the request is passed to a completion call. I don't think that's a deal-breaker, just something that needs to be noted.
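The bookkeeping described here could look roughly like the plain C model below. It is a sketch under stated assumptions, not a description of MPICH or any other implementation: the `MODEL_*` constants, the `model_request` fields, and `model_wait_once` are hypothetical, and a request that was already reported is simply given a success status, similar to an empty status.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in codes for this model only; real values come from <mpi.h>.
 * Any detected_error >= 100 is treated as an ordinary error code. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_PENDING = 1, MODEL_ERR_IN_STATUS = 2 };

typedef struct {
    int detected_error;  /* error found by the progress engine, or MODEL_SUCCESS */
    int reported;        /* nonzero once the final status reached the user */
} model_request;

/* One wait-all style call that surfaces at most one error. Further errors
 * are masked as pending and stay stored in the request for a later call. */
static int model_wait_once(model_request *reqs, int *status_error, size_t n)
{
    int raised = 0;
    assert(reqs != NULL && status_error != NULL);

    for (size_t i = 0; i < n; i++) {
        if (reqs[i].reported) {
            /* Already handed back earlier: treated like an empty status. */
            status_error[i] = MODEL_SUCCESS;
        } else if (reqs[i].detected_error == MODEL_SUCCESS) {
            status_error[i] = MODEL_SUCCESS;
            reqs[i].reported = 1;
        } else if (!raised) {
            /* The single error this call is allowed to surface. */
            status_error[i] = reqs[i].detected_error;
            reqs[i].reported = 1;
            raised = 1;
        } else {
            /* Masked: the stored error is kept for the next call. */
            status_error[i] = MODEL_ERR_PENDING;
        }
    }
    return raised ? MODEL_ERR_IN_STATUS : MODEL_SUCCESS;
}
```

In this model, two requests that fail in the same pass are reported over two successive calls: the first call returns the first error and marks the second request as pending, and the second call returns the stored error.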

mpiforumbot commented 8 years ago

Originally by @wbland on 2015-02-03 13:35:15 -0600

Yes, this could/should be normative text.

Replying to bouteill:

About ERR_PENDING: should it be an advice to implementors, or should it be normative text?

mpi-forum / mpi-forum-historic

Clarify behavior of multiple error handlers in a single call #472
