Open devreal opened 1 year ago
This also came up recently in https://github.com/pmodels/mpich/pull/6751. We would support this change. cc @hzhou.
@devreal It is incorrect that MPI_Testsome
does set the MPI_ERROR
field. It only sets the MPI_ERROR
field if it returns MPI_ERR_IN_STATUS
. If it returns MPI_SUCCESS
, the MPI_ERROR
is unset. Similarly MPI_Testall/MPI_Waitsome/MPI_Waitall
. Now we also need cover MPI_Request_get_status_{all,some}
.
I think the rationale -
Rationale. The error field in status is not needed for calls that return only one status, such as MPI_WAIT, since that would only duplicate the information returned by the function itself. The current design avoids the additional overhead of setting it, in such cases. The field is needed for calls that return multiple statuses, since each request may have had a different failure. (End of rationale.)
makes sense. Users in general don't need to check the status MPI_ERROR
field and it is unnecessary to always require implementations to set MPI_ERROR
field, in the same sense that implementations are not required to set the MPI_SOURCE
or MPI_TAG
field in a send request status.
However, I am strongly against the interpretation that implementations must not touch the MPI_ERROR
field unless returning MPI_ERR_IN_STATUS
-
The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.
The if and only if seems to force this interpretation.
In reality, it is often simpler for implementations to blindly update the entire statuses struct. To untouch certain field, e.g. MPI_ERROR
field, often forces implementations to add more code and instructions to preserve the field. This is, IMO, contrary to the rationale. IMO, the rationale is supposed to give implementation flexibility rather than adding restrictions without any rationale on the user's part.
Thus, I am against requiring implementations to always set the MPI_ERROR
field. But I am also against requiring implementations to if and only if set the field when returning MPI_ERR_IN_STATUS
. My suggestion is to leave the MPI_ERROR
field undefined unless MPI_ERR_IN_STATUS
is returned.
@devreal It is incorrect that MPI_Testsome does set the MPI_ERROR field. It only sets the MPI_ERROR field if it returns MPI_ERR_IN_STATUS. If it returns MPI_SUCCESS, the MPI_ERROR is unset.
I assumed the error case:
if the appropriate error handler is set
The current design avoids the additional overhead of setting it, in such cases.
Maybe this was true 20 years ago, it certainly isn't today. The potential saving (one load, one store) from allowing it is irrelevant on today's architectures so there is no rationale for allowing such "optimizations" in a public API.
My suggestion is to leave the MPI_ERROR field undefined unless MPI_ERR_IN_STATUS is returned.
That is at the heart of this issue: we should never leaves values undefined in our return values. They can easily cause undefined behavior and I hope we can agree that that is a bad idea. The trade-off between avoiding undefined behavior and saving two instructions should be clear.
"Premature optimization is the root of all evil."
Also #522
That is at the heart of this issue: we should never leaves values undefined in our return values. They can easily cause undefined behavior and I hope we can agree that that is a bad idea. The trade-off between avoiding undefined behavior and saving two instructions should be clear.
Is it only two instructions when MPI_{Wait,Test}all return MPI_SUCCESS and you require the implementation to set the MPI_ERROR field of every single status object to MPI_SUCCESS?
If we are so concerned about uninitialized data, shouldn't we make underflow illegal? Or should we require all buffers passed to receive functions to be initialized to ensure that users can never read uninitialized data?
No, because we have MPI_Get_count and the user is expected to determine how much of the buffer has been written, and not read more than that.
In the same manner, it is clear when the fields of status objects can/should be read. There is no serious threat from UB unless users write obviously incorrect code.
Is it only two instructions when MPI_{Wait,Test}all return MPI_SUCCESS and you require the implementation to set the MPI_ERROR field of every single status object to MPI_SUCCESS?
OK, one instruction per status, in the worst case. In the best case, the compiler is able to use wider move instructions if we don't have to poke holes into the status object. See OMPI for example: https://github.com/open-mpi/ompi/blob/main/ompi/request/request.h#L219 OMPI_COPY_STATUS
is called for every status on every request and we just distinguish whether the error field is copied or not (typically with a compile time constant so the branch in the macro is optimized out).
Or should we require all buffers passed to receive functions to be initialized to ensure that users can never read uninitialized data?
There is an important difference here: the buffer is updated the same way no matter how you send or receive your data. It doesn't matter whether you use MPI_Recv or MPI_Irecv, MPI_Send or MPI_Isend. The problem is that the resulting value of MPI_ERROR
is dependent on the procedure used to complete an operation. That's not true for message buffers.
In the same manner, it is clear when the fields of status objects can/should be read. There is no serious threat from UB unless users write obviously incorrect code.
I went through the standard to try and collect all the places where we define the semantics of MPI_ERROR
.
Section 3.2.5, at the top:
In C, status is a structure that contains three fields named MPI_SOURCE, MPI_TAG, and MPI_ERROR; the structure may contain additional fields. Thus, status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR contain the source, tag, and error code, respectively, of the received message.
Similar verbiage follows for Fortran. So the MPI_ERROR
field contains the error code of the received message?
Not so fast! Later in this section we say:
In general, message-passing calls do not modify the value of the error code field of status variables. This field may be updated only by the functions in Section 3.7.5 that return multiple statuses. The field is updated if and only if such function returns with an error code of
MPI_ERR_IN_STATUS
.
So this contradicts the first paragraph: status.MPI_ERROR
does not contain the status of the received message, unless there was an error in some tested/awaited request. But we still don't know exactly it is updated.
Section 3.7.3:
Error codes belonging to the error class
MPI_ERR_IN_STATUS
should be returned only by the MPI completion functions that take arrays of MPI_Status. For the functions that take a single MPI_Status argument, the error code is returned by the function, and the value of the MPI_ERROR field in the MPI_Status argument is undefined (see 3.2.5).
So now we say that the value of the MPI_ERROR
field is undefined after a call to a function that takes a single status argument. It says the value of the field is undefined, which is different from "updated if and only if" in the previous paragraph: are implementations allowed to change the field? "Undefined" suggests they can, "updated if and only if" says they cannot. But we still don't know how the field is updated if it is updated.
Section 3.7.5 (MPI_WAITALL
):
When one or more of the communication operations completed by a call to MPI_WAITALL fail, it is desirable to return specific information on each communication. The function MPI_WAITALL will return in such case the error code MPI_ERR_IN_STATUS and will set the error field of each status to a specific error code. This code will be MPI_SUCCESS, if the specific communication completed; it will be another specific error code, if it failed; or it can be MPI_ERR_PENDING if it has neither failed nor completed. The function MPI_WAITALL will return MPI_SUCCESS if no request had an error, or will return another error code if it failed for other reasons (such as invalid arguments). In such cases, it will not update the error fields of the statuses.
Now we know. There is more text explaining these rules in different colors for MPI_WAITSOME
and in Section 9.4.
What I am proposing is that we cut out all the special cases littered throughout the standard and stick to the semantics at the very top of 3.2.5: the MPI_ERROR
field contains the error code of the received message.
Save some trees!
Just so we're talking about what counts in terms of overhead, it isn't instructions - its memory references, especially writes. Getting a single byte out to memory, while it might happen in the background, can impact subsequent operations.
The status structure in OMPI is 24B (12B public, 12B private), The error field is 4B, or 16% of the total structure size. In each function call, we're writing several kB of data to memory. Most of that (and likely all of the status) will never make its way out of a cache and then its part of a cache line anyway. The overhead of storing 4 additional B is negligible. From where I stand, the discussion on overhead in this context is moot and we should focus on usability here.
If an application cares about ub, it should just initialize the error field before passing the status to MPI.
MPI_Status s = {.MPI_ERROR=MPI_SUCCESS};
should even initialize the whole object.
We only need to make sure then that MPI will never copy uninitiated memory to this field.
If an application cares about ub, it should just initialize the error field before passing the status to MPI.
MPI_Status s = {.MPI_ERROR=MPI_SUCCESS};
should even initialize the whole object.
That's not the problem. See my initial post. The problem is that some procedure calls set the field and others don't. Of course, to avoid UB you just don't make the mistake of reading that field after a call to MPI_Wait
or MPI_Waitany
. Good ol' "just don't pull the trigger on that footgun" advice :)
It is a strawman arguing for the instruction cost. The key points are:
status.MPI_ERROR
other than when aggregate completion calls return MPI_ERR_IN_STATUS
.MPI_ERROR
field when it is not returning MPI_ERR_IN_STATUS
is an implementation detail, since it does not affect the correctness of a correct user program.The cost argument is ultimately an implementation issue. For some implementations, it may be more efficient or easier always to set the status fields. But for some implementations that may be a cost or concern, say, due to some weird internal design. Anyway, until we determine it is necessary, why should the standard restrict the implementation to one way or another?
Of course, to avoid UB you just don't make the mistake of reading that field after a call to
MPI_Wait
orMPI_Waitany
. Good ol' "just don't pull the trigger on that footgun" advice :)
How is this different from "don't read the receive buffer until the operation is complete"? The specification states if and when one can read various parts of memory. We say when the user can and cannot read status.MPI_ERROR
.
The best solution for users who are confused is to just use MPI_STATUS(ES)_IGNORE
unless (1) they are using a bulk completion operation and errors will not stop program execution, (2) underflow is possible and they intend to call MPI_Get_count
or similar, or (3) they are using wildcards and .MPI_SOURCE
or .MPI_TAG
are needed.
Given that very few users allow errors to return in the first place, the hand-wringing over this issue seems undeserved. Writing error-aware code is nontrivial and it is not too much to ask to require that users only read .MPI_ERROR
when it is appropriate.
If we change the specification to allow but not require implementations to set the MPI_ERROR
field then the people who believe it's important can implement it and prove that it's trivial and free. That evidence would justify changing it to a requirement in the future.
Maybe I missed it, but any change here is backwards-incompatible, although I doubt any user code depends on the current behavior.
Regarding performance, I think some folks need a refresher on CPU microarchitecture.
When one stores any number of bytes to memory with an x86 store instruction, it causes RFO (request for ownership), which fetches the 64B lines (usually 1) containing those bytes. This means that the 4B store generates 64B of memory traffic from memory into cache and consumes a cache line, regardless of when the line is evicted. It also consumes an entry in one or more microarchitectural state buffers, such as the store buffer and - if the memory page was not already translated - a TLB entry. In the worst case, when the line is cold, the store will take 50+ nanoseconds (100-200 instructions), so we must hope that our CPU can have a sufficient number of instructions, stores and TLB walkers in-flight to hide this latency.
Note that in the case of MPI RMA request-based operations, MPI_ERROR
is the only valid field. If we unconditionally require MPI_ERROR
to be set, then we are forcing the implementation of bulk request completion operations to add the aforementioned overheads to what is otherwise a no-op in most cases. Of course, MPI_STATUS(ES)_IGNORE
should be used in most cases, but if I am writing an error-aware library, I can't do that, so now I pay the price of MPI_ERROR_IN_STATUS
every time, rather than only when I need it.
Do implementations mitigate this by checking whether errors return and only initializing the MPI_ERROR
field in that case? Is this a trivial change or not? I don't recall seeing any code like this in MPICH or OMPI. Is it possible to have a single flag for this or does one need to check multiple communications and/or sessions?
If we add the branch on can-errors-return to hide the aforementioned 4B store overheads, modern CPUs will still speculatively execute both sides of the branch, so we can still end up taking a TLB and cache miss anyways.
@jeffhammond In OMPI we had a lax interpretation of the The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS
requirement. We kept it simple (aka setting all fields) until very recently (2021, issue PR) basically until someone came up with a test from MPICH that was failing in OMPI and we had to fix
it.
Such a trivial matter resulting in such a waste of time. We can argue on how users shall be doing their coding, but if we put ourselves in their place for a little bit we would realize how ridiculous all this debate is. Writing correct distributed code is already hard enough without forcing the users to learn all the convoluted cases in which they should or should not test a field in a structure the user had explicitly requested to be returned.
My position is that we should relax the specification to allow both the OMPI 2020 and MPICH implementations. I'm sorry that you had to fix that, but it was a choice. You could have treated that trivial noncompliance the same as MPI_THREAD_MULTIPLE and RMA were treated for years.
Is it your position that we should now change the spec to force MPICH to change their implementation (and break backwards-compatibility of MPI)?
What is your proposal to fix the undefined situation with RMA and NBC requests, so that users do not have to learn that .MPI_SOURCE and .MPI_TAG can't be read in those cases?
It was a indeed a choice OMPI devs made to abide by the standard. Based on the fact that nobody complained about this change in a little over 2 years, I don't think we affected any codes because as you mentionned above 99.*% of users rely on the default error handler for MPI applications.
My position is "make it simple and consistent for the users", and if some implementations have to change their ways so be it. What @devreal proposed above seems very sensible, initialize the MPI_ERROR field all the time, and let the users decide how to use it.
A reasonable choice might be to explicitly state that the value of the field is undefined in these cases. That would allow both implementations and at the same time shout out that accessing the field is ub, even if the value was initialized before the call.
The argument of not returning uninitialized memory to the application would reach beyond the receive buffer, just think of the index array in waitsome.
initialize the MPI_ERROR field all the time, and let the users decide how to use it.
I will continue to object to this, because it forces implementations to do completely unnecessary operations in order to accommodate user illiteracy.
The majority of users don't handle errors at all. You are foisting backwards-compatibility, implementation changes, and runtime overhead onto every usage of status objects, in order to support illiterate users trying to write error-aware code.
How many MPI users do you know who write error-aware MPI code that is correct except for its use of the .MPI_ERROR field when the return code is not MPI_ERROR_IN_STATUS?
A reasonable choice might be to explicitly state that the value of the field is undefined in these cases.
This is what I want.
We should clean up the text to say very clearly that .MPI_ERROR is only relevant when the return code is MPI_ERROR_IN_STATUS. This is not a hard rule to learn.
This is what I want.
We should clean up the text to say very clearly that .MPI_ERROR is only relevant when the return code is MPI_ERROR_IN_STATUS. This is not a hard rule to learn.
This is what I would like to see as well.
If your proposal is to replace
The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS
with
The field is only relevant when functions return with an error code of MPI_ERR_IN_STATUS
then we are in complete agreement.
In OMPI we had a lax interpretation of the
The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS
requirement. We kept it simple (aka setting all fields) until very recently (2021, issue PR) basically until someone came up with a test from MPICH that was failing in OMPI and we had tofix
it.
I would love to remove those tests from MPICH testsuite. I don't understand why we went through so much trouble engineering those tests and fixing the code back then instead of protesting.
FWIW, RMA requests are the only requests where we do not have to touch the the status object if one is provided. For all other requests, the status object must at least contain some information on whether the request was cancelled (even in cases where cancellation is not allowed, like collective operations). That's a rather narrow justification for leaving the error field unset...
Here is a nugget from Section 3.7.2:
An empty status is a status that is set to return
tag = MPI_ANY_TAG
,source = MPI_ANY_SOURCE
,error = MPI_SUCCESS
, and is also internally configured so that calls toMPI_GET_COUNT
andMPI_GET_ELEMENTS
returncount = 0
andMPI_TEST_CANCELLED
returns false. We set a status variable to empty when the value returned by it is not significant. Status is set in this way so as to prevent errors due to accesses of stale information.
Based on the highlighted text, a user can reasonably assume that for all but receive requests the returned status is empty as defined here, i.e., has a well-defined content. After all, for all but receive requests the value of the status object is not significant if no error occurred.
@devreal This has been read and can still make it for MPI 4.2 on our current timeline. Did you want to bring it up for a vote at the December meeting?
Problem
Section 3.2.5 states:
I understand that for single-operation message passing calls (
MPI_RECV
for example) the error code is returned to the user, if the appropriate error handler is set. However, I would argue that the difference in behavior betweenMPI_Testany
(does not set theMPI_ERROR
field),MPI_Testsome
(does set theMPI_ERROR
field), andMPI_Test
(does not set theMPI_ERROR
field) is confusing for users and breaks with the principle of least surprise. Intentionally leaving a value undefined in some cases in a structure we return to the user can cause all sorts of issues if users are not careful. That's bad API design.Case in point: https://github.com/open-mpi/ompi/issues/12049
I should note that not all fields of the status are set in all cases. There are operations where
MPI_TAG
orMPI_SOURCE
have no meaning (e.g., RMA request-based operations). Leaving them undefined worries me less since this is a property of the operation and consistent across all test/wait functions. Every relevant MPI operation produces a return value so we should set the error field in every associated status object.Proposal
The
MPI_ERROR
field of the status should always be set, even if the error value is returned to the user directly (MPI_Test
, MPI_Recv`, ...).Changes to the Text
Strike the statement in Section 3.2.5 and add verbiage that the error field is always set.
Impact on Implementations
This could simplify implementations since they don't have to deal with the difference between single- and multi-request functions.
Impact on Users
Less surprises and exceptions to think about.
There may be applications out there that rely on the status not being set (i.e., carry information through the
MPI_ERROR
field across MPI function calls) but I expect the chances of that are very slim and it's bad design anyway. Such applications can easily be fixed by storing that information in a local variable instead.References and Pull Requests
Brought up in an Open MPI ticket: https://github.com/open-mpi/ompi/issues/12049