openshmem-org / specification

OpenSHMEM Application Programming Interface
http://www.openshmem.org

Reductions and NaN values #467

Open nspark opened 3 years ago

nspark commented 3 years ago

Description

Currently, the Specification does not define how NaN values are handled in reductions over floating-point types.

Per C18 §7.12.14-1, a NaN value is unordered with respect to a numeric value or another NaN. Consequently, it is not clear what the result of shmem_double_max_reduce or shmem_double_max_to_all should be in the presence of NaN values.

In C, the behavior of NaN values can be unintuitive at first; for example:

#include <math.h>
#define MAX(a, b) ((a) > (b) ? (a) : (b))

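MAX(1.0, NAN); /* evaluates to NaN: (1.0 > NAN) is false, so the second operand is selected */
MAX(NAN, 1.0); /* evaluates to 1.0: (NAN > 1.0) is false, so the second operand is selected */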

C provides fmax and fmin to handle these situations gracefully:

#include <math.h>
fmax(1.0, NAN); /* evaluates to 1.0 */
fmax(NAN, 1.0); /* evaluates to 1.0 */
fmax(NAN, NAN); /* evaluates to NaN */

In the tests I've performed on OpenSHMEM implementations readily accessible to me (certainly not all that exist), none handle min/max reductions correctly for NaN values. They did seem to handle sum reductions correctly.

Suggestions

Considerations

nspark commented 3 years ago

I expected MPI might have plenty to say about handling NaN values, but all I see is the following:

According to IEEE specifications, the “NaN” (not a number) is system dependent. It should not be interpreted within MPI as anything other than “NaN.”

Advice to implementors. The MPI treatment of “NaN” is similar to the approach used in XDR (see ftp://ds.internic.net/rfc/rfc1832.txt). (End of advice to implementors.)

nspark commented 3 years ago

To implementors/vendors: Are there performance concerns if the result (e.g., of a MAX reduction where some entries are NaN values) is implementation defined but required to be single-valued? That is, all PEs would be expected to return the same value. (This is not currently the case on all implementations.)
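As one illustration of what a single-valued rule could look like, the sketch below uses a combiner in which any NaN operand makes the result NaN. The function name max_propagate_nan is hypothetical, and this is only one possible choice (the fmax semantics of ignoring a NaN operand would be another); nothing here is proposed Specification text.

```c
#include <math.h>

/* Hypothetical combiner: if either operand is a NaN, the result is NaN;
 * otherwise the larger operand is returned. The NaN check does not depend
 * on operand order, unlike a bare (a > b ? a : b). */
static double max_propagate_nan(double a, double b) {
    if (isnan(a) || isnan(b))
        return NAN;
    return a > b ? a : b;
}
```

Because the NaN check is symmetric in its operands, pairwise combination with this function gives a NaN result whenever any input is a NaN, regardless of the order in which contributions are combined.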

jdinan commented 3 years ago

Not for MAX, but for an arithmetic operation like SUM, there could be an associativity requirement in order to ensure that all PEs get identical results.
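The associativity point can be seen without any NaN values at all. The following sketch, with a hypothetical helper named sum_depends_on_grouping, compares the two ways of grouping a three-term sum in double precision; it assumes IEEE 754 binary64 arithmetic with the default rounding mode.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true when (a + b) + c and a + (b + c)
 * round to different doubles. The volatile qualifiers force each sum to
 * be stored as a double rather than kept in extended precision. */
static bool sum_depends_on_grouping(double a, double b, double c) {
    volatile double left = (a + b) + c;
    volatile double right = a + (b + c);
    return left != right;
}
```

For a = 1e16, b = -1e16, c = 1.0, the left grouping gives 1.0 while the right grouping gives 0.0, because -1e16 + 1.0 rounds back to -1e16. A SUM reduction that combines contributions in a different order on different PEs could therefore return different values unless the combination order is fixed.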

nspark commented 3 years ago

One concern that has been raised (off issue, obviously) is that "proper" NaN behavior may require an additional collective (to test for NaNs), which is not desirable.

Again, in my (limited) testing, implementations handled the sum reduction properly in the face of NaN values. It was the max (and, presumably, min) reductions that did not necessarily return the same value on all PEs.