Inconsistent run-time (on x86_64) and compile-time folding in NaN production #61973

Open · wjristow opened this issue 1 year ago

wjristow commented 1 year ago

A customer of ours reported an inconvenience they hit due to the way we handle NaNs. They reported cases where at low optimization, expressions like:

  0.0f  *   std::numeric_limits<float>::infinity()

produced 0xffc00000 (for succinctness, say -nan), whereas with optimization, the result was 0x7fc00000 (say, +nan). The IEEE standard (IEEE Std 754™-2008) doesn't require any specific NaN representation, so both are legal. And hence by definition, this isn't a bug.

That said, it's a problem/annoyance for them, in that they intentionally initialize some floating-point values to a NaN by multiplying +0 times +inf. At low optimization, this results in 0xffc00000. But with optimization, it produces 0x7fc00000 (because of compile-time folding). They have testing tools that do bit-wise comparisons of results, and they want those bit-wise comparisons to match across optimization levels.
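
For concreteness, here is a small C++20 sketch (mine, not the customer's actual test harness) of the kind of bit-wise comparison involved. Whether the first product is actually folded depends on the optimization level and floating-point options, so treat it as illustrative:

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>
#include <limits>

int main() {
  // May be folded to a constant at compile time, depending on optimization
  // level and floating-point flags.
  float folded = 0.0f * std::numeric_limits<float>::infinity();

  // `volatile` keeps this multiplication out of the constant folder, so the
  // hardware computes it at run time.
  volatile float zero = 0.0f;
  float runtime = zero * std::numeric_limits<float>::infinity();

  std::printf("folded:  0x%08x\n",
              static_cast<unsigned>(std::bit_cast<std::uint32_t>(folded)));
  std::printf("runtime: 0x%08x\n",
              static_cast<unsigned>(std::bit_cast<std::uint32_t>(runtime)));
  return 0;
}
```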

I initially thought that we folded the product of +0 times +inf to +nan because we folded in an "intuitively sensible" way (in that we produced a NaN with the sign bit set according to the XOR of the sign bits of the factors). And I expected that -0 times +inf (or +0 times -inf) would fold to -nan. (And I thought the hardware behaved "strangely", in that it always produced a NaN with the sign-bit set -- I tried a handful of x86_64 targets, and they all produced the same negative NaN when the multiplication was done at run-time, rather than folded at compile-time; although I admit that I cannot find an x86_64 hardware spec that asserts the multiplication will do that.) But on experimenting, I found that when we fold these products at compile-time, we always fold them to +nan, irrespective of the sign bits of the factors.

In short, when optimization is enabled and so the following expressions are folded at compile-time:

    0.0f  *   std::numeric_limits<float>::infinity()
  (-0.0f) *   std::numeric_limits<float>::infinity()
    0.0f  * (-std::numeric_limits<float>::infinity())
  (-0.0f) * (-std::numeric_limits<float>::infinity())

they all produce 0x7fc00000 (+nan), but if they are computed at run-time, they all produce 0xffc00000 (-nan).

If this folding produced a NaN with the sign-bit equal to the XOR of the sign bits of the factors (that "intuitive" way), I would have suggested a workaround to the customer: initialize their values using the expression -0 * +inf. In that case, the -nan result would be produced with or without optimization (that is, whether folded to a constant at compile-time or produced by multiplying the factors at run-time). But since we always fold to +nan, that idea doesn't work.

Here is a test showing the Clang (trunk) result: https://godbolt.org/z/Es584o3eK

In that test-case, the compile-time computed results (that is, when compile-time folding is done) are all +nan, and the run-time results are all -nan.

As an experiment, I tried the same test-case with the Microsoft compiler (Version 19.29.30146 for x64), and it produced -nan for all the products (folded at compile-time, or computed at run-time). This is the case with and without optimization (/Od and /O2).

FTR, I also tried GCC, and different versions had different behavior. So there isn't much of a model that I can derive from that.


In summary, we could:

  1. Do nothing (leaving this "inconsistent" behavior between values computed at compile-time vs at run-time, and hence often different behavior at different optimization levels).
  2. Change our folding of these sorts of cases to produce a -nan (mimicking the Microsoft behavior, and being consistent across optimization levels (at least for hardware that produces -nan for these products)).
  3. Change our folding to make the sign-bit of the NaN be the XOR of the sign bits of the two factors (creating the opportunity to write code that folds at compile-time in a way that matches the hardware behavior, and hence is handled consistently across optimization levels; but doesn't mimic the Microsoft behavior). See the sketch after this list.
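
As a rough illustration of option 3, here is a sketch of what that folding rule could look like in terms of APFloat. This is not actual LLVM constant-folding code and the helper name is made up; it only shows the sign-bit rule being proposed:

```cpp
#include "llvm/ADT/APFloat.h"

using namespace llvm;

// Hypothetical helper illustrating option 3: if the multiplication itself
// creates a NaN (neither operand was already NaN), give the folded QNaN a
// sign bit equal to the XOR of the operands' sign bits.
static APFloat foldMulXorSignNaN(const APFloat &LHS, const APFloat &RHS) {
  APFloat Result = LHS;
  Result.multiply(RHS, APFloat::rmNearestTiesToEven);
  if (Result.isNaN() && !LHS.isNaN() && !RHS.isNaN()) {
    bool Negative = LHS.isNegative() != RHS.isNegative();
    return APFloat::getQNaN(LHS.getSemantics(), Negative);
  }
  return Result;
}
```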

What do people think? I lean toward option 2, although I can see an argument for option 3 (especially if other hardware always produces a +nan for cases where the product is computed at run-time). There's also an argument for option 1, given that the IEEE Standard doesn't specify the behavior (and so users cannot safely write code that depends on a particular behavior).

wjristow commented 1 year ago

cc @rotateright @LebedevRI

efriedma-quic commented 1 year ago

I don't see any downside to making LLVM constant folding produce the same NaN the target would produce. (See also #60796.)

On the clang side, I'm a little concerned about potential ABI implications, but it looks like clang refuses to fold arithmetic on NaNs at the moment, so the effects would be limited, I think.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

phoebewang commented 1 year ago

> although I admit that I cannot find an x86_64 hardware spec that asserts the multiplication will do that.

SDM 4.8.3.7:

> 4.8.3.7 QNaN Floating-Point Indefinite
>
> For the floating-point data type encodings (single precision, double precision, and double extended precision), one unique encoding (a QNaN) is reserved for representing the special value QNaN floating-point indefinite. The x87 FPU and the Intel SSE/SSE2/SSE3/SSE4.1/AVX extensions return these indefinite values as responses to some masked floating-point exceptions. Table 4-3 shows the encoding used for the QNaN floating-point indefinite.

SDM 4.2.2, Table 4-3:

| Class | Sign | Biased Exponent | Integer | Fraction |
| --- | --- | --- | --- | --- |
| QNaN Floating-Point Indefinite | 1 | 11..11 | 1 | 10..00 |
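
To connect that encoding to the values reported above: for single precision, sign = 1, biased exponent = all ones, and a fraction whose most significant bit is 1 gives exactly 0xffc00000, the -nan observed at run-time. A small illustration (mine, not from the SDM):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // Assemble the single-precision "QNaN floating-point indefinite" pattern
  // from Table 4-3: sign = 1, exponent = all ones, fraction = 10..00.
  std::uint32_t sign = 1u << 31;         // 0x80000000
  std::uint32_t exponent = 0xFFu << 23;  // 0x7F800000
  std::uint32_t quiet_bit = 1u << 22;    // 0x00400000
  std::uint32_t qnan_indefinite = sign | exponent | quiet_bit;
  std::printf("0x%08x\n", static_cast<unsigned>(qnan_indefinite)); // 0xffc00000
  return 0;
}
```
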
nikic commented 1 year ago

> I don't see any downside to making LLVM constant folding produce the same NaN the target would produce. (See also #60796.)

How would constant folding know what the correct NaN to use is? Would this be a DataLayout property? I'm not really willing to make constant folding TTI-dependent. (Even just ensuring that the necessary code has access to DL will take some work, though we are in a pretty good position now that most constant expressions working on floats have been removed.)

efriedma-quic commented 1 year ago

The datalayout probably makes the most sense, yes. Although updating it is a bit painful...

wjristow commented 1 year ago

> although I admit that I cannot find an x86_64 hardware spec that asserts the multiplication will do that.

> SDM 4.8.3.7:

Thank you @phoebewang! Very happy to see that.

wjristow commented 1 year ago

Glad to see the general agreement here.

The one difficulty seems to be:

> The datalayout probably makes the most sense, yes. Although updating it is a bit painful...

That's an area I don't have experience with. The "a bit painful" remark scares me a bit. 😃 I don't have the bandwidth to look into this now, but I'll be happy to get back to it eventually. So if someone else wants to pick this up, that would be great.