lifting-bits / remill

Library for lifting machine code to LLVM bitcode
Apache License 2.0

More accurate floating-point rounding in X86 FMA extension instructions #168

Open mike-myers-tob opened 6 years ago

mike-myers-tob commented 6 years ago

The FMA extension instructions, like VFMADD213SD, perform a multiplication and an addition (or subtraction) of floating-point values, given several floating-point operands. When computing, for example, value2 * value3 + value1, the FMA instruction supposedly performs the multiplication as an "infinite precision intermediate result" before adding the last value and then rounding once to store the result. We have been unsuccessful in trying to model this.

It does not seem to be an issue of the precision of the C++ double type. Using boost::multiprecision and its arbitrary-precision binary floating-point type with 100 digits, the result of an operation like this can be calculated with more accuracy. But once converted back down to a double, the result is exactly the same as computing double foo = (value2 * value3 + value1) directly.

It seems like an issue of rounding. The native result and the lifted result of the instruction differ only in the least significant bit. The default C++ floating-point rounding mode is round-to-nearest (FE_TONEAREST). In testing the behavior of VFMADD213SD, it sometimes generates results equivalent to the C++ rounding mode FE_DOWNWARD, and other times its results match FE_UPWARD. In other words, controlling fenv only lets us match its behavior about 50% of the time.
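
For reference, a minimal sketch of the kind of experiment I mean (the inputs 1.2, 2.3, 3.4 are just an example; std::fma stands in here for the native VFMADD213SD result, and strictly speaking -frounding-math or #pragma STDC FENV_ACCESS is needed before the compiler promises to respect the fesetround calls):

#include <cfenv>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>

// Print a double as a decimal and as its raw bit pattern.
static void show(const char *label, double v) {
    uint64_t raw = 0;
    std::memcpy(&raw, &v, sizeof(raw));
    std::printf("%-10s %.17g (0x%016" PRIx64 ")\n", label, v, raw);
}

int main() {
    // volatile keeps the compiler from folding the expressions at compile time.
    volatile double a = 1.2, b = 2.3, c = 3.4;

    // Single-rounding fused multiply-add, as the hardware instruction does it.
    show("fused", std::fma(b, a, c));

    // Unfused multiply-then-add under different rounding modes.
    std::fesetround(FE_TONEAREST);
    show("nearest", b * a + c);
    std::fesetround(FE_DOWNWARD);
    show("downward", b * a + c);
    std::fesetround(FE_UPWARD);
    show("upward", b * a + c);
    std::fesetround(FE_TONEAREST);
    return 0;
}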

ranweiler commented 6 years ago

Hi! This bug made me curious, but I'm having trouble reproducing it locally.

Based on this comment, my expectation is that adding the inputs 1.2, 2.3, 3.4 to the constant VFMADDSD_INPUTS_64 should elicit a failure. Is that right? If so, I'm unable to locally repro on an Intel Core i7-6700K on Linux 4.14.5. I also can't repro on CI (example branch build on my account here).

mike-myers-tob commented 6 years ago

Well that's good to hear, and I can revisit this.

I think Peter added something to the test runner that ignores small floating-point rounding errors, though. Could this be covering up the issue? https://github.com/trailofbits/remill/blob/master/tests/X86/Run.cpp#L538

pgoodman commented 6 years ago

That's for x87 precision differences, not SSE/AVX ones.

mike-myers-tob commented 6 years ago

For not completely understood reasons, this bug went away on its own (perhaps our migration to a newer LLVM, perhaps our improvements to state modeling). Thanks for noticing @ranweiler !

mike-myers-tob commented 6 years ago

Yea, so the reproducibility of the failing test case was temporarily hidden by issue #206, but it looks like this when the test case runs as it should:

gdb ./tests/X86/run-amd64_avx-tests -ex "b /path/to/remill/tests/X86/Run.cpp:568" -ex "handle SIGILL nostop noprint" -ex "handle SIGFPE nostop noprint" -ex "set \$native = (State *)&gNativeState" -ex "set \$lifted = (State *)&gLiftedState"
(gdb)  p/x $native->vec[0].xmm.sdqwords
$2 = {
  elems = {0x00000000000000004018a3d70a3d70a3}
}
(gdb)  p/x $lifted->vec[0].xmm.sdqwords
$3 = {
  elems = {0x00000000000000004018a3d70a3d70a4}
}
ranweiler commented 6 years ago

@mike-myers-tob, I think this issue might really be about the precision of C++ double. I came up with the following conceptual repro, based upon your known failing test case of 1.2, 2.3, 3.4. Restating for reader convenience, in the VFMADDSD tests, the native result is 0x4018a3d70a3d70a3 and the lifted is 0x4018a3d70a3d70a4.

The following C program with the GNU GMP library reproduces the discrepancy above:

#include <gmp.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

void print_f64(double f) {
    uint64_t raw = 0;
    memcpy(&raw, &f, sizeof(raw));

    printf("f64: %f\n", f);
    printf("raw is %lx\n", raw);
}

int main() {
    mpf_t a, b, c, t;
    double out;

    // Use at least 2*53 + 1 bits of working precision so the intermediate
    // product and sum are held precisely enough for a correct single
    // rounding back to double (GMP's default mpf precision can be as low
    // as 64 bits).
    mpf_set_default_prec(128);

    mpf_init_set_d(a, 1.2);
    mpf_init_set_d(b, 2.3);
    mpf_init_set_d(c, 3.4);
    mpf_init(t);

    // t := b*a
    mpf_mul(t, b, a);

    // t := t + c
    mpf_add(t, t, c);

    // out = round_nearest(t)
    out = mpf_get_d(t);

    // Implements FMA spec, agrees with native result.
    print_f64(out);

    // Implements current lifted semantics and result.
    print_f64(2.3*1.2 + 3.4);

    mpf_clear(a);
    mpf_clear(b);
    mpf_clear(c);
    mpf_clear(t);
}

I note that the printf output with %f is the same for both values (the default six digits after the decimal point hide a one-ulp difference), even though we can see the difference in the raw bytes.

I've poked around a bit, and it seems like there may be a good modeling approach indicated in the paper "Emulation of FMA and correctly-rounded sums: proved algorithms using rounding to odd": https://www.lri.fr/~melquion/doc/08-tc.pdf.

ranweiler commented 6 years ago

Okay, so the algorithm in the above paper seems to work well. Partly to satisfy my own curiosity, I now have a fairly thoroughly fuzz-tested implementation (whose test harness I can use to generate further failure cases for the existing semantics). It uses a few tricks to emulate a single-rounding FMA via f64-only arithmetic.

Another option is trying to directly leverage the llvm.fma intrinsic, though I think that'd be breaking new ground, and it isn't at all clear to me what semantics or guarantees users should expect from the LLVM intrinsic. It may be strictly preferable to take the former approach, for that reason.

mike-myers-tob commented 6 years ago

Ah! I will read the paper you linked from the year 2099 (?), and if you've already implemented something then I'm interested in that too. Briefly looking at the paper though, it seems like the root problem is a rounding behavior called "round to odd" that is performed by the FMA instructions but is not a mode supported by the x87 FPU, is that right? This would definitely explain the results I'd seen.

Because we avoid GPL code, I had not tried anything with GNU GMP, but I had done similar tests with Boost multiprecision. But that was only to help me troubleshoot, because I believe Peter also wanted to avoid a dependency on Boost.

ranweiler commented 6 years ago

the year 2099 (?)

Welcome to the future! :rocket:

Briefly looking at the paper though, it seems like the root problem is a rounding behavior called "round to odd" that is performed by the FMA instructions but is not a mode supported by the x87 FPU, is that right?

Yep, rounding-to-odd is not a rounding mode that's part of the x87 FPU (or the IEEE standard), and that's why we can't model the correct behavior just by adjusting the floating point environment. The algorithm in that paper actually uses both round-to-odd (which we can implement as a subroutine), and then the usual FE_TONEAREST. Hardware FMA designs that I've scrounged up don't appear to explicitly use or need round-to-odd. I'll stick the FMA emulation code in a repo shortly, to give you something to chew on.
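
In the meantime, here is a minimal sketch of round-to-odd addition, not the code from my harness: take the round-to-nearest sum, recover the exact rounding error with Knuth's TwoSum, and if the error is nonzero force the result onto the neighboring double whose last significand bit is odd. NaNs, infinities, overflow, and the edge cases around them are deliberately ignored here:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Knuth's TwoSum error term: returns e such that a + b == s + e exactly,
// where s = a + b rounded to nearest.
static double TwoSumError(double a, double b, double s) {
    double bv = s - a;
    double av = s - bv;
    double br = b - bv;
    double ar = a - av;
    return ar + br;
}

// Round-to-odd addition: if a + b is not exactly representable, return the
// bracketing double whose last significand bit is 1.
static double AddRoundToOdd(double a, double b) {
    double s = a + b;
    double e = TwoSumError(a, b, s);
    if (e != 0.0) {
        uint64_t bits = 0;
        std::memcpy(&bits, &s, sizeof(bits));
        if ((bits & 1) == 0) {
            // The nearest result landed on an even significand; the other
            // neighbor bracketing the exact sum (in the direction of the
            // error) has an odd one.
            s = std::nextafter(s, e > 0.0 ? HUGE_VAL : -HUGE_VAL);
        }
    }
    return s;
}

int main() {
    // 1.0 + 2^-60 is not representable; round-to-odd keeps a sticky bit,
    // printing 0x1.0000000000001p+0 instead of 0x1p+0.
    std::printf("%a\n", AddRoundToOdd(1.0, std::ldexp(1.0, -60)));
    return 0;
}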

Because we avoid GPL code, I had not tried anything with GNU GMP, but I had done similar tests with Boost multiprecision.

For sure! The GMP code was similarly just to give you something concrete to check, since the initial Boost experiments weren't surfacing the bug. The FMA emulation does not require bignums or any new dependencies.

ranweiler commented 6 years ago

Okay! Back from the holidays, with some conclusions. My experiments in FMA emulation can be found here.

I think the round-to-odd based emulation is a no-go, for both theoretical and practical reasons. Why:

IMO, the RTO algo is only worth including if it can match the FMA instruction exactly. If not, we are adding somewhat fragile complexity for no real benefit.

I did some more research, and found some work that analyzes the circumstances in which double-rounding is innocuous. The punch line is: if our inputs have significand precision p, and we perform floating point arithmetic at precision 2p + 1 (with intermediate rounding at that higher precision), then round back to p, it will be as if we had done a single rounding.

This suggests that if we implement f128 arithmetic (the usual IEEE binary128 format stores 112 fraction bits, for a significand precision of 113 > 107 = 53*2 + 1), we can emulate FMA for f64s, which have p = 53. We could do this by implementing only f128 add/mul ourselves (subtle, a maintenance burden), by using Boost/GMP/MPFR (the latter two are LGPL), or by looking for some other tiny quad-precision float library (rather than a full MP lib); a quick sanity check of this idea using the compiler's __float128 is sketched below. I note that we have a similar issue with f80 registers in #199. This could address that problem, too.

Finally, we could make FMA a Remill intrinsic. A Remill consumer could then e.g. implement that intrinsic with an llvm.fma LLVM intrinsic, with regular/lossy f64 arithmetic, or something else, depending on what is important for their use case. This could be a compile-time option for Remill consumers who really need to avoid precision loss.

I don't really like the idea of adding an intrinsic, and my gut feeling is that we could come up with a very compact f128 add/mul implementation. We don't need to care about efficiency or transcendental floating-point operations, which means we can focus on making the impl simple and correct.
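
As a cheap sanity check of the widen-then-round idea, here is a sketch only, leaning on the __float128 extension that GCC/Clang provide on x86-64 (whose arithmetic is correctly rounded binary128) rather than anything we would actually ship:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Print a double's raw bit pattern, as in the tests above.
static void show(const char *label, double v) {
    uint64_t raw = 0;
    std::memcpy(&raw, &v, sizeof(raw));
    std::printf("%-10s 0x%016llx\n", label, static_cast<unsigned long long>(raw));
}

// Widen-then-round FMA: binary128 precision (113) exceeds 2*53 + 1, so
// rounding the f128 result of a*b + c back down to f64 should match a
// single-rounded FMA.
static double WideFMA(double a, double b, double c) {
    __float128 t = static_cast<__float128>(a) * b + c;
    return static_cast<double>(t);
}

int main() {
    double a = 1.2, b = 2.3, c = 3.4;
    show("hardware", std::fma(b, a, c));  // 0x4018a3d70a3d70a3
    show("widened", WideFMA(b, a, c));    // expected to match the hardware result
    show("unfused", b * a + c);           // 0x4018a3d70a3d70a4 (double-rounded)
    return 0;
}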

What do you think?

pgoodman commented 6 years ago

I am curious about the case of spurious NaNs in the round-to-odd algorithm, and whether or not the current non-fused operations produce NaNs in those cases. If the unfused multiply+add does not produce a NaN where the fused one does, then this is interesting and possibly useful.

The impact of leaving this as-is, or going the intrinsic route (not bad, but yeah, I don't want to just start collecting intrinsics for every issue), is not clear to me. If compilers auto-fuse instructions, then they are making some programs "more precise." Other programs, e.g. linear algebra ones, likely use such operations explicitly.

A key thing for me when evaluating whether or not to make some semantics more precise is that I want the bitcode produced to be an "analysable" artefact in and of itself, and so not rely on any third-party libraries; any such dependency would need to be hidden behind an intrinsic.

ranweiler commented 6 years ago

[...] whether or not the current non-fused operations produce NaNs in those cases.

In my property tests, whenever the FMA instruction produced a non-NaN and the RTO emulation produced a NaN, the unfused/double-rounded f64 multiply-add produced a non-NaN. In other words, unfused arithmetic has a softer failure mode than the RTO emulation: lost precision rather than a spurious NaN.

An input for such a case:

    double a = 0x1.8f092f46d0ep-344;
    double b = 0x1.25262c8884cp+1008;
    double c = -0x1.33208c12178p-544;

If you step through the emulation code, in the second call of the split() function in exact_mul(), the line double c = f * a gets rounded to +inf, which makes the next line become -nan, which then propagates.
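
For reference, here is a stripped-down Dekker-style split in the same spirit as the split()/exact_mul() in the emulation code (the repo is the reference; details there may differ), showing how that input blows up. Compile with -std=c++17 for the hexadecimal float literal:

#include <cstdio>

// Dekker's splitting: write a as hi + lo where hi holds the top ~27
// significand bits, so that products of the halves are exact. The catch:
// the scaled value f * a overflows for inputs near the top of the
// exponent range.
static void Split(double a, double *hi, double *lo) {
    const double f = 134217729.0;  // 2^27 + 1
    double c = f * a;              // rounds to +inf for the input below
    *hi = c - (c - a);             // inf - inf -> NaN
    *lo = a - *hi;                 // NaN propagates
}

int main() {
    double b = 0x1.25262c8884cp+1008;  // the b from the failing input above
    double hi = 0.0, lo = 0.0;
    Split(b, &hi, &lo);
    std::printf("hi = %a, lo = %a\n", hi, lo);  // both come out as NaN
    return 0;
}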

Some numbers from random testing:

The punch line is that falling back on unfused arithmetic seems to solve the spurious NaN problem, but doesn't get us bit-exact emulation. For that reason, adding RTO emulation doesn't seem worth the maintenance burden.

If compilers auto-fuse instructions, then they are making some programs "more precise."

True. In my mind, it would be nice to be able to observe (in the lifted code) if the original compiler (or programmer) did that fusing or not, and where. If I tried to have my compiler automatically recover the original fusing when recompiling the lifted bitcode, I might add FMAs where they didn't previously exist, and miss others. On the other hand, it may not be important for the types of analysis we care about.

I want the bitcode produced to be an "analysable" artefact in and of itself

This criterion seems right to me. To that end, as a Remill library consumer, I think I would prefer an optional intrinsic to a mandatory low-precision approximation.

If I don't care about perfect numerical precision, I can implement the intrinsic using normal double-rounding. Ideally I could just tell Remill to do that for me in advance, resulting in simpler lifted bitcode. If I do care about precision, then asking for an intrinsic means I can use a library, emit a concrete fused instruction, &c. In that sense, an intrinsic does seem more analyzable than, say, a Remill-generated implementation of f128 arithmetic. An optional intrinsic seems to be most analyzable.
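
To make that concrete, here is a hedged sketch of what the precision-caring consumer's lowering could look like, assuming a hypothetical __remill_fma_f64 declaration with the signature double(double, double, double); the name is purely illustrative and nothing like it exists in Remill today:

#include <vector>

#include <llvm/IR/Instructions.h>
#include <llvm/IR/Intrinsics.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/Casting.h>

// Rewrite calls to a hypothetical `__remill_fma_f64(double, double, double)`
// so that they call LLVM's `llvm.fma.f64` instead. A consumer that prefers
// lossy f64 arithmetic could instead inline plain `a * b + c` at these sites.
void LowerRemillFMA(llvm::Module &module) {
    auto *remill_fma = module.getFunction("__remill_fma_f64");
    if (!remill_fma) {
        return;
    }

    auto *double_ty = llvm::Type::getDoubleTy(module.getContext());
    auto *llvm_fma = llvm::Intrinsic::getDeclaration(
        &module, llvm::Intrinsic::fma, {double_ty});

    // Collect the calls first; retargeting them mutates the use list.
    std::vector<llvm::CallInst *> calls;
    for (auto *user : remill_fma->users()) {
        if (auto *call = llvm::dyn_cast<llvm::CallInst>(user)) {
            calls.push_back(call);
        }
    }
    for (auto *call : calls) {
        call->setCalledFunction(llvm_fma);
    }
}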

This largely loops back to the other issue (#199) about using f64s to model f80 registers, and whether supporting functional round-tripping is a goal or non-goal.

pgoodman commented 6 years ago

It's been a few days since I've thought about this, but I think it comes down to something like: