llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.25k stars 12.08k forks source link

optimize reciprocals with fast-math (x86) #21759

Open rotateright opened 10 years ago

rotateright commented 10 years ago
Bugzilla Link 21385
Version trunk
OS All
CC @hfinkel,@RKSimon,@rnk,@tycho

Extended Description

This is the sibling to bug 20900 (square root estimates).

We could be generating RCPSS / RCPPS instead of divisions when -ffast-math allows it.

I thought this would be simpler than square root, but we'll need to again measure the performance gains per CPU to decide where to enable it, and we may need to offer a 2nd Newton-Raphson refinement implementation for x86 that lacks FMA.

I've also written a test program (below) that tries to confirm the goodness of the estimate. It shows that one N-R step is inadequate to get us to 1 ULP of the IEEE result on both Intel Sandy Bridge and AMD Jaguar.

Most unfortunately, we get the wrong answer for a reciprocal of 1.0f! This is why RCPPS isn't generated with gcc -ffast-math.

Both Intel and AMD manuals say that the estimate instruction provides almost 12-bits of accuracy, and one N-R step should double that accuracy to 24-bits which is good enough for a single-precision float, but no...that "almost 12-bits" is definitely not good enough. We need two N-R steps.

Test program:

include

include

include

include

float recip_ieee(float x) { return 1.0f/x; }

float recip_est(float x) { float est; m128 vecX = { x,0,0,0 }; m128 vecEst = _mm_rcp_ss(vecX); _mm_store_ss(&est, vecEst); return est;
}

float recip_est_NR1(float x) { float est; m128 vecX = { x,0,0,0 }; m128 vecEst = _mm_rcp_ss(vecX); _mm_store_ss(&est, vecEst);

est = est + est * (1 - est * x);
return est;  

}

float recip_est_NR2(float x) { float est; m128 vecX = { x,0,0,0 }; m128 vecEst = _mm_rcp_ss(vecX); _mm_store_ss(&est, vecEst);

est = est + est * (1 - est * x);
est = est + est * (1 - est * x);
return est;  

}

int main() { int i; unsigned int tests = 0; unsigned int not_exact_NR1 = 0; unsigned int not_exact_NR2 = 0; unsigned int bad_NR1 = 0; unsigned int bad_NR2 = 0; // ignore denorm inputs (anything less than 0x00800000) // ignore denorm outputs (any input over 0x7e800000) // ignore inf (0x7f800000) // ignore nans (anything more than 0x7f800000) // ignore negative (anything more than 0x80000000) for (i = 0x00800000; i < 0x7e800000; i++) { float f, ieee, est, est_nr1, est_nr2; int ieee_int, est_int, est_nr1_int, est_nr2_int;

    tests++;
    memcpy(&f, &i, 4);
    ieee = recip_ieee(f);

// est = recip_est(f); est_nr1 = recip_est_NR1(f); est_nr2 = recip_est_NR2(f); memcpy(&ieee_int, &ieee, 4); // memcpy(&est_int, &est, 4); memcpy(&est_nr1_int, &est_nr1, 4); memcpy(&est_nr2_int, &est_nr2, 4);

// if (i % 0x01000000 == 0) printf("0x%08x\n", i); // just a status print...

    if (ieee != est_nr1) {

        not_exact_NR1++;
        if (abs(ieee_int - est_nr1_int) > 1) {
            bad_NR1++;

// printf("0x%08x: %e, ieee = %e (0x%08x), est_nr1 = %e (0x%08x), est_nr2 = %e (0x%08x)\n", // i, f, ieee, ieee_int, est_nr1, est_nr1_int, est_nr2, est_nr2_int); } } if (ieee != est_nr2) { not_exact_NR2++; if (abs(ieee_int - est_nr2_int) > 1) { bad_NR2++; } } } printf("Total tests = %d\n", tests); printf("Inexact results with one N-R step = %d\n", not_exact_NR1++); printf("Results off by > 1 ULP with one N-R step = %d\n", bad_NR1); printf("Inexact results with two N-R steps = %d\n", not_exact_NR2++); printf("Results off by > 1 ULP with two N-R steps = %d\n", bad_NR2); return 0; }

rotateright commented 8 years ago

sqrt_accuracy_tester.c

rotateright commented 9 years ago

This should match gcc's defaults.

That was wrong: bug 24063. It's more complicated than I thought.

rotateright commented 9 years ago

After: http://reviews.llvm.org/rL240310

I think we finally have the behavior that was requested in comment 23 (for x86 only though): we generate square root and reciprocal estimate instructions with -ffast-math except for scalar reciprocal estimates. This should match gcc's defaults.

The interface is identical to gcc: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/i386-and-x86-64-Options.html#index-mrecip_003dopt-1627

except for one addition: you can control the number of estimate refinement steps using something like this: $ clang -O2 -ffast-math -mrecip=sqrtf:2

The '2' means use two refinement steps when generating a reciprocal square root estimate for a single-precision (float).

The only thing missing (I hope): it's completely undocumented!

I assume something should be added here: http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation

and we should have a line of helptext for the clang option, so you at least know the option exists with "clang -help".

rotateright commented 9 years ago

There's no way we can do what the customer wants in all cases, so give them more control: http://reviews.llvm.org/D8982 http://reviews.llvm.org/D8989

rotateright commented 9 years ago
  1. It sounds like people would be satisfied with only using the reciprocal square root estimate by default in vector operations. That is, there's not sufficient justification to enable it for scalar operations because of the latency impact. Is that correct? If so, what would need to change to make it only enable the rsqrt estimate for vector operations?

No - the vector versions of the real instructions and the estimate instructions tend to have the same latency/throughput as the scalar versions. (There are special cases like 256-bit vectors on Jaguar, where everything has to be double-pumped for 256-bit vectors.)

So in general, we have the same issue with vectors as scalars. As long as you can show that N pipelined executions of the vector estimate sequences are faster than their full-precision counterparts, then we could enable that codegen by default.

rotateright commented 9 years ago
  1. Aren't the instruction scheduling definitions in lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td sufficient for LLVM to make smart decisions about whether or not the latency cost of the reciprocal square root estimate is worth it?

I agree that the sched defs should provide more info to make a decision like this, and I left FIXME comments in X86TargetLowering::getRsqrtEstimate() and X86TargetLowering::getRecipEstimate() about that.

But when I looked into it, I didn't see a way to get that info into X86TargetLowering. It's entirely possible that I just overlooked some API to make it happen.

Note that the vectorizers don't use the sched timings either. They use an alternate cost model (bug 21356, comment 5)...I don't know why we have this other cost model.

The harder problem, as I wrote in the comments, is that Intel chips (and AMD's?) have logic in the FPU HW that makes div/sqrt timing variable based on the values of the operands. I don't see any way to account for that statically. We might be able to detect some simple cases with some kind of FP value tracking (see bug 17713), but there's no way to solve that in general.

tycho commented 9 years ago

Some naive questions:

  1. Aren't the instruction scheduling definitions in lib/Target/X86/X86SchedSandyBridge.td and lib/Target/X86/X86SchedHaswell.td sufficient for LLVM to make smart decisions about whether or not the latency cost of the reciprocal square root estimate is worth it?

  2. It sounds like people would be satisfied with only using the reciprocal square root estimate by default in vector operations. That is, there's not sufficient justification to enable it for scalar operations because of the latency impact. Is that correct? If so, what would need to change to make it only enable the rsqrt estimate for vector operations?

rotateright commented 9 years ago

Isn't hiding this behind -ffast-math already a sufficient "use at your own risk" caveat?

Sure, but -ffast-math is a big hammer. We don't want users to have to turn that off entirely just because the estimate codegen wasn't accurate or fast enough. We're providing fine grain tuning with the other flags (see also bug 20912).

There are 2 issues though, so let's try to separate this:

(1) Is the accuracy good enough to enable this by default?

(2) Is the performance better enough to enable this by default?

I think I've already answered (1) with a "yes" when using one N-R step. The data for Jaguar, Sandy Bridge, and Haswell is in this bug report. SPEC be damned. This is just restating what Hal said in comment 24.

If one N-R step isn't good enough (or the app can tolerate even less accurate results), the user can change that setting.

For (2), we already know that latency for an estimate sequence won't be as good on the Intel chips but throughput will improve.

We should determine exactly how many parallel calcs are needed to make the estimate code perform better. I think that answer will be just 2 or 3...that seems like a low enough cross-over that we would then feel justified in enabling the codegen for the chip in question. Ie, we wouldn't expect most real world code to be perf-limited by the latency of 1 or 2 div/sqrt ops.

tycho commented 9 years ago

Isn't hiding this behind -ffast-math already a sufficient "use at your own risk" caveat?

rotateright commented 9 years ago

I think you need to be careful about setting criterion based on "break[ing] real-world code" when talking about precision-reducing optimizations like this. No matter what accuracy threshold you set, you'll always find something you break.

Exactly - if you review the data in comment 7 and comment 8, you'll see that rsqrt estimates actually have worse accuracy using a single N-R step than rcp estimates (relative to the IEEE answer). But GCC lets that go because it doesn't affect SPEC.

Every app will have a different tolerance for FP accuracy, so we're allowing the user to turn that codegen on/off, and in the case of rcp, we even have a knob (although poorly implemented with a cl::opt) to change the number of N-R steps. I'd like to add a similar N-R step param for rsqrt, but I was holding off pending some kind of resolution for how to do that better (see: bug 21471 and http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078785.html).

hfinkel commented 9 years ago

Ugh. Is there a reason we didn't see precision issues with rcpss in your testing? Microarchitecture differences? Missed some edge cases?

If rcpss breaks real-world code then that's one thing.

I think you need to be careful about setting criterion based on "break[ing] real-world code" when talking about precision-reducing optimizations like this. No matter what accuracy threshold you set, you'll always find something you break.

We can't test every possible code sequence under the sun or we'd never get anything done, but we can run it through smoke testing. And we can always revert a change if it breaks something.

I'd like to figure out a sane criteria for enabling these optimizations. How can we proceed? (I'm hoping for specific steps here because I don't want to see this fall by the wayside.)

Sanjay has posted his accuracy testers, and running those would be good. I recommend testing speed with an N-body code, or similar kernel. That should be sufficient.

tycho commented 9 years ago

Ugh. Is there a reason we didn't see precision issues with rcpss in your testing? Microarchitecture differences? Missed some edge cases?

If rcpss breaks real-world code then that's one thing. We can't test every possible code sequence under the sun or we'd never get anything done, but we can run it through smoke testing. And we can always revert a change if it breaks something.

I'd like to figure out a sane criteria for enabling these optimizations. How can we proceed? (I'm hoping for specific steps here because I don't want to see this fall by the wayside.)

rotateright commented 9 years ago

GCC does not, however, emit reciprocal estimates, even with -march=haswell. It's possible that GCC does not implement any selection of RCPSS:

This is explained by: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1

Using rcpss broke SPEC2006, so they decided to hide that codegen behind another flag. I don't think we want to play that kind of benchmarking game with LLVM.

tycho commented 9 years ago

I'm not sure I'm the best person to come up with a set of tests for this issue, but if someone can design some tests I have a bunch of Intel hardware around: NHM, WSM, SBR, IVB, and HSW. Not sure I have anything older than Nehalem, but I'm not sure anyone cares about code generation on ancient hardware.

From my testing using my n-body test, I've found the performance benefits are remarkable, and the accuracy when comparing iterations between '-O3 -ffast-math' and '-O3 -mfpmath=387' showed that the margin of error was incredibly small. I'll try to characterize these two statements with Hard Facts(tm) once I re-run and record some data. But this test alone is probably not enough to satisfy the questions regarding codegen diversity.

rotateright commented 9 years ago

I'd like to reopen this issue to ask why FeatureUseSqrtEst and FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable benefits on many if not all of the other x86 microarchitectures. Is there a reason it cannot be enabled more broadly? ... I don't think that the accuracy concerns are particularly significant (even if the estimates are slightly less accurate than the AMD implementations). In reality, I think that someone just needs to do the benchmarking on these systems to show that it is worthwhile.

I think this can all be done with command-line args at this point with the in-tree code.

That's correct. And I agree that both rcp and rsqrt can provide significant throughput perf benefits on most (all?) recent x86 chips.

But I didn't want to overstep with the initial patch, so I proposed and checked in the changes to enable these by default on Jaguar only because they clearly provide a throughput and latency win there. The only way that I can think of where using the estimates could cause a loss on Jaguar is via increased register pressure/usage.

Each of rsqrt and rcp codegen can be enabled via a subtarget feature flag, so someone just needs to run benchmarks that they care about on chips that they care about to change the default settings.

Between the number of N-R iterations, N-R implementation, FMA, and HW accuracy, there are many permutations of the codegen that should still be tested. Some of that experimenting will require changing code in LLVM, but I think all of the info and tools needed for that are in this bug report. I can provide assistance, but that work is out of my current scope.

hfinkel commented 9 years ago

I'd like to reopen this issue to ask why FeatureUseSqrtEst and FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable benefits on many if not all of the other x86 microarchitectures. Is there a reason it cannot be enabled more broadly?

...

At the very least I'd like to see reciprocal square root estimates added by default on x86 for -ffast-math. How can we get such a change implemented? Is it a matter of building confidence in the safety and benefit of such a change?

I don't think that the accuracy concerns are particularly significant (even if the estimates are slightly less accurate than the AMD implementations). In reality, I think that someone just needs to do the benchmarking on these systems to show that it is worthwhile.

I think this can all be done with command-line args at this point with the in-tree code.

tycho commented 9 years ago

I'd like to reopen this issue to ask why FeatureUseSqrtEst and FeatureUseRecipEst were only enabled for Jaguar. It has demonstrable benefits on many if not all of the other x86 microarchitectures. Is there a reason it cannot be enabled more broadly?

I'm not sure what GCC's criteria is for enabling it, but I've seen reciprocal square root estimate enabled on every x86 -march= I know of whenever -ffast-math is specified and SSE is available. For example, even as far back as -march=pentium3 rsqrtss is used:

float rsqrtf(float f) { return 1.0f / sqrtf(f); }

$ gcc -m32 -O3 -ffast-math -mfpmath=sse -march=pentium3 -S -o - rsqrt.c [...] rsqrtf:

    subl    $4, %esp
    rsqrtss 8(%esp), %xmm1
    movss   8(%esp), %xmm0
    mulss   %xmm1, %xmm0
    mulss   %xmm1, %xmm0
    mulss   .LC1, %xmm1
    addss   .LC0, %xmm0
    mulss   %xmm1, %xmm0
    movss   %xmm0, (%esp)
    flds    (%esp)
    popl    %eax
    ret

.LC0: .long 3225419776 .LC1: .long 3204448256

GCC does not, however, emit reciprocal estimates, even with -march=haswell. It's possible that GCC does not implement any selection of RCPSS:

float recipf(float f) { return 1.0f / f; }

$ gcc -O3 -ffast-math -mfpmath=sse -march=haswell -S -o - recip.c [...] vmovss .LC0(%rip), %xmm1 vdivss %xmm0, %xmm1, %xmm0 ret .LC0: .long 1065353216

At the very least I'd like to see reciprocal square root estimates added by default on x86 for -ffast-math. How can we get such a change implemented? Is it a matter of building confidence in the safety and benefit of such a change?

rotateright commented 10 years ago

Hi Reid,

I wasn't aware that we were trying to get rid of these, so I was using the existing cl::opt code that I've seen in the vectorizers and backend as a model.

Certainly, if there's a better way, I'll change this. Is there an existing param that you recommend using as a template for this sort of code?

rnk commented 10 years ago

I have no context for any of the floating point precision stuff, but is a backend flag really the right way to expose this? I thought we were about to launch a witch hunt to remove all these flags that create hidden global state.

Are you sure you don't want an instruction, module, or function flag?

RKSimon commented 10 years ago

Is it worth adding a test that checks that FMA instructions are being used on FMA3/FMA4 targets?

rotateright commented 10 years ago

Refinement accuracy made configurable via cl::opt: http://llvm.org/viewvc/llvm-project?view=revision&revision=221819

With that, I think we can resolve this bug as fixed.

The customer can control generation of rcpps/rcpss with "-use-recip-est" and control the speed vs. accuracy trade-off with "-x86-recip-refinement-steps".

rotateright commented 10 years ago

With FMA3 or FMA4 enabled are the FMA instructions getting used in the fast-math NR iteration automatically or is it something that has to be explicitly set?

FMA should be automatically generated when -ffast-math is used.

For the test program, I made sure I wasn't replacing the actual div instructions or using any other FP shortcuts by compiling with:

$ clang recip_accuracy_tester.c -O2 -march=core-avx2 -ffp-contract=fast

RKSimon commented 10 years ago

With FMA3 or FMA4 enabled are the FMA instructions getting used in the fast-math NR iteration automatically or is it something that has to be explicitly set?

rotateright commented 10 years ago

For the record, I ran the attached reciprocal test program on Haswell (AVX2), and it looks like Intel did not alter the estimate function vs. Sandy Bridge. These are identical results to SB:

$ ./recip_avx2_no_fma Total tests = 2113929216

Inexact results with one N-R step = 657670214 Estimate missed by 1 ULP: 647462754 Estimate missed by 2 ULP: 10186690 Estimate missed by 3 ULP: 20770 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 353480402 Estimate missed by 1 ULP: 353480402 Estimate missed by 2 ULP: 0 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

But when we use FMA for the N-R refinement (the 4-step implementation just becomes 2 FMAs), we get much more accurate results:

$ ./recip_avx2_fma Total tests = 2113929216

Inexact results with one N-R step = 331401168 Estimate missed by 1 ULP: 330208704 Estimate missed by 2 ULP: 1192464 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 252 Estimate missed by 1 ULP: 252 Estimate missed by 2 ULP: 0 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

rotateright commented 10 years ago

Allow rcp* codegen for btver2 checked in with: http://reviews.llvm.org/rL221706

I still need to expose the number of refinement steps as a command-line option.

We may also want to offer multiple N-R implementations as an option (as we did for rsqrt).

rotateright commented 10 years ago

Patch proposal for review: http://reviews.llvm.org/D6175

rotateright commented 10 years ago

The final question - at least in my mind - is: what is the output around 1.0f? gcc cited that as part of the reason for not enabling recip with -ffast-math, but that doesn't match the internal discussion or the actual behavior; they do generate rcpps (vector) with -ffast-math, so not generating rcpss (scalar) for accuracy reasons does not follow. The 2 instructions should have identical output on all HW.

input 0x3f7fffff: 9.999999e-01 ieee = 1.000000e+00 (0x3f800001) est_nr1 = 1.000000e+00 (0x3f800000) est_nr2 = 1.000000e+00 (0x3f800000)

input 0x3f800000: 1.000000e+00 ieee = 1.000000e+00 (0x3f800000) est_nr1 = 9.999999e-01 (0x3f7fffff) est_nr2 = 1.000000e+00 (0x3f800000)

input 0x3f800001: 1.000000e+00 ieee = 9.999999e-01 (0x3f7ffffe) est_nr1 = 9.999998e-01 (0x3f7ffffd) est_nr2 = 9.999999e-01 (0x3f7ffffe)

Using one N-R step, we're off by 1 ULP in all 3 cases. The second N-R step corrects the second and third cases to match the IEEE result.

This is using the 4 op N-R sequence on Jaguar.

So I agree with Hal's statement in comment 4 that we should offer the number of N-R refinement steps as a parameter to the user. They can weigh the perf vs. accuracy trade-off per application. I would default to one N-R step.

rotateright commented 10 years ago

Now back to the perf question. Let me focus just on Jaguar for the moment because we'll want to enable this per-CPU for x86 as we did with rsqrt and having a feature flag for this codegen provides an easy opt-out or -in for users.

The divss instruction (IEEE-compliant) takes 14 cycles (not pipelined of course). The rcpss instruction alone takes 2 cycles. The rcpss inst + one NR step using mul/sub/mul/add takes 12 cycles. The rcpss inst + two NR steps using the same mul/sub/mul/add sequence takes 22 cycles.

The breakdown of the refinement cost is: 2 cycles for mul, 3 cycles for add/sub (how this chip can multiply faster than add is a different story :) ). For the 4 instruction N-R implementation, all ops are pipelined but dependent, so we just add up the individual cycle counts.

So we have a clear win for using a single N-R step with -ffast-math: it's faster in raw latency and 14x faster for throughput because all instructions are pipelined. Using 2 N-R steps means we have to weigh the latency vs. throughput scenarios.

rotateright commented 10 years ago

The total misses are much higher, but we actually have slightly fewer 3 ULP misses with the single refinement step. Probably not worth saving a flop for that?

Disregard - too many data points! I was looking at the Sandy Bridge number (20770) rather than the Jaguar number (5481). Using the longer equation appears to produce better results in all head-to-head categories for Jaguar.

rotateright commented 10 years ago

There are at least two additional variables that we may want to consider:

  1. The implementation of the refinement algorithm matters. For example, we can save an fmul on the reciprocal N-R step by coding the N-R equation as:

          est1 = est0 * (2.0f - est0 * f);

instead of: est1 = est0 + est0 (1.0f - est0 f);

But this leads to a less accurate final answer according to: http://en.wikipedia.org/wiki/Division_algorithm#Fast_division_methods

And testing on Jaguar appears to confirm that:

Total tests = 2113929216

Inexact results with one N-R step = 833587524 Estimate missed by 1 ULP: 819988344 Estimate missed by 2 ULP: 13579776 Estimate missed by 3 ULP: 19404 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 617003604 Estimate missed by 1 ULP: 616978404 Estimate missed by 2 ULP: 25200 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

The total misses are much higher, but we actually have slightly fewer 3 ULP misses with the single refinement step. Probably not worth saving a flop for that?

  1. Everything should be retested on a machine (like Haswell) that has FMA. The added precision of FMA should mean better results, and it's possible that the initial estimate has improved too (certainly hope it hasn't gotten worse).
rotateright commented 10 years ago

So Jaguar's estimate function looks better, particularly for rsqrt. It doesn't miss either estimate by more than 3 ULPs using one refinement step. Sandy Bridge only gets to within 4 ULPs on rsqrt for some cases.

I think the good news is that at least we're within 4 ULPs in all cases.

If we judge the recip estimate purely on this accuracy metric, the single refinement result for it is better than what we (and other compilers) already accept as good enough for rsqrt (where we use just one refinement step).

rotateright commented 10 years ago

And the same programs running on Jaguar:

Reciprocal estimate:

Total tests = 2113929216

Inexact results with one N-R step = 646439942 Estimate missed by 1 ULP: 641272812 Estimate missed by 2 ULP: 5161649 Estimate missed by 3 ULP: 5481 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 374437446 Estimate missed by 1 ULP: 374437446 Estimate missed by 2 ULP: 0 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

Rsqrt estimate:

Total tests = 2130706432

Inexact results with one N-R step = 793771474 Estimate missed by 1 ULP: 773089565 Estimate missed by 2 ULP: 20629077 Estimate missed by 3 ULP: 52832 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 723046956 Estimate missed by 1 ULP: 714982537 Estimate missed by 2 ULP: 8064404 Estimate missed by 3 ULP: 15 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

rotateright commented 10 years ago

Before getting to the perf question, I want to make sure we understand the accuracy angle.

Here are the results from the attached test loops for recip and rsqrt running on Sandy Bridge:

Reciprocal estimate:

Total tests = 2113929216

Inexact results with one N-R step = 657670214 Estimate missed by 1 ULP: 647462754 Estimate missed by 2 ULP: 10186690 Estimate missed by 3 ULP: 20770 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 353480402 Estimate missed by 1 ULP: 353480402 Estimate missed by 2 ULP: 0 Estimate missed by 3 ULP: 0 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

Rsqrt estimate:

Total tests = 2130706432

Inexact results with one N-R step = 886838505 Estimate missed by 1 ULP: 830844699 Estimate missed by 2 ULP: 53541827 Estimate missed by 3 ULP: 2399100 Estimate missed by 4 ULP: 52879 Estimate missed by >= 5 ULP with one N-R step = 0

Inexact results with two N-R steps = 758708881 Estimate missed by 1 ULP: 742129840 Estimate missed by 2 ULP: 16578375 Estimate missed by 3 ULP: 666 Estimate missed by 4 ULP: 0 Estimate missed by >= 5 ULP with two N-R steps = 0

rotateright commented 10 years ago

rsqrt_accuracy_tester.c

rotateright commented 10 years ago

recip_accuracy_tester.c

hfinkel commented 10 years ago

On second thought...I wrote a similar test loop for the rsqrt estimate, and we don't get within 1 ULP in many cases, but that estimate is apparently generally accepted.

For "fast math" we can probably do with 1-2 ulps. This, of course, is also really application dependent.

This leads to the more general question: how accurate does fast-math need to be relative to IEEE math?

The real reason that gcc differs on rsqrt and recip optimizations has to do with passing a SPEC test rather than accuracy: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1

We should likely make this configurable (at least via a command-line option). How much does the performance change if you use two?

rotateright commented 10 years ago

On second thought...I wrote a similar test loop for the rsqrt estimate, and we don't get within 1 ULP in many cases, but that estimate is apparently generally accepted.

This leads to the more general question: how accurate does fast-math need to be relative to IEEE math?

The real reason that gcc differs on rsqrt and recip optimizations has to do with passing a SPEC test rather than accuracy: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55760#c1

rotateright commented 10 years ago

Test program output on Jaguar:

Total tests = 2113929216 Inexact results with one N-R step = 646439942 Results off by > 1 ULP with one N-R step = 5167130 Inexact results with two N-R steps = 374437446 Results off by > 1 ULP with two N-R steps = 0

rotateright commented 10 years ago

Test program output on Sandy Bridge:

Total tests = 2113929216 Inexact results with one N-R step = 657670214 Results off by > 1 ULP with one N-R step = 10207460 Inexact results with two N-R steps = 353480402 Results off by > 1 ULP with two N-R steps = 0