llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.07k stars 11.99k forks source link

X86 should enable aggressive FMA formation (at least on skylake) #36174

Open chandlerc opened 6 years ago

chandlerc commented 6 years ago
Bugzilla Link 36826
Version trunk
OS Linux
CC @topperc,@fhahn,@RKSimon,@rotateright

Extended Description

On skylake, vector FP add has the same latency and throughput as FMA, so we should definitely enable aggressive FMA formation.

On haswell and broadwell the issue is more complex. Add has shorter latency (3 cycles vs 5 cycles according to Agner) but half the throughput (one port rather than two ports). I suspect that this is sufficient for it to be worth using FMA everywhere, but we should benchmark to confirm.

If we can enable it everywhere, simply the following patch will suffice:

--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -838,6 +838,9 @@ namespace llvm {
       return 2;
     }

+    /// Force aggressive FMA fusion.
+    bool enableAggressiveFMAFusion(EVT VT) const override { return true; }
+
     /// Return the value type to use for ISD::SETCC.
     EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
                            EVT VT) const override;

Otherwise, we'll need to factor the per-subtarget facility for this out of the AArch64 backend and use it to enable this on a per-subtarget basis for the X86 backend as well.

rotateright commented 4 years ago

Trying to handle a related transform that does not require the aggressive TLI hook: https://reviews.llvm.org/D80801

I missed this bug and the simpler case though. I think these patterns could be generalized so that any one-use chain of (fadd ({ fma } (fmul))) can be rearranged to fuse the leading fmul and trailing fadd. fma can be 0 or more nodes in between.

From clang, this requires -ffp-contract=fast, but no other FP relaxed math AFAIK.

topperc commented 6 years ago

So for this case it looks like these two checks in DAGCombiner are relevant

// fold (fadd (fmul x, y), z) -> (fma x, y, z) if (isContractableFMUL(N0) && (Aggressive || N0->hasOneUse())) { return DAG.getNode(PreferredFusedOpcode, SL, VT, N0.getOperand(0), N0.getOperand(1), N1); }

// fold (fadd x, (fmul y, z)) -> (fma y, z, x) // Note: Commutes FADD operands. if (isContractableFMUL(N1) && (Aggressive || N1->hasOneUse())) { return DAG.getNode(PreferredFusedOpcode, SL, VT, N1.getOperand(0), N1.getOperand(1), N0); }

I agree that this can be enabled on Haswell. There are other checks controlled by Aggressive that may not be a good idea.

llvmbot commented 6 years ago

turning this mul + add + max into mul + fma + max would be a win, even on HSW/BDW.

It's not so much about the relative cost of add vs fma (although that does make thinking about it easier on SKL). The add has an input dependency on the mul, while the FMA does not. The mul and fma will execute concurrently, so as long as latency(fma) < latency(mul)+latency(add) this will be a win. (throughput for all for all of these is at least 1 per cycle)

llvmbot commented 6 years ago

testcase:

double foo(double A, double B, double C)
{
        double D;
        double E;

        D = A * B + C;
        E = A * B;

        if (D > E)
                return D;
        else
                return E;
}

compiler output:

        vmulsd  %xmm0, %xmm1, %xmm0
        vaddsd  %xmm2, %xmm0, %xmm1
        vmaxsd  %xmm0, %xmm1, %xmm0
        retq
chandlerc commented 6 years ago

Oh, and hat tip to Arjan for noticing this: https://twitter.com/fenruspdx/status/976135565484503040

RKSimon commented 2 years ago

Current Codegen: https://godbolt.org/z/En64ac93x

clang -g0 -O3 -march=znver1

_Z3fooddd: # @_Z3fooddd
  vfmadd231sd %xmm1, %xmm0, %xmm2 # xmm2 = (xmm0 * xmm1) + xmm2
  vmulsd %xmm1, %xmm0, %xmm0
  vmaxsd %xmm0, %xmm2, %xmm0
  retq

This is being performed entirely independently of the enableAggressiveFMAFusion subtarget option - we don't appear to perform the equivalent fold purely within llc: https://simd.godbolt.org/z/eGsxMGrKT

RKSimon commented 2 years ago

Candidate Patch: https://reviews.llvm.org/D123147