chandlerc opened 6 years ago
Trying to handle a related transform that does not require the aggressive TLI hook: https://reviews.llvm.org/D80801
I missed this bug and the simpler case, though. I think these patterns could be generalized so that any one-use chain of (fadd ({ fma } (fmul))) can be rearranged to fuse the leading fmul and trailing fadd, where { fma } stands for zero or more intermediate fma nodes.
From clang, this requires -ffp-contract=fast, but no other FP relaxed math AFAIK.
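As an illustration (hypothetical source, not from this report), one shape that produces such a chain under -ffp-contract=fast:

// Hypothetical example: with -ffp-contract=fast, the inner mul+add
// contracts to an fma, leaving the one-use chain
//   (fadd (fma a, b, (fmul x, y)), z)
// which could be rearranged to (fma a, b, (fma x, y, z)), fusing the
// leading fmul with the trailing fadd.
double chain(double x, double y, double a, double b, double z)
{
  double m = x * y;     // leading fmul
  double f = a * b + m; // contracts to fma(a, b, m)
  return f + z;         // trailing fadd
}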
So for this case, it looks like these two checks in DAGCombiner are relevant:
// fold (fadd (fmul x, y), z) -> (fma x, y, z)
if (isContractableFMUL(N0) && (Aggressive || N0->hasOneUse())) {
  return DAG.getNode(PreferredFusedOpcode, SL, VT,
                     N0.getOperand(0), N0.getOperand(1), N1);
}

// fold (fadd x, (fmul y, z)) -> (fma y, z, x)
// Note: Commutes FADD operands.
if (isContractableFMUL(N1) && (Aggressive || N1->hasOneUse())) {
  return DAG.getNode(PreferredFusedOpcode, SL, VT,
                     N1.getOperand(0), N1.getOperand(1), N0);
}
I agree that this can be enabled on Haswell. There are other checks controlled by Aggressive that may not be a good idea to enable.
Turning this mul + add + max into mul + fma + max would be a win, even on HSW/BDW.
It's not so much about the relative cost of add vs fma (although that does make thinking about it easier on SKL). The add has an input dependency on the mul, while the FMA does not. The mul and fma will execute concurrently, so as long as latency(fma) < latency(mul) + latency(add) this will be a win. (Throughput for all of these is at least 1 per cycle.)
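For example, plugging in the Haswell numbers from the description below (mul = 5 cycles, add = 3, fma = 5, per Agner): the unfused sequence has a 5 + 3 = 8 cycle critical path through the mul feeding the add, while the fused form's path is max(latency(mul), latency(fma)) = 5 cycles, since the mul and fma start together.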
testcase:
double foo(double A, double B, double C)
{
  double D;
  double E;
  D = A * B + C;
  E = A * B;
  if (D > E)
    return D;
  else
    return E;
}
compiler output:
vmulsd %xmm0, %xmm1, %xmm0
vaddsd %xmm2, %xmm0, %xmm1
vmaxsd %xmm0, %xmm1, %xmm0
retq
Oh, and hat tip to Arjan for noticing this: https://twitter.com/fenruspdx/status/976135565484503040
Current Codegen: https://godbolt.org/z/En64ac93x
clang -g0 -O3 -march=znver1
_Z3fooddd: # @_Z3fooddd
vfmadd231sd %xmm1, %xmm0, %xmm2 # xmm2 = (xmm0 * xmm1) + xmm2
vmulsd %xmm1, %xmm0, %xmm0
vmaxsd %xmm0, %xmm2, %xmm0
retq
This is being performed entirely independently of the enableAggressiveFMAFusion subtarget option - we don't appear to perform the equivalent fold purely within llc: https://simd.godbolt.org/z/eGsxMGrKT
Candidate Patch: https://reviews.llvm.org/D123147
Extended Description
On skylake, vector FP add has the same latency and throughput as FMA, so we should definitely enable aggressive FMA formation.
On haswell and broadwell the issue is more complex. Add has shorter latency (3 cycles vs 5 cycles according to Agner) but half the throughput (one port rather than two ports). I suspect that this is sufficient for it to be worth using FMA everywhere, but we should benchmark to confirm.
If we can enable it everywhere, a simple patch like the following will suffice:
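The patch itself isn't reproduced in this report; a minimal sketch of what unconditional enablement could look like (assuming the standard TargetLowering hook, and not necessarily matching the attached patch) is:

// Sketch only, not the actual attached patch: in X86ISelLowering.h,
// override the TargetLowering hook to always allow aggressive fusion.
bool enableAggressiveFMAFusion(EVT VT) const override { return true; }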
Otherwise, we'll need to factor the per-subtarget facility out of the AArch64 backend and use it to enable the fusion on a per-subtarget basis in the X86 backend as well.
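For reference, AArch64 gates this hook on a subtarget feature; a hypothetical X86 equivalent might look like the sketch below, where hasAggressiveFMA() is an illustrative tuning accessor, not an existing X86 flag:

// Hypothetical sketch mirroring AArch64's subtarget-gated override.
// Subtarget.hasAggressiveFMA() is illustrative only; Skylake-class
// subtargets would be the ones to set such a tuning feature.
bool X86TargetLowering::enableAggressiveFMAFusion(EVT VT) const {
  return Subtarget.hasAggressiveFMA() && VT.isFloatingPoint();
}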