llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[InstCombine] Canonicalize reassoc contract fmuladd to fmul + fadd #90434

Open vfdff opened 2 weeks ago

vfdff commented 2 weeks ago

According to https://llvm.org/docs/LangRef.html#fast-math-flags, I think only contract and reassoc are required for this transformation:

a) contract: allows fusing a multiply followed by an addition into a fused multiply-add.
b) reassoc: allows reassociation transformations when there may be other FP operations.

Fix https://github.com/llvm/llvm-project/issues/90379

llvmbot commented 2 weeks ago

@llvm/pr-subscribers-llvm-transforms

Author: Allen (vfdff)

Changes

According to https://llvm.org/docs/LangRef.html#fast-math-flags, I think only contract and reassoc are required for this transformation:

a) contract: allows fusing a multiply followed by an addition into a fused multiply-add.
b) reassoc: allows reassociation transformations when there may be other FP operations.

Fix https://github.com/llvm/llvm-project/issues/90379

---

Full diff: https://github.com/llvm/llvm-project/pull/90434.diff

2 Files Affected:

- (modified) llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp (+7-7)
- (modified) llvm/test/Transforms/InstCombine/fma.ll (+28)

```diff
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index e5652458f150b5..bf5472f4fd1e4f 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -2389,12 +2389,13 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
     break;
   }
   case Intrinsic::fmuladd: {
-    // Canonicalize fast fmuladd to the separate fmul + fadd.
-    if (II->isFast()) {
+    // Canonicalize reassoc contract fmuladd to the separate fmul + fadd.
+    FastMathFlags FMF = II->getFastMathFlags();
+    if (FMF.allowReassoc() && FMF.allowContract()) {
       BuilderTy::FastMathFlagGuard Guard(Builder);
-      Builder.setFastMathFlags(II->getFastMathFlags());
-      Value *Mul = Builder.CreateFMul(II->getArgOperand(0),
-                                      II->getArgOperand(1));
+      Builder.setFastMathFlags(FMF);
+      Value *Mul =
+          Builder.CreateFMul(II->getArgOperand(0), II->getArgOperand(1));
       Value *Add = Builder.CreateFAdd(Mul, II->getArgOperand(2));
       Add->takeName(II);
       return replaceInstUsesWith(*II, Add);
@@ -2402,8 +2403,7 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
 
     // Try to simplify the underlying FMul.
     if (Value *V = simplifyFMulInst(II->getArgOperand(0), II->getArgOperand(1),
-                                    II->getFastMathFlags(),
-                                    SQ.getWithInstruction(II))) {
+                                    FMF, SQ.getWithInstruction(II))) {
       auto *FAdd = BinaryOperator::CreateFAdd(V, II->getArgOperand(2));
       FAdd->copyFastMathFlags(II);
       return FAdd;
diff --git a/llvm/test/Transforms/InstCombine/fma.ll b/llvm/test/Transforms/InstCombine/fma.ll
index cf3d7f3c525a5f..42e584e9ef9a1b 100644
--- a/llvm/test/Transforms/InstCombine/fma.ll
+++ b/llvm/test/Transforms/InstCombine/fma.ll
@@ -204,6 +204,34 @@ define float @fmuladd_fneg_x_fneg_y_fast(float %x, float %y, float %z) {
   ret float %fmuladd
 }
 
+define float @fmuladd_unfold(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold(
+; CHECK-NEXT:    [[TMP1:%.*]] = fmul reassoc contract float [[X:%.*]], [[Y:%.*]]
+; CHECK-NEXT:    [[FMULADD:%.*]] = fadd reassoc contract float [[TMP1]], [[Z:%.*]]
+; CHECK-NEXT:    ret float [[FMULADD]]
+;
+  %fmuladd = call reassoc contract float @llvm.fmuladd.f32(float %x, float %y, float %z)
+  ret float %fmuladd
+}
+
+define float @fmuladd_unfold_missing_reassoc(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold_missing_reassoc(
+; CHECK-NEXT:    [[FMULADD:%.*]] = call contract float @llvm.fmuladd.f32(float [[X:%.*]], float [[Y:%.*]], float [[Z:%.*]])
+; CHECK-NEXT:    ret float [[FMULADD]]
+;
+  %fmuladd = call contract float @llvm.fmuladd.f32(float %x, float %y, float %z)
+  ret float %fmuladd
+}
+
+define float @fmuladd_unfold_missing_contract(float %x, float %y, float %z) {
+; CHECK-LABEL: @fmuladd_unfold_missing_contract(
+; CHECK-NEXT:    [[FMULADD:%.*]] = call reassoc float @llvm.fmuladd.f32(float [[X:%.*]], float [[Y:%.*]], float [[Z:%.*]])
+; CHECK-NEXT:    ret float [[FMULADD]]
+;
+  %fmuladd = call reassoc float @llvm.fmuladd.f32(float %x, float %y, float %z)
+  ret float %fmuladd
+}
+
 define float @fmuladd_unary_fneg_x_unary_fneg_y_fast(float %x, float %y, float %z) {
 ; CHECK-LABEL: @fmuladd_unary_fneg_x_unary_fneg_y_fast(
 ; CHECK-NEXT:    [[TMP1:%.*]] = fmul fast float [[X:%.*]], [[Y:%.*]]
```
andykaylor commented 2 weeks ago

I'm curious how you arrived at this case. I wouldn't expect clang to generate the llvm.fmuladd intrinsic in the fp-contract=fast state. If another front end is doing it, I wonder if they intended semantics other than what they will actually get here. Or perhaps some transformation incorrectly set the 'contract' flag.

vfdff commented 1 week ago

Hi @andykaylor, do you mean that @llvm.fmuladd.f32 should not carry a separate contract flag, or even that it should not carry the fast flag? Is that because @llvm.fmuladd.f32 itself implies the contract flag, so an explicit contract flag would be redundant?

arsenm commented 1 week ago

Hi @andykaylor, do you mean that @llvm.fmuladd.f32 should not carry a separate contract flag, or even that it should not carry the fast flag? Is that because @llvm.fmuladd.f32 itself implies the contract flag, so an explicit contract flag would be redundant?

fmuladd implies contract inside of itself, but not with its neighbors. It's kind of like arithmetic.fence(fadd contract (fmul contract)). A contract flag on the fmuladd instruction itself, by contrast, implies contraction with its neighbors, so the two are not quite the same.

vfdff commented 1 week ago

So I think that is what @andykaylor expected, i.e. clang should not generate the llvm.fmuladd intrinsic in the fp-contract=fast state. Is a fast @llvm.fmuladd.v2f64 in the IR then also unexpected, not only the contract @llvm.fmuladd.v2f64?

andykaylor commented 1 week ago

Perhaps what I said was unclear. It is perfectly legal in the IR to have the fast flag or the contract flag attached to the llvm.fmuladd intrinsic, and the canonicalization you are proposing here is good in those cases. In fact, I don't think it's necessary to even check for the presence of the reassoc flag. What I was saying is that, as far as I know, clang does not generate llvm.fmuladd() in the fp-contract=fast state, so if we are seeing that in the IR in a case where clang was used as the front end, it probably indicates a bug somewhere in an intermediate optimization.

If some other front end is generating llvm.fmuladd() with the contract flag set, I'd want to understand what they intended to accomplish by generating such IR. It seems like they may have actually intended llvm.fma() which should not be split.

The canonicalization is good, but I would prefer to track down the source of the unexpected IR before the change is made to hide a potential problem. It might even be good to emit an optimization remark saying we've done this, because I don't think there is any good reason for this IR to have been used.

jcranmer-intel commented 1 week ago

According to https://llvm.org/docs/LangRef.html#fast-math-flags, I think only contract and reassoc are required for this transformation. a) contract: fusing a multiply followed by an addition into a fused multiply-add b) reassoc: reassociation transformations when there may be other fp operations.

This isn't covered well by any of the existing flags. fmuladd is basically "fma, but if there's no hardware instruction, give me fmul + fadd instead". contract generally implies a one-way transformation to a better form, so contract fma can't yield fmul + fadd. But also, as arsenm notes, contract fmul + fadd also allows more transformations than just forming fma; fmuladd implies just a single possible transformation, hence why a separate intrinsic is needed.

Thinking in my head through every possibility, contract should be the only flag needed to allow fmuladd to be represented as fma; reassoc is irrelevant.

In any case, you should also have tests for the propagation of the other, non-required fast-math flags to the expanded instructions.

andykaylor commented 1 week ago

fmuladd is basically "fma, but if there's no hardware instruction, give me fmul + fadd instead"

I don't know if I'd agree with that interpretation. I think it's more open-ended than that. The definition specifically says that fusion is not guaranteed, even if the target platform supports it. As such, transforming llvm.fmuladd into separate fmul and fadd operations is always legal (though not necessarily profitable). The question here is whether it is also legal to attach the contract flag to those operations after they have been split out.

Does contract on an intrinsic only give permission to contract that entire operation with another operation? Or does it also imply contract semantics with regard to the "inner" operations implied by the intrinsic? I would lean towards the latter, but I don't think it's clear.

While we're on the topic, the "overview" section of the llvm.fmuladd intrinsic says this:

The ‘llvm.fmuladd.*’ intrinsic functions represent multiply-add expressions that can be fused if the code generator determines that (a) the target instruction set has support for a fused operation, and (b) that the fused operation is more efficient than the equivalent, separate pair of mul and add instructions.

I've added emphasis on the first condition, because we aren't currently respecting this condition in the constant folder. The constant folder will fold llvm.fmuladd as if it were a fused operation even if the target instruction set does not support FMA. I think this is incorrect, but I'm bringing it up here because it means that the proposed canonicalization will have a potentially observable effect. The constant folder does not evaluate separate fmul and fadd operations using fused semantics even if they have the contract flag set, but it does do so with llvm.fmuladd.

If the target instruction set supports FMA, constant folding these operations without intermediate rounding may be preferred. Of course, the contract flag does specifically allow value-changing transformations, so maybe this is OK either way when that flag is present.

arsenm commented 1 week ago

I've added emphasis on the first condition, because we aren't currently respecting this condition in the constant folder. The constant folder will fold llvm.fmuladd as if it were a fused operation even if the target instruction set does not support FMA.

I think we should just drop the first line in the description. The intention and implementation is to do whatever is contextually faster. A target may only choose to fuse specific callsites, and isn't required to consistently fuse every call. We can't define the IR in terms of ISA features, and just because an ISA has the features does not mean the fused operations are always faster.

andykaylor commented 1 week ago

I've added emphasis on the first condition, because we aren't currently respecting this condition in the constant folder. The constant folder will fold llvm.fmuladd as if it were a fused operation even if the target instruction set does not support FMA.

I think we should just drop the first line in the description. The intention and implementation is to do whatever is contextually faster. A target may only choose to fuse specific callsites, and isn't required to consistently fuse every call. We can't define the IR in terms of ISA features, and just because an ISA has the features does not mean the fused operations are always faster.

Maybe the front end shouldn't be generating llvm.fmuladd() if the target doesn't support FMA. The problem I have here is that it's difficult to explain to a customer that they didn't get consistent numeric results because of fused operations when they're compiling for a target that doesn't support FMA. You might ask, why is the user enabling contraction if they are compiling for a target that doesn't support contraction? The reason is that the default behavior of clang is FP_CONTRACT ON.

Here's something that looks like a clear bug in the front end -- the front end will generate the llvm.fmuladd() intrinsic even if I've explicitly disabled FMA with the -mno-fma option!

https://godbolt.org/z/xab8Pa779

If you're compiling for a target that does support FMA and you have contract enabled, I think it's fine for the optimizer to constant fold it either way, but if we can be certain that the operation will never be fused at execution time, the optimizer shouldn't be folding it without intermediate rounding.

arsenm commented 1 week ago

Here's something that looks like a clear bug in the front end -- the front end will generate the llvm.fmuladd() intrinsic even if I've explicitly disabled FMA with the -mno-fma option!

I say this is not a bug. The -m flags change the behavior of the machine/codegen, and in this case no FMA instruction was emitted as requested. It was a mechanical request to not emit the FMA instruction. It was not a semantic request to not fuse operations (i.e. not a -f flag)

andykaylor commented 1 week ago

Here's something that looks like a clear bug in the front end -- the front end will generate the llvm.fmuladd() intrinsic even if I've explicitly disabled FMA with the -mno-fma option!

I say this is not a bug. The -m flags change the behavior of the machine/codegen, and in this case no FMA instruction was emitted as requested. It was a mechanical request to not emit the FMA instruction. It was not a semantic request to not fuse operations (i.e. not a -f flag)

The C standard says (in a footnote in 6.5) this about the allowance for contracted operations: "This license is specifically intended to allow implementations to exploit fast machine instructions that combine multiple C operators." So if the user is disabling those instructions, why should the compiler still be able to contract the expression? There is no speed benefit, and it decreases numerical consistency. The standard does leave this as implementation-defined behavior, basically governed by the FP_CONTRACT state, but it seems to me that it would be more reasonable for the FP_CONTRACT state to default to off when the target doesn't support contracted operations, especially when that support has been specifically disabled by user input.

This is getting off track from the current PR, of course. I would still like to hear more about how the situation with the contract flag set on an llvm.fmuladd intrinsic occurs, if @vfdff is at liberty to say.

vfdff commented 1 week ago

Hi @andykaylor, sorry for the late reply. The original IR I generated from classic flang is something like https://gcc.godbolt.org/z/vdhnEq9rY (with a fast flag):

```llvm
%add5 = call fast double @llvm.fmuladd.f64(double %0, double %1, double %add)
```

Since I think the fast flag is stricter than the current conversion (fmuladd -> fmul + fadd) requires, I relaxed fast to reassoc contract, and that is the input IR described in this topic (PR90379).

vfdff commented 3 days ago

This was discussed in the improving-ir-fast-math-semantics thread, but there is no final conclusion yet.