Open vfdff opened 2 weeks ago
@llvm/pr-subscribers-llvm-transforms
Author: Allen (vfdff)
I'm curious how you arrived at this case. I wouldn't expect clang to generate the llvm.fmuladd intrinsic in the fp-contract=fast state. If another front end is doing it, I wonder if they intended semantics other than what they will actually get here. Or perhaps some transformation incorrectly set the 'contract' flag.
Hi @andykaylor, do you mean the `@llvm.fmuladd.f32` should not be associated with a separate `contract` flag, or even that it should not be associated with the `fast` flag? Is that because `@llvm.fmuladd.f32` itself implies the `contract` flag, so having an explicit `contract` flag is redundant?
fmuladd implies contract inside of itself, but not with its neighbors. It's kind of like `arithmetic.fence(fadd contract (fmul contract))`. A `contract` flag on the fmuladd call itself, however, does imply contraction with its neighbors, so it's not quite the same.
```c
double fnmadd(double a[], double b[], double c[], int n) {
  double d = 0.0;
  for (int i = 0; i < n; i++)
    d += a[i] * c[i] + b[i];
  return d;
}
```
So I think that is what @andykaylor expected, i.e. clang should not generate the llvm.fmuladd intrinsic in the fp-contract=fast state, and a `fast @llvm.fmuladd.v2f64` in the IR is also unexpected, not only a `contract @llvm.fmuladd.v2f64`?
Perhaps what I said was unclear. It is perfectly legal in the IR to have the `fast` flag or the `contract` flag attached to the llvm.fmuladd intrinsic, and the canonicalization you are proposing here is good in those cases. In fact, I don't think it's necessary to even check for the presence of the `reassoc` flag. What I was saying is that, as far as I know, clang does not generate llvm.fmuladd() in the fp-contract=fast state, and so if we are seeing that in the IR in a case where clang was used as the front end, it probably indicates a bug somewhere in an intermediate optimization.

If some other front end is generating llvm.fmuladd() with the `contract` flag set, I'd want to understand what they intended to accomplish by generating such IR. It seems like they may have actually intended llvm.fma(), which should not be split.

The canonicalization is good, but I would prefer to track down the source of the unexpected IR before the change is made to hide a potential problem. It might even be good to emit an optimization remark saying we've done this, because I don't think there is any good reason for this IR to have been used.
According to https://llvm.org/docs/LangRef.html#fast-math-flags, I think only `contract` and `reassoc` are required for this transformation: a) `contract`: fusing a multiply followed by an addition into a fused multiply-add; b) `reassoc`: reassociation transformations when there may be other FP operations.
This isn't covered well by any of the existing flags. `fmuladd` is basically "`fma`, but if there's no hardware instruction, give me `fmul` + `fadd` instead". `contract` generally implies a one-way transformation to a better form, so a `contract` `fma` can't yield `fmul` + `fadd`. But also, as arsenm notes, `contract` `fmul` + `fadd` allows more transformations than just forming `fma`; `fmuladd` implies just a single possible transformation, hence why a separate intrinsic is needed.

Thinking in my head through every possibility, `contract` should be the only flag needed to allow `fmuladd` to be represented as `fma`; `reassoc` is irrelevant.
In any case, you should also have tests for the propagation of unnecessary FMF flags to the expanded instructions.
> `fmuladd` is basically "`fma`, but if there's no hardware instruction, give me `fmul` + `fadd` instead"
I don't know if I'd agree with that interpretation. I think it's more open-ended than that. The definition specifically says that fusion is not guaranteed, even if the target platform supports it. As such, transforming llvm.fmuladd into separate `fmul` and `fadd` operations is always legal (though not necessarily profitable). The question here is whether it is also legal to attach the `contract` flag to those operations after they have been split out.
Does `contract` on an intrinsic only give permission to contract that entire operation with another operation? Or does it also imply `contract` semantics with regard to the "inner" operations implied by the intrinsic? I would lean towards the latter, but I don't think it's clear.
While we're on the topic, the "overview" section of the llvm.fmuladd intrinsic says this:

> The ‘llvm.fmuladd.*’ intrinsic functions represent multiply-add expressions that can be fused if the code generator determines that *(a) the target instruction set has support for a fused operation*, and (b) that the fused operation is more efficient than the equivalent, separate pair of mul and add instructions.

I've added emphasis on the first condition, because we aren't currently respecting this condition in the constant folder. The constant folder will fold llvm.fmuladd as if it were a fused operation even if the target instruction set does not support FMA. I think this is incorrect, but I'm bringing it up here because it means that the proposed canonicalization will have a potentially observable effect. The constant folder does not evaluate separate `fmul` and `fadd` operations using fused semantics even if they have the `contract` flag set, but it does do so with llvm.fmuladd.
If the target instruction set supports FMA, constant folding these operations without intermediate rounding may be preferred. Of course, the `contract` flag does specifically allow value-changing transformations, so maybe this is OK either way when that flag is present.
> I've added emphasis on the first condition, because we aren't currently respecting this condition in the constant folder. The constant folder will fold llvm.fmuladd as if it were a fused operation even if the target instruction set does not support FMA.
I think we should just drop the first line in the description. The intention and implementation is to do whatever is contextually faster. A target may only choose to fuse specific callsites, and isn't required to consistently fuse every call. We can't define the IR in terms of ISA features, and just because an ISA has the features does not mean the fused operations are always faster.
Maybe the front end shouldn't be generating llvm.fmuladd() if the target doesn't support FMA. The problem I have here is that it's difficult to explain to a customer that they didn't get consistent numeric results because of fused operations when they're compiling for a target that doesn't support FMA. You might ask, why is the user enabling contraction if they are compiling for a target that doesn't support contraction? The reason is that the default behavior of clang is FP_CONTRACT ON.
Here's something that looks like a clear bug in the front end -- the front end will generate the llvm.fmuladd() intrinsic even if I've explicitly disabled FMA with the -mno-fma option!
https://godbolt.org/z/xab8Pa779
If you're compiling for a target that does support FMA and you have contract enabled, I think it's fine for the optimizer to constant fold it either way, but if we can be certain that the operation will never be fused at execution time, the optimizer shouldn't be folding it without intermediate rounding.
> Here's something that looks like a clear bug in the front end -- the front end will generate the llvm.fmuladd() intrinsic even if I've explicitly disabled FMA with the -mno-fma option!
I say this is not a bug. The -m flags change the behavior of the machine/codegen, and in this case no FMA instruction was emitted as requested. It was a mechanical request to not emit the FMA instruction. It was not a semantic request to not fuse operations (i.e. not a -f flag)
The C standard says (in a footnote in 6.5) this about the allowance of contracted operation: "This license is specifically intended to allow implementations to exploit fast machine instructions that combine multiple C operators." So if the user is disabling the instructions, why should the compiler still be able to contract the expression? There is no speed benefit, and it decreases numerical consistency. The standard does leave this as implementation-defined behavior and it is basically governed by the FP_CONTRACT state, but it seems to me that it would be more reasonable for the FP_CONTRACT state to default to off when the target doesn't support contracted operations, especially when that support is specifically disabled by user input.
This is getting off track from the current PR, of course. I would still like to hear more about how the situation with the `contract` flag set on an llvm.fmuladd intrinsic occurs, if @vfdff is at liberty to say.
Hi @andykaylor, sorry for the late reply. The original IR I generate from classic flang is something like https://gcc.godbolt.org/z/vdhnEq9rY (with a `fast` flag):

```llvm
%add5 = call fast double @llvm.fmuladd.f64(double %0, double %1, double %add)
```

As I think the `fast` flag is too strict for the current conversion (fmuladd -> fmul + fadd), I changed `fast` to `reassoc contract`, and that is the input IR description in this topic (PR90379).

It was discussed in the improving-ir-fast-math-semantics thread, but there is no final conclusion.
Fix https://github.com/llvm/llvm-project/issues/90379