Open dyung opened 1 month ago
Godbolt link: https://godbolt.org/z/71Pnv5T7T
LLVM IR from 18.1.0 (better):
define dso_local noundef <4 x double> @"add_pd_004(double vector[4], double vector[4])"(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
%vecext = extractelement <4 x double> %a, i64 0
%vecext1 = extractelement <4 x double> %a, i64 1
%add = fadd double %vecext, %vecext1
%0 = insertelement <4 x double> poison, double %add, i64 0
%vecext10 = extractelement <4 x double> %b, i64 2
%vecext11 = extractelement <4 x double> %b, i64 3
%add12 = fadd double %vecext10, %vecext11
%shuffle = insertelement <4 x double> %0, double %add12, i64 3
ret <4 x double> %shuffle
}
Trunk:
define dso_local noundef <4 x double> @"add_pd_004(double vector[4], double vector[4])"(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
%0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
%1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
%2 = fadd <2 x double> %0, %1
%3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
ret <4 x double> %3
}
Looks like codegen previously recognized this pattern, but currently it does not.
But this would have been even simpler:
define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) {
entry:
%0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
%1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
%2 = fadd <4 x double> %0, %1
ret <4 x double> %2
}
Still, this looks like some kind of cost issue; I'll double-check in a couple of weeks.
Yes, it's purely down to the shuffle costs: the cost model doesn't recognize that the v4f64 shuffle mask doesn't cross 128-bit lanes, so the reported worst-case cost is much higher than necessary.
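To make the lane-crossing observation concrete, here is a small illustrative sketch (hypothetical; this is not LLVM's cost-model code). A 256-bit vector of doubles has two 128-bit lanes of two elements each; a mask is cheap on AVX when every non-poison element stays in its destination's lane:

```python
# Hypothetical sketch, NOT LLVM's cost-model code: decide whether a
# v4f64 shuffle mask moves any element across a 128-bit lane boundary.
# A 128-bit lane holds two f64 elements, so lane(i) = i // 2.
def crosses_128bit_lanes(mask, elts_per_lane=2, vec_width=4):
    for dst, src in enumerate(mask):
        if src is None:          # poison/undef index: no constraint
            continue
        src %= vec_width         # indices >= vec_width pick the same
                                 # position within the second operand
        if dst // elts_per_lane != src // elts_per_lane:
            return True
    return False

# SLP's final widening mask <i32 0, poison, poison, i32 1> crosses lanes
# (source element 1, lane 0, lands in result lane 1):
print(crosses_128bit_lanes([0, None, None, 1]))   # True
# The simpler form's mask <i32 1, poison, poison, i32 7> stays in-lane:
print(crosses_128bit_lanes([1, None, None, 7]))   # False
```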
@alexey-bataev SLP is only calling getShuffleCost 3 times:
CostA <4 x double> <i32 0, i32 undef, i32 undef, i32 1> = 2
CostB <2 x double> <i32 0, i32 2> = 1
CostC <2 x double> <i32 1, i32 3> = 1
But then creates:
%0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
%1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
%2 = fadd <2 x double> %0, %1
%3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
The costs of %0 and %1 will be different to CostB / CostC - have we lost the cost of additional extract_subvector someplace?
I assume so, will check next week, after PTO
Ok, I prepared a fix, but it looks like even with the fixed costs this new vectorized form still looks preferable. The problem is that the SLP vectorizer does not know that this scalar form can be lowered to just a vhaddpd. So it calculates the cost of the scalars and considers them removed in the resulting code (along with the insertelement instructions), and after that it still considers the vectorized version the more profitable one.
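The trade-off described above can be sketched with illustrative arithmetic (the numbers below are made up for illustration, not real TTI costs): SLP sums the cost of the scalar instructions it would delete and compares that against the vector code it would emit, with no knowledge that the backend could have matched the whole scalar pattern as a single vhaddpd.

```python
# Illustrative numbers only, not real TTI costs: a sketch of the SLP
# profitability check described above.
scalar_cost = 4 + 2          # 4 extractelement + 2 scalar fadd, removed
insert_cost = 2              # 2 insertelement, also removed
vector_cost = 1 + 1 + 1 + 2  # two <2 x i32> shuffles, one <2 x double>
                             # fadd, one widening <4 x i32> shuffle

# SLP vectorizes when deleting the scalars saves more than the vector
# code costs; it cannot see the cheaper vhaddpd lowering of the scalars.
profit = (scalar_cost + insert_cost) - vector_cost
print(profit)  # positive, so the vectorized form still wins
```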
But this would have been even simpler:
define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b) {
entry:
%0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
%1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
%2 = fadd <4 x double> %0, %1
ret <4 x double> %2
}
The vectorizer currently cannot generate this form. It sees 2 insertelement instructions and therefore operates on 2-element vectors.
The cost before the fix:
--- !Passed
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:
After the fix:
Pass: slp-vectorizer
Name: VectorizedList
Function: test
Args:
Because there are just 2 vector extracts, they add just 1 and 1 to the cost.
We have an internal test which checks whether the compiler generates horizontal add instructions for certain cases. Recently we noticed that for one of these cases, the generated code seems to have gotten worse after a recent change, 70a54bca6f.
Consider the following code:
If compiled with optimizations targeting btver2 (-S -O2 -march=btver2), the compiler previously generated code using horizontal add instructions, but after 70a54bca6f it now generates worse code.
@alexey-bataev, this was your change, can you take a look to see if there is a way we can avoid the regression in code quality in this case?