Regression in code quality for horizontal add after 70a54bca6f

dyung commented 1 month ago

We have an internal test which tests whether the compiler generates horizontal add instructions for certain cases. Recently we noticed that one of the cases, the code generated seems to have gotten worse after a recent change 70a54bca6f.

Consider the following code:

__attribute__((noinline))
__m256d add_pd_004(__m256d a, __m256d b) {
  __m256d r = (__m256d){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, -1, 3);
}

If compiled with optimizations targeting btver2 (-S -O2 -march=btver2), the compiler previously generated the following code:

        vhaddpd ymm0, ymm0, ymm1
        ret

But after 70a54bca6f, the compiler now is generating worse code:

        vextractf128    xmm1, ymm1, 1
        vhaddpd xmm0, xmm0, xmm1
        vinsertf128     ymm0, ymm0, xmm0, 1
        ret

@alexey-bataev, this was your change, can you take a look to see if there is a way we can avoid the regression in code quality in this case?

dyung commented 1 month ago

Godbolt link: https://godbolt.org/z/71Pnv5T7T

alexey-bataev commented 1 month ago

LLVM IR is better: 18.1.0

define dso_local noundef <4 x double> @add_pd_004(double vector[4], double vector[4])(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %vecext = extractelement <4 x double> %a, i64 0
  %vecext1 = extractelement <4 x double> %a, i64 1
  %add = fadd double %vecext, %vecext1
  %0 = insertelement <4 x double> poison, double %add, i64 0
  %vecext10 = extractelement <4 x double> %b, i64 2
  %vecext11 = extractelement <4 x double> %b, i64 3
  %add12 = fadd double %vecext10, %vecext11
  %shuffle = insertelement <4 x double> %0, double %add12, i64 3
  ret <4 x double> %shuffle
}

Trunc:

define dso_local noundef <4 x double> @add_pd_004(double vector[4], double vector[4])(<4 x double> noundef %a, <4 x double> noundef %b) local_unnamed_addr {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>
  ret <4 x double> %3
}

declare void @llvm.dbg.value(metadata, metadata, metadata) #1

Looks like codegen previously recognized the pattern, but currently not

RKSimon commented 3 weeks ago

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

alexey-bataev commented 3 weeks ago

Still, looks like some kind of cost issue, will double check in a couple weeks

RKSimon commented 3 weeks ago

Yes, its purely down to the shuffle costs not recognizing that the v4f64 shuffle mask doesn't cross 128-bit lanes so the worst case cost is much higher than necessary

RKSimon commented 2 weeks ago

@alexey-bataev SLP is only calling getShuffleCost 3 times:

CostA <4 x double> <i32 0, i32 undef, i32 undef, i32 1> = 2
CostB <2 x double> <i32 0, i32 2> = 1
CostC <2 x double> <i32 1, i32 3> = 1

But then creates:

  %0 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 0, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <2 x i32> <i32 1, i32 7>
  %2 = fadd <2 x double> %0, %1
  %3 = shufflevector <2 x double> %2, <2 x double> poison, <4 x i32> <i32 0, i32 poison, i32 poison, i32 1>

The costs of %0 and %1 will be different to CostB / CostC - have we lost the cost of additional extract_subvector someplace?

alexey-bataev commented 2 weeks ago

I assume so, will check next week, after PTO

alexey-bataev commented 1 week ago

Ok, I prepared a fix but looks like even with the fixed version of the code this new vectorized form still looks preferable. The problem is that the SLP vectorizer does not know that this scalar form can be converted just to vhaddpd. So, it calculates the cost of the scalars and consider them as removed in the code (along with the insertelement instructions). And after that still considers the vectorized version of the more profitable

But this would have been even simpler:

define <4 x double> @add_pd_004(<4 x double> noundef %a, <4 x double> noundef %b)  {
entry:
  %0 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 poison, i32 poison, i32 6>
  %1 = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 1, i32 poison, i32 poison, i32 7>
  %2 = fadd <4 x double> %0, %1
  ret <4 x double> %2
}

The vectorizer currently cannot generate this. It sees 2 insertelement instructions and operates on vectors of 2x because of that.

alexey-bataev commented 1 week ago

The cost before the fix: --- !Passed Pass: slp-vectorizer Name: VectorizedList Function: test Args:

String: 'SLP vectorized with cost '
Cost: '-4'
String: ' and with tree size '
TreeSize: '4' ...

After the fix: Pass: slp-vectorizer Name: VectorizedList Function: test Args:

String: 'SLP vectorized with cost '
Cost: '-2'
String: ' and with tree size '
TreeSize: '4'

Because there are just 2 vector extracts, they add just 1 and 1 to the cost.

llvm / llvm-project

Regression in code quality for horizontal add after 70a54bca6f #94546