llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[SLPVectorizer] 2-way parallel vectorization hurts perf on x86 #35628

Closed rotateright closed 6 years ago

rotateright commented 6 years ago
Bugzilla Link 36280
Resolution FIXED
Resolved on Feb 27, 2018 11:01
Version trunk
OS All
Attachments himeno.c source file
CC @alexey-bataev, @dtemirbulatov, @eastig, @fhahn, @RKSimon
Fixed by commit(s) 326133

Extended Description

This is a reduction from the Himeno benchmark (attached). In the actual benchmark, we lose 6% performance on AMD Jaguar when using SLP vectorization because the extra insert/extract ops cost more than the savings from combining 2 scalar loads and 2 scalar fmul.

(p[1] * x) + z + (p[2] * y)
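
For reference, here is a source-level sketch of that computation (hypothetical code written for this report; the attached himeno.c is the actual benchmark source):

  // Hypothetical C-level view of the reduced IR below; names are invented.
  float jacobi(const float *p, float x, float y, float z) {
    return ((p[1] * x) + z) + (p[2] * y);
  }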


target triple = "x86_64-unknown-unknown"

define float @jacobi(float* %p, float %x, float %y, float %z) {
  %gep1 = getelementptr float, float* %p, i64 1
  %gep2 = getelementptr float, float* %p, i64 2
  %p1 = load float, float* %gep1
  %p2 = load float, float* %gep2
  %mul1 = fmul float %p1, %x
  %mul2 = fmul float %p2, %y
  %add1 = fadd float %mul1, %z
  %add2 = fadd float %mul2, %add1
  ret float %add2
}

$ ./opt -slp-vectorizer minnn.ll -S

define float @jacobi(float* %p, float %x, float %y, float %z) {
  %gep1 = getelementptr float, float* %p, i64 1
  %gep2 = getelementptr float, float* %p, i64 2
  %1 = bitcast float* %gep1 to <2 x float>*
  %2 = load <2 x float>, <2 x float>* %1, align 4
  %3 = insertelement <2 x float> undef, float %x, i32 0
  %4 = insertelement <2 x float> %3, float %y, i32 1
  %5 = fmul <2 x float> %4, %2
  %6 = extractelement <2 x float> %5, i32 0
  %add1 = fadd float %6, %z
  %7 = extractelement <2 x float> %5, i32 1
  %add2 = fadd float %7, %add1
  ret float %add2
}


The x86 AVX code looks like this without SLP:

  vmulss 4(%rdi), %xmm0, %xmm0
  vmulss 8(%rdi), %xmm1, %xmm1
  vaddss %xmm2, %xmm0, %xmm0
  vaddss %xmm0, %xmm1, %xmm0

And this with SLP:

  vmovsd 4(%rdi), %xmm3               # xmm3 = mem[0],zero
  vinsertps $16, %xmm1, %xmm0, %xmm0  # xmm0 = xmm0[0],xmm1[0],xmm0[2,3]
  vmulps %xmm3, %xmm0, %xmm0
  vaddss %xmm2, %xmm0, %xmm1
  vmovshdup %xmm0, %xmm0              # xmm0 = xmm0[1,1,3,3]
  vaddss %xmm1, %xmm0, %xmm0

rotateright commented 6 years ago

We can close this bug again because we changed the costs for all recent x86 with: https://reviews.llvm.org/rL326133

I'm still confused about how any of this works, as seen in: https://reviews.llvm.org/D43769

rotateright commented 6 years ago

Created attachment 19944 [details] thumbv8-loop-vec-test.ll

A regression test for https://reviews.llvm.org/D43079 The test is based on CFP2006 482.sphinx3.

Added here: https://reviews.llvm.org/rL326221

rotateright commented 6 years ago

Created attachment 19930 [details] tsc-s352.ll

Added to the regression tests with: https://reviews.llvm.org/rL326217

(I'm not trying very hard to minimize the tests, so if you see any extra stuff that can be hacked away, please do.)

eastig commented 6 years ago

thumbv8-loop-vec-test.ll A regression test for https://reviews.llvm.org/D43079 The test is based on CFP2006 482.sphinx3.

To run:

$ opt -loop-vectorize -o - thumbv8-loop-vec-test.ll -S

eastig commented 6 years ago

In a comment on D43079, Adam wrote that they saw a 482.sphinx3 regression.

eastig commented 6 years ago

Thanks. There was a comment that there were regressions on ARM (as opposed to AArch64 I think). I tried the same matmul test using an ARM triple and 'swift' as the target CPU, and it doesn't SLP vectorize today.

Do you expect vectorization for that CPU, or is there some other 32-bit CPU that I can substitute with these examples to create ARM/Thumb regression tests?

I think I will have some tests for AArch32 Thumb2; at least one might be based on CFP2006 482.sphinx3.

rotateright commented 6 years ago

Thanks. There was a comment that there were regressions on ARM (as opposed to AArch64 I think). I tried the same matmul test using an ARM triple and 'swift' as the target CPU, and it doesn't SLP vectorize today.

Do you expect vectorization for that CPU, or is there some other 32-bit CPU that I can substitute with these examples to create ARM/Thumb regression tests?

eastig commented 6 years ago

tsc-s352.ll A regression test for https://reviews.llvm.org/D43079 The test is based on http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/MultiSource/Benchmarks/TSVC/LoopRerolling-flt/tsc.c?view=log

TSVC is a test suite for evaluation of vectorisers.

The test uses floats and has a loop nest.

$ diff tsc.good.ll tsc.bad.ll
48a49
>   %6 = load float, float* %arrayidx18, align 4, !tbaa !2
50,67c51,66
<   %6 = add nuw nsw i64 %indvars.iv, 3
<   %arrayidx26 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 0, i64 %6
<   %7 = bitcast float* %arrayidx18 to <2 x float>*
<   %8 = load <2 x float>, <2 x float>* %7, align 4, !tbaa !2
<   %arrayidx29 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 3, i64 %6
<   %9 = bitcast float* %arrayidx21 to <2 x float>*
<   %10 = load <2 x float>, <2 x float>* %9, align 4, !tbaa !2
<   %11 = fmul <2 x float> %8, %10
<   %12 = extractelement <2 x float> %11, i32 0
<   %add23 = fadd float %add15, %12
<   %13 = extractelement <2 x float> %11, i32 1
<   %add31 = fadd float %add23, %13
<   %14 = add nuw nsw i64 %indvars.iv, 4
<   %arrayidx34 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 0, i64 %14
<   %15 = load float, float* %arrayidx34, align 4, !tbaa !2
<   %arrayidx37 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 3, i64 %14
<   %16 = load float, float* %arrayidx37, align 4, !tbaa !2
<   %mul38 = fmul float %15, %16
---
>   %7 = load float, float* %arrayidx21, align 4, !tbaa !2
>   %mul22 = fmul float %6, %7
>   %add23 = fadd float %add15, %mul22
>   %8 = add nuw nsw i64 %indvars.iv, 3
>   %arrayidx26 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 0, i64 %8
>   %9 = load float, float* %arrayidx26, align 4, !tbaa !2
>   %arrayidx29 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 3, i64 %8
>   %10 = load float, float* %arrayidx29, align 4, !tbaa !2
>   %mul30 = fmul float %9, %10
>   %add31 = fadd float %add23, %mul30
>   %11 = add nuw nsw i64 %indvars.iv, 4
>   %arrayidx34 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 0, i64 %11
>   %12 = load float, float* %arrayidx34, align 4, !tbaa !2
>   %arrayidx37 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData* @global_data, i64 0, i32 3, i64 %11
>   %13 = load float, float* %arrayidx37, align 4, !tbaa !2
>   %mul38 = fmul float %12, %13

Sanjay, I think there will be more test cases.

eastig commented 6 years ago

Created attachment 19922 [details] AArch64 Cortex-A53 regression test

A regression test for https://reviews.llvm.org/D43079. The test is based on http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c

Thanks! I added this to the regression tests: https://reviews.llvm.org/rL325717

Feel free to adjust the comments or the test if I missed something. If we can get more like this (reductions based on the SPEC regressions), that would be excellent. That way, I won't get caught with just x86 test diffs next time. :)

LGTM.

I'll have a look if SPEC regressions have a different pattern.

rotateright commented 6 years ago

Created attachment 19922 [details] AArch64 Cortex-A53 regression test

A regression test for https://reviews.llvm.org/D43079. The test is based on http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c

Thanks! I added this to the regression tests: https://reviews.llvm.org/rL325717

Feel free to adjust the comments or the test if I missed something. If we can get more like this (reductions based on the SPEC regressions), that would be excellent. That way, I won't get caught with just x86 test diffs next time. :)

eastig commented 6 years ago

AArch64 Cortex-A53 regression test A regression test for https://reviews.llvm.org/D43079 The test is based on http://www.llvm.org/viewvc/llvm-project/test-suite/trunk/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c

--- Before D43079 ---

$ opt -slp-vectorizer -S -o - arm64-a53-test.ll
...
define dso_local void @wrap_mul4(double* nocapture %Out, [2 x double]* nocapture readonly %A, [4 x double]* nocapture readonly %B) local_unnamed_addr #0 {
entry:
  %arrayidx1.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 0, i64 0
  %0 = load double, double* %arrayidx1.i, align 8, !tbaa !2
  %arrayidx3.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 0
  %arrayidx5.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 0, i64 1
  %1 = load double, double* %arrayidx5.i, align 8, !tbaa !2
  %arrayidx7.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 0
  %arrayidx13.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 1
  %2 = bitcast double* %arrayidx3.i to <2 x double>*
  %3 = load <2 x double>, <2 x double>* %2, align 8, !tbaa !2
  %4 = insertelement <2 x double> undef, double %0, i32 0
  %5 = insertelement <2 x double> %4, double %0, i32 1
  %6 = fmul <2 x double> %5, %3
  %arrayidx18.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 1
  %7 = bitcast double* %arrayidx7.i to <2 x double>*
  %8 = load <2 x double>, <2 x double>* %7, align 8, !tbaa !2
  %9 = insertelement <2 x double> undef, double %1, i32 0
  %10 = insertelement <2 x double> %9, double %1, i32 1
  %11 = fmul <2 x double> %10, %8
  %12 = fadd <2 x double> %6, %11
  %arrayidx25.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 2
  %arrayidx30.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 2
  %arrayidx37.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 3
  %13 = bitcast double* %arrayidx25.i to <2 x double>*
  %14 = load <2 x double>, <2 x double>* %13, align 8, !tbaa !2
  %15 = fmul <2 x double> %5, %14
  %arrayidx42.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 3
  %16 = bitcast double* %arrayidx30.i to <2 x double>*
  %17 = load <2 x double>, <2 x double>* %16, align 8, !tbaa !2
  %18 = fmul <2 x double> %10, %17
  %19 = fadd <2 x double> %15, %18
  %arrayidx47.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 1, i64 0
  %20 = load double, double* %arrayidx47.i, align 8, !tbaa !2
  %arrayidx52.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 1, i64 1
  %21 = load double, double* %arrayidx52.i, align 8, !tbaa !2
  %22 = insertelement <2 x double> undef, double %20, i32 0
  %23 = insertelement <2 x double> %22, double %20, i32 1
  %24 = fmul <2 x double> %3, %23
  %25 = insertelement <2 x double> undef, double %21, i32 0
  %26 = insertelement <2 x double> %25, double %21, i32 1
  %27 = fmul <2 x double> %8, %26
  %28 = fadd <2 x double> %24, %27
  %29 = fmul <2 x double> %14, %23
  %30 = fmul <2 x double> %17, %26
  %31 = fadd <2 x double> %29, %30
  ...

--- After D43079 ---

$ opt -slp-vectorizer -S -o - arm64-a53-test.ll
...
define dso_local void @wrap_mul4(double* nocapture %Out, [2 x double]* nocapture readonly %A, [4 x double]* nocapture readonly %B) local_unnamed_addr #0 {
entry:
  %arrayidx1.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 0, i64 0
  %0 = load double, double* %arrayidx1.i, align 8, !tbaa !2
  %arrayidx3.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 0
  %1 = load double, double* %arrayidx3.i, align 8, !tbaa !2
  %mul.i = fmul double %0, %1
  %arrayidx5.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 0, i64 1
  %2 = load double, double* %arrayidx5.i, align 8, !tbaa !2
  %arrayidx7.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 0
  %3 = load double, double* %arrayidx7.i, align 8, !tbaa !2
  %mul8.i = fmul double %2, %3
  %add.i = fadd double %mul.i, %mul8.i
  %arrayidx13.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 1
  %4 = load double, double* %arrayidx13.i, align 8, !tbaa !2
  %mul14.i = fmul double %0, %4
  %arrayidx18.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 1
  %5 = load double, double* %arrayidx18.i, align 8, !tbaa !2
  %mul19.i = fmul double %2, %5
  %add20.i = fadd double %mul14.i, %mul19.i
  %arrayidx25.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 2
  %6 = load double, double* %arrayidx25.i, align 8, !tbaa !2
  %mul26.i = fmul double %0, %6
  %arrayidx30.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 2
  %7 = load double, double* %arrayidx30.i, align 8, !tbaa !2
  %mul31.i = fmul double %2, %7
  %add32.i = fadd double %mul26.i, %mul31.i
  %arrayidx37.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 0, i64 3
  %8 = load double, double* %arrayidx37.i, align 8, !tbaa !2
  %mul38.i = fmul double %0, %8
  %arrayidx42.i = getelementptr inbounds [4 x double], [4 x double]* %B, i64 1, i64 3
  %9 = load double, double* %arrayidx42.i, align 8, !tbaa !2
  %mul43.i = fmul double %2, %9
  %add44.i = fadd double %mul38.i, %mul43.i
  %arrayidx47.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 1, i64 0
  %10 = load double, double* %arrayidx47.i, align 8, !tbaa !2
  %mul50.i = fmul double %1, %10
  %arrayidx52.i = getelementptr inbounds [2 x double], [2 x double]* %A, i64 1, i64 1
  %11 = load double, double* %arrayidx52.i, align 8, !tbaa !2
  %mul55.i = fmul double %3, %11
  %add56.i = fadd double %mul50.i, %mul55.i
  %mul62.i = fmul double %4, %10
  %mul67.i = fmul double %5, %11
  %add68.i = fadd double %mul62.i, %mul67.i
  %mul74.i = fmul double %6, %10
  %mul79.i = fmul double %7, %11
  %add80.i = fadd double %mul74.i, %mul79.i
  %mul86.i = fmul double %8, %10
  %mul91.i = fmul double %9, %11
  %add92.i = fadd double %mul86.i, %mul91.i
  ...

rotateright commented 6 years ago

Had to revert, so reopening: https://reviews.llvm.org/rL325658

rotateright commented 6 years ago

To summarize some of the discussion in: https://reviews.llvm.org/D43079 https://reviews.llvm.org/D42981

The SLP vectorizer is not calculating cost by summing uops, so my analysis in comment 4 is wrong. Why the SLP vectorizer calculates cost the way it does is another question.

https://reviews.llvm.org/rL325515

This commit fixes the problem accidentally, but we have a regression test to verify that the motivating case for this bug does not regress.

...so resolving as fixed.

alexey-bataev commented 6 years ago

By the way, I think that if the load instruction is potentially foldable and it is used only once, we can assume it is going to be folded.
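
A minimal sketch of what such a heuristic could look like (illustrative only; this helper is invented for this discussion and is not the actual SLP vectorizer code):

  #include "llvm/IR/InstrTypes.h"    // llvm::BinaryOperator
  #include "llvm/IR/Instructions.h"  // llvm::LoadInst

  // Hypothetical: treat a single-use load whose only user is an arithmetic op
  // as free, assuming the backend folds it into that op
  // (e.g. vmulss 4(%rdi), %xmm0, %xmm0 on x86).
  static bool isLikelyFoldedLoad(const llvm::LoadInst *LI) {
    if (!LI->hasOneUse())
      return false;
    return llvm::isa<llvm::BinaryOperator>(LI->user_back());
  }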

alexey-bataev commented 6 years ago

No, in this particular case the loads do not exist in hardware. They are completely masked by the mul operations, and the cost of these loads is already accounted for when we calculate the cost of the 2 scalar mul operations. Yes, maybe it is too early to speculate about load folding, but we already do this for other operations. For example, the cost model may consider some casting operations as free, though we don't know how they will be represented in the resulting code.

rotateright commented 6 years ago

Sanjay, your change will fix this particular case, but not the whole problem. We have exactly the same problem not only with floats, but also with integers.

I agree - my suggestion will only attempt to make the cost of FP ops more realistic. The fact that it helps this particular bug might be a fluke, but I'll take it. :)

We don't have scalar loads in the resulting code, they are part of the vmulss instructions.

I don't agree with this reasoning. The loads clearly exist in IR (and in hardware). The fact that they might get folded should be irrelevant in this cost model.

In fact, we can't assume anything about load folding this early in the pipeline. The backend can choose to fold or not, or it may not even be possible if the operand is not aligned correctly.

alexey-bataev commented 6 years ago

Sanjay, your change will fix this particular case, but not the whole problem. We have exactly the same problem not only with floats, but also with integers. The main problem is that we incorrectly calculate the cost of scalar loads. In this particular example, the cost of the loads is incorrect. We see that the cost is -1, but it should be 1. We don't have scalar loads in the resulting code, they are part of the vmulss instructions. But we take the cost of the %p1 and %p2 loads as 2, though actually it is 0. The cost of the vector load is 1. So the resulting cost of vectorizing the loads must be 1, not -1.
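
To spell out the arithmetic behind that claim (illustrative numbers only; the real values come from the TargetTransformInfo cost hooks):

  // Cost delta for replacing the two scalar loads with one <2 x float> load.
  int ScalarLoadCost = 2;  // what the model charges for %p1 and %p2 today
  int VectorLoadCost = 1;  // one <2 x float> load
  int CurrentDelta = VectorLoadCost - ScalarLoadCost;  // -1, what SLP sees now
  // If both scalar loads are assumed to fold into vmulss, they cost 0,
  // so replacing them with a real vector load is a net loss:
  int FoldedDelta = VectorLoadCost - 0;                // +1, the claim above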

rotateright commented 6 years ago

So the net cost should be 0 - we remove 2 scalar ops, and add 2 for insert/extract.

I have confirmed that the cost model change I have suggested here causes us to not SLP vectorize the key loop in the himeno benchmark.

This restores the lost performance (6%) on my AMD Jaguar test system. Of course, if the cost model says the 2 sequences have the same cost, but the vector code really costs that much more, then we're still not very accurate in our cost model.

rotateright commented 6 years ago

Let me list what I see as I step through this case with debug output:

SLP: Adding cost -2 for bundle that starts with %mul1 = fmul float %p1, %x.
SLP: Adding cost 1 for bundle that starts with float %x.
SLP: Adding cost -1 for bundle that starts with %p1 = load float, float* %gep1, align 2.

That first line shows the bug - we correctly see that we are going to eliminate 1 fmul by vectorizing. The cost savings of killing one fmul should be exactly 1, not 2.

The insert, extract, and load calculations look right to me:
  +1 insert (one of them is free)
  +1 extract (one of them is free)
  -1 for removing a scalar load

So the net cost should be 0 - we remove 2 scalar ops, and add 2 for insert/extract.

rotateright commented 6 years ago

I don't understand all of the cases in D42981 yet, but I think this particular problem is caused by a long-standing and known bug in the throughput cost model:

// Assume that floating point arithmetic operations cost twice as much as
// integer operations.
unsigned OpCost = (IsFloat ? 2 : 1);

This makes no sense. If we change this so all arithmetic ops have cost 1 by default (because all basic math ops have the same throughput), we don't try to vectorize this example.

I have no idea yet what that change would do to other examples.
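
A sketch of what the suggested change amounts to (an illustration of the idea only, not a committed patch):

  // Illustration of the suggested default: charge basic FP and integer
  // arithmetic the same throughput cost instead of 2x for FP.
  unsigned getDefaultArithmeticOpCost(bool IsFloat) {
    (void)IsFloat;  // previously: return IsFloat ? 2 : 1;
    return 1;
  }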

alexey-bataev commented 6 years ago

Patch https://reviews.llvm.org/D42981

alexey-bataev commented 6 years ago

Have a patch for it already, but need to fix the cost model a bit.

rotateright commented 6 years ago

assigned to @alexey-bataev