llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

short fixed count loops are not vectorized without #pragma hints #28808

Open llvmbot opened 8 years ago

llvmbot commented 8 years ago
Bugzilla Link 28434
Version 3.8
OS All
Depends On llvm/llvm-project#28835
Reporter LLVM Bugzilla Contributor
CC @efriedma-quic,@hfinkel,@RKSimon,@rotateright

Extended Description

see https://godbolt.org/g/bLIzA1

void f1(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) { q[i] = p[i]; }
}

void f2(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) { q[i] = p[i]; }
}

void f3(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
  for (unsigned int i = 0; i < 4; ++i) { q[i] = p[i]; }
}

void f4(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 4; ++i) { q[i] = p[i]; }
}

void f5(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 40; ++i) { q[i] = p[i]; }
}

f1 is correctly vectorized when vectorization is forced and loop unrolling disabled:

        cvtdq2ps (%rsi), %xmm0
        movaps %xmm0, (%rdi)

f2 is not correctly vectorized even when vectorization is forced (because the loop is fully unrolled before vectorization?):

        cvtsi2ssl (%rsi), %xmm0
        movss %xmm0, (%rdi)
        xorps %xmm0, %xmm0
        cvtsi2ssl 4(%rsi), %xmm0
        movss %xmm0, 4(%rdi)
        xorps %xmm0, %xmm0
        ...

f3 is not correctly vectorized when loop unrolling is disabled but vectorization is not forced:

        xorl %eax, %eax
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
        xorps %xmm0, %xmm0
        cvtsi2ssl (%rsi,%rax,4), %xmm0
        movss %xmm0, (%rdi,%rax,4)
        incq %rax
        cmpq $4, %rax
        jne .LBB2_1

f4 without any #pragma is not vectorized; it is unrolled and produces the same code as f2.

f5 without any #pragma is correctly vectorized and unrolled, but only when the loop count is ~40 or higher:

        cvtdq2ps (%rsi), %xmm0
        movaps %xmm0, (%rdi)
        cvtdq2ps 16(%rsi), %xmm0
        movaps %xmm0, 16(%rdi)
        ...

llvmbot commented 3 years ago

mentioned in issue llvm/llvm-project#28835

llvmbot commented 8 years ago

Also, I should learn to read - the bug is specifically about short fixed-count loops. For that case, to clarify what I wrote before - we will not vectorize if we have, like in f3(), the combination of:
1) #pragma clang loop unroll(disable)
2) no #pragma clang loop vectorize(enable)
3) a loop count < 16.
This will not be caught by either the loop vectorizer (due to (2) + (3)) or the SLP vectorizer (due to (1)).

That is more or less by design, although I'm not 100% sure this is really the desired behavior.

llvmbot commented 8 years ago

2) In some sense the cost model is actually correct, because we don't generate non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since SSE2. An ISel issue?

Yes, probably something to do with type legalization, since <2 x i32> isn't legal.

Right. This would probably require custom legalization of some sort.

efriedma-quic commented 8 years ago

2) In some sense the cost model is actually correct, because we don't generate non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since SSE2. An ISel issue?

Yes, probably something to do with type legalization, since <2 x i32> isn't legal.

llvmbot commented 8 years ago

First, we need to be careful with loop bounds of 2 or 4. The loop vectorizer, unless forced with a pragma, ignores all loops with trip count < 16. So if we get auto-vectorization for low bounds, it's from unrolling + SLP vectorizer.

Regarding the "double to int" cases, this is purely an SSE issue. If the loop bound is high enough (>= 16), or with the pragma, we vectorize for AVX and above.

For:

void f1(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 16; ++i) { q[i] = p[i]; }
}

With -mavx -fno-unroll-loops we get:

.LBB0_1:                                # %vector.body
        vcvttpd2dqy (%rsi,%rax,8), %xmm0
        vmovapd %xmm0, (%rdi,%rax,4)
        addq $4, %rax
        cmpq $16, %rax
        jne .LBB0_1

For SSE4.2 and below, this is weirder:
1) We don't vectorize because the cost model considers the vector convert too expensive. This is probably incorrect, and is probably, again, related to the scalar convert being too cheap (cost of 1 for scalar convert, 6 for vector...)
2) In some sense the cost model is actually correct, because we don't generate non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since SSE2. An ISel issue?

llvmbot commented 8 years ago

Same issue with int to double conversion:

void f1(double * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 2; ++i) { q[i] = p[i]; }
}

void f4(double * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) { q[i] = p[i]; }
}

f1(double, int):                        # @f1(double, int)
        movq (%rsi), %xmm0              # xmm0 = mem[0],zero
        cvtdq2pd %xmm0, %xmm0
        movaps %xmm0, (%rdi)

f4(double, int):                        # @f4(double, int)
        cvtsi2sdl (%rsi), %xmm0
        movsd %xmm0, (%rdi)
        xorps %xmm0, %xmm0
        cvtsi2sdl 4(%rsi), %xmm0
        movsd %xmm0, 8(%rdi)

Also, the vectorized version should probably skip the register temporary:

        cvtdq2pd (%rsi), %xmm0
        movaps %xmm0, (%rdi)

double to int conversion:

void f1(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 2; ++i) { q[i] = p[i]; }
}

void f4(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) { q[i] = p[i]; }
}

f1 produces horrible code:

f1(int, double):                        # @f1(int, double)
        movapd (%rsi), %xmm0
        cvttsd2si %xmm0, %rax
        movd %rax, %xmm1
        shufpd $1, %xmm0, %xmm0         # xmm0 = xmm0[1,0]
        cvttsd2si %xmm0, %rax
        movd %rax, %xmm0
        punpcklqdq %xmm0, %xmm1         # xmm1 = xmm1[0],xmm0[0]
        pshufd $232, %xmm1, %xmm0       # xmm0 = xmm1[0,2,2,3]
        movq %xmm0, (%rdi)

f4 is better but not vectorized:

f4(int, double):                        # @f4(int, double)
        cvttsd2si (%rsi), %eax
        movl %eax, (%rdi)
        cvttsd2si 8(%rsi), %eax
        movl %eax, 4(%rdi)

Correctly vectorized, this should be:

        cvttpd2dq (%rsi), %xmm0
        movq %xmm0, (%rdi)

hfinkel commented 8 years ago

Regardless of whether the vector op is 5 or 15, is the scalar really a 1? ...

This brings up the larger question of what exactly are we trying to model with these numbers: latency, throughput, something else?

In theory, the numbers represent throughput. The idea being that we handle latency hiding with the interleaving factor (which is computed separately). The "user" cost model used by the unroller, etc. is more subtle -- by which I mean less well defined :(

rotateright commented 8 years ago

Regardless of whether the vector op is 5 or 15, is the scalar really a 1?

That's a good question.

The scalar version of sitofp (i32 --> float) for x86 would be cvtsi2ss. This was apparently 1.5 reciprocal throughput on SandyBridge, but 3 on Merom, Nehalem, Haswell, Skylake? Seems unlikely that it magically improved for one generation and then fell back, but I haven't tested it. But based on those numbers, I think the conservative answer would be that the cost for the scalar op would be 3.

This brings up the larger question of what exactly are we trying to model with these numbers: latency, throughput, something else?

It also brings up the question from bug 21356, comment 5: how do we distinguish between micro-arches that may vary wildly in their cost for a given op?

llvmbot commented 8 years ago

Should we also, perhaps, start giving higher costs to scalar operations in these cases?

E.g. we currently have: "Found an estimated cost of 1 for VF 1 For instruction: %conv = sitofp i32 %0 to float"

Regardless of whether the vector op is 5 or 15, is the scalar really a 1?

rotateright commented 8 years ago

Thanks, Eli.

I checked in the minimal cost model change in: http://reviews.llvm.org/rL274658

...and that should fix cases f2() and f4().

I think f3() has some other problem; even with cost=1 for the sitofp:

28434.c:19:3: remark: loop not vectorized: vectorization is not beneficial and is not explicitly forced [-Rpass-analysis=loop-vectorize]

efriedma-quic commented 8 years ago

See also http://reviews.llvm.org/D21530 .

rotateright commented 8 years ago

Looks like an x86 cost model problem. f2() vectorizes for PowerPC:

$ ./clang -O2 28434.c -S -o - -target powerpc64 -emit-llvm

; ModuleID = '28434.c'
source_filename = "28434.c"
target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64"

; Function Attrs: norecurse nounwind
define void @f2(float* noalias nocapture align 16 %q, i32* noalias nocapture readonly align 16 %p) local_unnamed_addr #0 {
entry:
  %0 = bitcast i32* %p to <4 x i32>*
  %1 = load <4 x i32>, <4 x i32>* %0, align 16, !tbaa !1
  %2 = sitofp <4 x i32> %1 to <4 x float>
  %3 = bitcast float* %q to <4 x float>*
  store <4 x float> %2, <4 x float>* %3, align 16, !tbaa !5
  ret void
}

attributes #0 = { norecurse nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="ppc64" "target-features"="+altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-qpx,-vsx" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.9.0 (trunk 274633)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C/C++ TBAA"}
!5 = !{!6, !6, i64 0}
!6 = !{!"float", !3, i64 0}