Open llvmbot opened 8 years ago
mentioned in issue llvm/llvm-project#28835
Also, I should learn to read - the bug is specifically about short fixed-count loops. For that case, to clarify what I wrote before, we will not vectorize if we have, as in f3(), the combination of:

1) `#pragma clang loop unroll(disable)`
2) no `#pragma clang loop vectorize(enable)`
3) a loop count < 16

This will not be caught by either the loop vectorizer (due to (2) + (3)) or the SLP vectorizer (due to (1)).
That is more or less by design, although I'm not 100% sure this is really the desired behavior.
> 2) In some sense the cost model is actually correct, because we don't generate non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since SSE2. An ISel issue?
Yes, probably something to do with type legalization, since <2 x i32> isn't legal.
Right. This would probably require custom legalization of some sort.
First, we need to be careful with loop bounds of 2 or 4. The loop vectorizer, unless forced with a pragma, ignores all loops with trip count < 16. So if we get auto-vectorization for low bounds, it's from unrolling + SLP vectorizer.
Regarding the "double to int" cases, this is purely an SSE issue. If the loop bound is high enough (>= 16), or with the pragma, we vectorize for AVX and above.
For:

```c
void f1(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 16; ++i) {
    q[i] = p[i];
  }
}
```

With -mavx -fno-unroll-loops we get:

```asm
.LBB0_1:                                # %vector.body
        vcvttpd2dqy     (%rsi,%rax,8), %xmm0
        vmovapd %xmm0, (%rdi,%rax,4)
        addq    $4, %rax
        cmpq    $16, %rax
        jne     .LBB0_1
```
For SSE4.2 and below, this is weirder:

1) We don't vectorize because the cost model considers the vector convert too expensive. This is probably incorrect, and is probably, again, related to the scalar convert being too cheap (cost of 1 for the scalar convert, 6 for the vector one...).
2) In some sense the cost model is actually correct, because we don't generate non-VEX cvttpd2dq even when the IR is vectorized, even though it exists since SSE2. An ISel issue?
Same issue with int to double conversion:

```c
void f1(double * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}

void f4(double * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}
```

```asm
f1(double*, int*):                      # @f1(double*, int*)
        movq    (%rsi), %xmm0           # xmm0 = mem[0],zero
        cvtdq2pd        %xmm0, %xmm0
        movaps  %xmm0, (%rdi)
f4(double*, int*):                      # @f4(double*, int*)
        cvtsi2sdl       (%rsi), %xmm0
        movsd   %xmm0, (%rdi)
        xorps   %xmm0, %xmm0
        cvtsi2sdl       4(%rsi), %xmm0
        movsd   %xmm0, 8(%rdi)
```
Also, the vectorized version should probably skip the register temporary:

```asm
        cvtdq2pd        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
```
double to int conversion:

```c
void f1(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}

void f4(int * restrict q __attribute__((align_value(16))),
        double * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 2; ++i) {
    q[i] = p[i];
  }
}
```

f1 produces horrible code:

```asm
f1(int*, double*):                      # @f1(int*, double*)
        movapd  (%rsi), %xmm0
        cvttsd2si       %xmm0, %rax
        movd    %rax, %xmm1
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
        cvttsd2si       %xmm0, %rax
        movd    %rax, %xmm0
        punpcklqdq      %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0]
        pshufd  $232, %xmm1, %xmm0      # xmm0 = xmm1[0,2,2,3]
        movq    %xmm0, (%rdi)
```

f4 is better but not vectorized:

```asm
f4(int*, double*):                      # @f4(int*, double*)
        cvttsd2si       (%rsi), %eax
        movl    %eax, (%rdi)
        cvttsd2si       8(%rsi), %eax
        movl    %eax, 4(%rdi)
```

Correctly vectorized, this should be:

```asm
        cvttpd2dq       (%rsi), %xmm0
        movq    %xmm0, (%rdi)
```
> Regardless of whether the vector op is 5 or 15, is the scalar really a 1? ...
>
> This brings up the larger question of what exactly are we trying to model with these numbers: latency, throughput, something else?
In theory, the numbers represent throughput. The idea being that we handle latency hiding with the interleaving factor (which is computed separately). The "user" cost model used by the unroller, etc. is more subtle -- by which I mean less well defined :(
> Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
That's a good question.
The scalar version of sitofp (i32 --> float) for x86 would be cvtsi2ss. This apparently has a reciprocal throughput of 1.5 on Sandy Bridge, but 3 on Merom, Nehalem, Haswell, and Skylake? It seems unlikely that it magically improved for one generation and then fell back, but I haven't tested it. Based on those numbers, though, I think the conservative answer is that the cost of the scalar op should be 3.
This brings up the larger question of what exactly are we trying to model with these numbers: latency, throughput, something else?
It also brings up the question from bug 21356, comment 5: how do we distinguish between micro-arches that may vary wildly in their cost for a given op?
Should we also, perhaps, start giving higher costs to scalar operations in these cases?
E.g. we currently have:

```
Found an estimated cost of 1 for VF 1 For instruction: %conv = sitofp i32 %0 to float
```
Regardless of whether the vector op is 5 or 15, is the scalar really a 1?
Thanks, Eli.
I checked in the minimal cost model change in: http://reviews.llvm.org/rL274658
...and that should fix cases f2() and f4().
I think f3() has some other problem; even with cost=1 for the sitofp:
28434.c:19:3: remark: loop not vectorized: vectorization is not beneficial and is not explicitly forced [-Rpass-analysis=loop-vectorize]
See also http://reviews.llvm.org/D21530 .
Looks like an x86 cost model problem. f2() vectorizes for PowerPC:
```
$ ./clang -O2 28434.c -S -o - -target powerpc64 -emit-llvm
```

```llvm
; ModuleID = '28434.c'
source_filename = "28434.c"
target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64"

; Function Attrs: norecurse nounwind
define void @f2(float* noalias nocapture align 16 %q, i32* noalias nocapture readonly align 16 %p) local_unnamed_addr #0 {
entry:
  %0 = bitcast i32* %p to <4 x i32>*
  %1 = load <4 x i32>, <4 x i32>* %0, align 16, !tbaa !1
  %2 = sitofp <4 x i32> %1 to <4 x float>
  %3 = bitcast float* %q to <4 x float>*
  store <4 x float> %2, <4 x float>* %3, align 16, !tbaa !5
  ret void
}

attributes #0 = { norecurse nounwind "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="ppc64" "target-features"="+altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-qpx,-vsx" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 3.9.0 (trunk 274633)"}
!1 = !{!2, !2, i64 0}
!2 = !{!"int", !3, i64 0}
!3 = !{!"omnipotent char", !4, i64 0}
!4 = !{!"Simple C/C++ TBAA"}
!5 = !{!6, !6, i64 0}
!6 = !{!"float", !3, i64 0}
```
Extended Description
see https://godbolt.org/g/bLIzA1
```c
void f1(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}

void f2(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop vectorize(enable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}

void f3(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
#pragma clang loop unroll(disable)
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}

void f4(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 4; ++i) {
    q[i] = p[i];
  }
}

void f5(float * restrict q __attribute__((align_value(16))),
        int * restrict p __attribute__((align_value(16)))) {
  for (unsigned int i = 0; i < 40; ++i) {
    q[i] = p[i];
  }
}
```
f1 is correctly vectorized when vectorization is forced and loop unrolling is disabled:

```asm
        cvtdq2ps        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
```

f2 is not correctly vectorized even when vectorization is forced, because the loop is fully unrolled before vectorization?:

```asm
        cvtsi2ssl       (%rsi), %xmm0
        movss   %xmm0, (%rdi)
        xorps   %xmm0, %xmm0
        cvtsi2ssl       4(%rsi), %xmm0
        movss   %xmm0, 4(%rdi)
        xorps   %xmm0, %xmm0
        ...
```

f3 is not correctly vectorized when loop unrolling is disabled but vectorization is not forced:

```asm
        xorl    %eax, %eax
.LBB2_1:                                # =>This Inner Loop Header: Depth=1
        xorps   %xmm0, %xmm0
        cvtsi2ssl       (%rsi,%rax,4), %xmm0
        movss   %xmm0, (%rdi,%rax,4)
        incq    %rax
        cmpq    $4, %rax
        jne     .LBB2_1
```
f4, without any #pragma, is not vectorized but is unrolled, producing the same code as f2.

f5, without any #pragma, is correctly vectorized and unrolled only when the loop count is ~40 or higher:

```asm
        cvtdq2ps        (%rsi), %xmm0
        movaps  %xmm0, (%rdi)
        cvtdq2ps        16(%rsi), %xmm0
        movaps  %xmm0, 16(%rdi)
        ...
```