Loop vectorizer produces extremely poor code for pattern fills

llvmbot commented 6 years ago


Bugzilla Link	37423
Version	6.0
OS	Windows NT
Reporter	LLVM Bugzilla Contributor
CC	@adibiagio,@fhahn,@gregbedwell,@hfinkel,@RKSimon,@rotateright

Extended Description

void patternFill(int *arr, int count) {
    for (int i = 0; i < count; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

With the clang 6.0 release, "clang -O2" on x86-64 turns this into a lot of code, with the inner loop being:

.LBB0_8:                                # =>This Inner Loop Header: Depth=1
    movdqa  %xmm1, %xmm7
    pand    %xmm4, %xmm7
    movdqa  %xmm2, %xmm0
    pand    %xmm4, %xmm0
    pcmpeqd %xmm3, %xmm0
    pshufd  $177, %xmm0, %xmm5          # xmm5 = xmm0[1,0,3,2]
    pand    %xmm0, %xmm5
    pcmpeqd %xmm3, %xmm7
    pshufd  $177, %xmm7, %xmm0          # xmm0 = xmm7[1,0,3,2]
    pand    %xmm7, %xmm0
    shufps  $136, %xmm0, %xmm5          # xmm5 = xmm5[0,2],xmm0[0,2]
    movaps  %xmm5, %xmm0
    andnps  %xmm8, %xmm0
    andps   %xmm9, %xmm5
    orps    %xmm0, %xmm5
    movups  %xmm5, (%rdi,%rsi,4)
    movups  %xmm5, 16(%rdi,%rsi,4)
    movups  %xmm5, 32(%rdi,%rsi,4)
    movups  %xmm5, 48(%rdi,%rsi,4)
    addq    $16, %rsi
    paddq   %xmm6, %xmm2
    paddq   %xmm6, %xmm1
    addq    $4, %rax
    jne     .LBB0_8

By comparison, compiling with "-fno-vectorize" results in much better code (in terms of both code size and expected execution time):

    movaps  .LCPI0_0(%rip), %xmm0       # xmm0 = [123,456,123,456]
.LBB0_8:                                # =>This Inner Loop Header: Depth=1
    movups  %xmm0, (%rdi,%rcx,4)
    addq    $4, %rcx
    cmpq    %rcx, %rdx
    jne     .LBB0_8
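
For readers who want the shape of that second loop at the source level, here is a rough sketch in C. It is only an illustration of the idea of storing a loop-invariant block {123, 456, 123, 456}, not what the compiler does internally. The names patternFillBlocked and fillsMatch are made up for this example, and the 64-element buffer limit in fillsMatch is an arbitrary assumption.

```c
#include <string.h>

/* Reference loop, as given in the report. */
void patternFill(int *arr, int count) {
    for (int i = 0; i < count; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

/* Hypothetical blocked variant: copies the constant block
   {123, 456, 123, 456} four elements at a time, then finishes
   the remaining 0..3 elements one by one. */
void patternFillBlocked(int *arr, int count) {
    static const int pattern[4] = {123, 456, 123, 456};
    int i = 0;
    for (; count - i >= 4; i += 4)
        memcpy(&arr[i], pattern, sizeof pattern);
    for (; i < count; i++)
        arr[i] = pattern[i & 3];   /* i & 3 selects the same parity as i & 1 */
}

/* Returns 1 when both versions write identical contents.
   Assumes 0 <= count <= 64 (arbitrary limit for this sketch). */
int fillsMatch(int count) {
    int a[64] = {0}, b[64] = {0};
    patternFill(a, count);
    patternFillBlocked(b, count);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling fillsMatch with a few counts, including ones that are not a multiple of 4, is a quick way to check that the blocked form and the original loop agree.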

RKSimon commented 5 years ago

This variant also includes a const loop count: https://godbolt.org/z/OkPkSf

void patternFill_const(int *arr) {
    for (int i = 0; i < 65536; i++)
        arr[i] = (i & 1) ? 456 : 123;
}

Here we should definitely keep the loop's indvar as an i32/vXi32 instead of extending it to i64/vXi64 - we can guarantee that i + #loopvectorelts never overflows.

That would improve the loop, but it probably wouldn't do much to help optimize this to a constant select mask.

RKSimon commented 5 years ago

Current Codegen: https://godbolt.org/z/R_K7FG

fhahn commented 5 years ago

This is still an issue. It looks like we do not manage to fold the vector add, and, select chain that the vectorizer generates.
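
To illustrate why that chain should be foldable, here is a small scalar model in C. It assumes a vector width of 4 (matching the 4 x i32 SSE registers in the assembly above) and a base index that advances by that width. Under those assumptions (base + lane) & 1 equals lane & 1, so the select result is the same constant on every iteration. The names selectChain and chainIsLoopInvariant are hypothetical, and this is only a model of the arithmetic, not of how LLVM would perform the fold.

```c
#include <stdint.h>
#include <string.h>

#define VF 4  /* assumed vector width for this sketch */

/* Models one vector iteration of the add, and, select chain:
   iv = <base, base+1, base+2, base+3>, bit = iv & 1,
   out = select(bit != 0, 456, 123). */
void selectChain(int32_t base, int32_t out[VF]) {
    for (int lane = 0; lane < VF; lane++) {
        int32_t iv = base + lane;       /* vector add */
        int32_t bit = iv & 1;           /* vector and */
        out[lane] = bit ? 456 : 123;    /* vector select */
    }
}

/* Returns 1 if every full vector iteration below tripCount
   produces the constant <123, 456, 123, 456>. */
int chainIsLoopInvariant(int32_t tripCount) {
    static const int32_t expected[VF] = {123, 456, 123, 456};
    int32_t out[VF];
    for (int32_t base = 0; tripCount - base >= VF; base += VF) {
        selectChain(base, out);
        if (memcmp(out, expected, sizeof out) != 0)
            return 0;
    }
    return 1;
}
```

For the constant trip count 65536 from the variant above, chainIsLoopInvariant(65536) checks every vector iteration against the same constant, which is the value the non-vectorized output loads once before the loop.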

Loop vectorizer produces extremely poor code for pattern fills #36771
