llvmbot opened 6 years ago
This variant also includes a const loop count: https://godbolt.org/z/OkPkSf
```c
void patternFill_const(int *arr) {
  for (int i = 0; i < 65536; i++)
    arr[i] = (i & 1) ? 456 : 123;
}
```
Here we should definitely keep the loop's indvar as an i32/vXi32 instead of extending it to i64/vXi64 - with a constant trip count of 65536 we can guarantee that i + #loopvectorelts never overflows.
It'd improve the loop, but probably wouldn't do much to help optimize the select down to a constant mask, though.
Current Codegen: https://godbolt.org/z/R_K7FG
This is still an issue. It looks like we don't manage to fold the vector add/and/select chain that the vectorizer generates.
Extended Description
```c
void patternFill(int *arr, int count) {
  for (int i = 0; i < count; i++)
    arr[i] = (i & 1) ? 456 : 123;
}
```
With the clang 6.0 release, "clang -O2" on x86-64 turns this into a lot of code, with the inner loop being:
```asm
.LBB0_8:                              # =>This Inner Loop Header: Depth=1
	movdqa	%xmm1, %xmm7
	pand	%xmm4, %xmm7
	movdqa	%xmm2, %xmm0
	pand	%xmm4, %xmm0
	pcmpeqd	%xmm3, %xmm0
	pshufd	$177, %xmm0, %xmm5        # xmm5 = xmm0[1,0,3,2]
	pand	%xmm0, %xmm5
	pcmpeqd	%xmm3, %xmm7
	pshufd	$177, %xmm7, %xmm0        # xmm0 = xmm7[1,0,3,2]
	pand	%xmm7, %xmm0
	shufps	$136, %xmm0, %xmm5        # xmm5 = xmm5[0,2],xmm0[0,2]
	movaps	%xmm5, %xmm0
	andnps	%xmm8, %xmm0
	andps	%xmm9, %xmm5
	orps	%xmm0, %xmm5
	movups	%xmm5, (%rdi,%rsi,4)
	movups	%xmm5, 16(%rdi,%rsi,4)
	movups	%xmm5, 32(%rdi,%rsi,4)
	movups	%xmm5, 48(%rdi,%rsi,4)
	addq	$16, %rsi
	paddq	%xmm6, %xmm2
	paddq	%xmm6, %xmm1
	addq	$4, %rax
	jne	.LBB0_8
```
By comparison, compiling with "-fno-vectorize" results in the following, which is much better in terms of both code size and expected execution time:
```asm
	movaps	.LCPI0_0(%rip), %xmm0     # xmm0 = [123,456,123,456]
.LBB0_8:                              # =>This Inner Loop Header: Depth=1
	movups	%xmm0, (%rdi,%rcx,4)
	addq	$4, %rcx
	cmpq	%rcx, %rdx
	jne	.LBB0_8
```