Closed weiguozhi closed 8 months ago
Because the widest type in the loop is 32 bits wide. Passing -vectorizer-maximize-bandwidth will vectorize by 16.
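As a rough illustration of that arithmetic, assuming 128-bit VSX vector registers (16 bytes per instruction, as mentioned elsewhere in this thread): dividing the register width by the widest type in the loop gives 4 lanes, while dividing by the narrowest type, which is what -vectorizer-maximize-bandwidth considers, gives 16. The helper name below is hypothetical and is not an LLVM API.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): how many lanes of elemBits each
// fit into one vector register of regBits.
constexpr unsigned lanesPerRegister(unsigned regBits, unsigned elemBits) {
    return regBits / elemBits;
}

// Assumption: 128-bit VSX vector registers.
constexpr unsigned kRegBits = 128;

// Sized by the widest type in the loop (the 32-bit int counter).
constexpr unsigned kDefaultVF = lanesPerRegister(kRegBits, 32);

// Sized by the narrowest type in the loop (the 8-bit char loads).
constexpr unsigned kMaxBandwidthVF = lanesPerRegister(kRegBits, 8);

static_assert(kDefaultVF == 4, "widest type is 32 bits, so 4 lanes");
static_assert(kMaxBandwidthVF == 16, "narrowest type is 8 bits, so 16 lanes");
```

This only models the lane count. The real vectorizer also weighs per-VF costs from the target cost model before choosing a factor.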
I've started working on making -vectorizer-maximize-bandwidth the default, but it keeps getting pushed towards the bottom of my todo list.
When VF == 4, MemVT == MVT::v4i8 and ValVT == MVT::v16i8, getLoadExtAction returns Expand, so getMemoryOpCost computes the cost of building the vector from scalar values, which is too high. In fact, VSX has lxsiwax, which can load a 32-bit value into a VSX register. So I added the following code to PPCTargetLowering::PPCTargetLowering():
setLoadExtAction(ISD::EXTLOAD, MVT::v16i8, MVT::v4i8, Custom);
With this change LLVM can vectorize the loop. With the -fno-unroll-loops option, it now generates:
.LBB0_4: # %vector.body
lxsiwax 37, 0, 3
lxvd2x 0, 0, 9
addi 10, 10, -4
addi 3, 3, 4
cmpldi 10, 0
xxspltw 37, 37, 1
xxswapd 32, 0
vcmpgtsb 5, 2, 5
vperm 5, 5, 5, 0
xxland 37, 37, 36
vadduwm 3, 5, 3
bne 0, .LBB0_4
...
There is still a problem: VSX instructions can handle 16 bytes per instruction, so why doesn't the vectorizer try a vectorization factor of 16?
Looks like a problem in PPCTTIImpl::getMemoryOpCost: when loading a value of type <4 x i8>, the cost is 41.
https://godbolt.org/z/1hKadr47d
This case is vectorized now.
Extended Description
The source code is:
int foo(char *ptr, int l) {
  const char *const end = ptr + l;
  int count = 0;
  while (ptr < end) {
    count += ((signed char)(*ptr) < -0x40) ? 1 : 0;
    ptr++;
  }
  return count;
}
When compiled with options -m64 -O2, llvm unrolls the loop 12 times,
.LBB0_4: # %vector.body
# =>This Inner Loop Header: Depth=1
// the rest iterations
GCC can vectorize the loop:
...
.L4:
sldi 5,7,4
addi 7,7,1
lxvd2x 33,8,5
xxpermdi 33,33,33,2
vcmpgtsb 1,2,1
xxsel 33,35,36,33
vperm 12,1,5,7
vperm 1,1,5,8
vperm 6,12,0,9
vperm 12,12,0,10
vperm 11,1,0,9
vadduwm 6,6,13
vperm 13,1,0,10
vadduwm 12,12,6
vadduwm 12,11,12
vadduwm 13,13,12
bdnz .L4
... // the rest iterations
In one of our internal test cases, the LLVM version is 2.7x slower than the GCC version on POWER8.
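For readers who prefer source over assembly, the shape of a loop that consumes 16 bytes per iteration can be sketched in portable C++ as below. This is only an illustration: the function names are hypothetical, the block size of 16 is an assumption matching the 16-byte VSX register width, and nothing guarantees that a given compiler turns the inner loop into vector instructions.

```cpp
#include <cassert>

// Scalar reference with the same logic as foo() in the report.
int countScalar(const char *ptr, int l) {
    const char *const end = ptr + l;
    int count = 0;
    while (ptr < end) {
        count += (static_cast<signed char>(*ptr) < -0x40) ? 1 : 0;
        ptr++;
    }
    return count;
}

// Blocked variant (hypothetical): full blocks of 16 bytes first, then a
// scalar tail for the remaining l % 16 bytes.
int countBlocked16(const char *ptr, int l) {
    constexpr int kBlock = 16; // assumption: one 16-byte vector per step
    int count = 0;
    int i = 0;
    for (; i + kBlock <= l; i += kBlock) {
        int partial = 0;
        // Fixed trip count: 16 byte compares, then one add into count.
        for (int j = 0; j < kBlock; ++j)
            partial += (static_cast<signed char>(ptr[i + j]) < -0x40) ? 1 : 0;
        count += partial;
    }
    for (; i < l; ++i)
        count += (static_cast<signed char>(ptr[i]) < -0x40) ? 1 : 0;
    return count;
}

// Small helper: asserts that both variants agree on a buffer.
void checkAgree(const char *ptr, int l) {
    assert(countBlocked16(ptr, l) == countScalar(ptr, l));
}
```

Calling checkAgree on buffers whose length is not a multiple of 16 exercises both the blocked part and the tail.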