Open SoapGentoo opened 3 years ago
I've gone through and tweaked the AVX2 truncate and non-uniform shift costs to more closely match the 'worst case scenario' that we're supposed to keep to (we really need some way to automate this...).
This will prevent the regression on pre-AVX512 targets, but on Skylake etc., which have better trunc/shift ops, SLP will still produce the partial-vectorization nightmare that the load/store combines can't recover from.
So we're probably at the limit of what we can achieve with cost model accuracy alone.
If we vectorized the entire thing, we might be able to handle a store(trunc(lshr(broadcast(),c))) pattern in store combining - but that would require us finally fixing [Bug #31572].
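To make that pattern concrete, here is a minimal scalar sketch (hypothetical C++, not actual IR) of what the four vectorized lanes compute in the Clang 12 codegen below; each lane has exactly the store(trunc(lshr(broadcast(x), c))) shape, with c drawn from the .LCPI8_0 shift constants:

```cpp
#include <cstdint>

// Hypothetical scalar model of the four SLP-vectorized lanes (bytes 1..4 of
// store64_to_LE): each lane is store(trunc(lshr(broadcast(x), c))).
void store_vectorized_lanes(std::uint64_t x, unsigned char *p) {
    const unsigned shifts[4] = {8, 16, 24, 32};  // the .LCPI8_0 constants
    for (int lane = 0; lane < 4; ++lane)
        p[1 + lane] = static_cast<unsigned char>(x >> shifts[lane]);  // trunc(lshr(broadcast, c))
}
```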
Also, I'm not sure whether this should be a 12.x blocker - this is just a performance regression, not a correctness regression (not sure what the exact policy is for this?).
I guess you could look at this from two different perspectives:
If we're looking at SLP vectorization itself, I see the following issues:
Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?
Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 (https://reviews.llvm.org/rG49950cb1f6f6)... although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.
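If __builtin_bswap64() were acceptable, the usual way to keep it portable is a small wrapper along these lines (a hedged sketch; the MSVC intrinsic and the generic fallback are my assumptions, not anything from the reporter's repo):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <stdlib.h>  // _byteswap_uint64
#endif

// Portable 64-bit byte swap: GCC/Clang builtin, MSVC intrinsic, plain C++ fallback.
inline std::uint64_t bswap64(std::uint64_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);
#elif defined(_MSC_VER)
    return _byteswap_uint64(v);
#else
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r = (r << 8) | ((v >> (8 * i)) & 0xff);  // move byte i to the opposite end
    return r;
#endif
}
```

Whether that is acceptable depends on which compilers the project targets; the point of Rob Pike's byte-at-a-time pattern is to avoid exactly this kind of per-compiler ifdef.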
I'm sympathetic to the bug report. We could have ended up with this pattern after loop unrolling, where the load was templated over the type size.
I think we should be making a much better effort to match bswap intrinsics, and certainly we shouldn't prevent them from being detected.
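To make the loop-unrolling scenario above concrete: a generic routine templated over the type size (the comment mentions loads, but the store side at issue here is analogous) turns into exactly the shift/trunc/store-byte chain once fully unrolled for a 64-bit type. This is a hypothetical sketch, not code from the reporter's repo:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical generic little-endian store, templated over the type size.
// Full unrolling for T = std::uint64_t yields eight shift + truncate + byte-store
// operations, which SLP then partially vectorizes as shown in this report.
template <typename T>
void store_to_LE(T value, unsigned char *out) {
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out[i] = static_cast<unsigned char>(value >> (8 * i));
}

template void store_to_LE<std::uint64_t>(std::uint64_t, unsigned char *);
```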
Godbolt: https://godbolt.org/z/4b4dcf9T9
Looks like SLPVectorizer is acting up again for the 32/64-bit cases. https://godbolt.org/z/csrh5x1aM
https://godbolt.org/z/zfrc65YfW
GCC 11.1 and Clang 11 are fine; Clang 12 is off.
Please can you post standalone reproducers via godbolt links?
Extended Description
I have the following repo with Rob Pike's endianness tricks: https://github.com/SoapGentoo/portable-endianness
It seems Clang 12 broke the 64-bit store routines.
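For reference, the 64-bit store routines follow Rob Pike's byte-at-a-time pattern, roughly like this (a paraphrased sketch; the exact code lives in the repo linked above):

```cpp
#include <cstdint>

// Rob Pike-style endian-independent stores: write the value one byte at a time
// so the source code never depends on the host byte order.
void store64_to_LE(std::uint64_t v, unsigned char *p) {
    p[0] = static_cast<unsigned char>(v);
    p[1] = static_cast<unsigned char>(v >> 8);
    p[2] = static_cast<unsigned char>(v >> 16);
    p[3] = static_cast<unsigned char>(v >> 24);
    p[4] = static_cast<unsigned char>(v >> 32);
    p[5] = static_cast<unsigned char>(v >> 40);
    p[6] = static_cast<unsigned char>(v >> 48);
    p[7] = static_cast<unsigned char>(v >> 56);
}

void store64_to_BE(std::uint64_t v, unsigned char *p) {
    p[0] = static_cast<unsigned char>(v >> 56);
    p[1] = static_cast<unsigned char>(v >> 48);
    p[2] = static_cast<unsigned char>(v >> 40);
    p[3] = static_cast<unsigned char>(v >> 32);
    p[4] = static_cast<unsigned char>(v >> 24);
    p[5] = static_cast<unsigned char>(v >> 16);
    p[6] = static_cast<unsigned char>(v >> 8);
    p[7] = static_cast<unsigned char>(v);
}
```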
Clang 11.0.1:

```asm
store64_to_LE(unsigned long, unsigned char*):   # @store64_to_LE(unsigned long, unsigned char*)
        mov     qword ptr [rsi], rdi
        ret
store64_to_BE(unsigned long, unsigned char*):   # @store64_to_BE(unsigned long, unsigned char*)
        movbe   qword ptr [rsi], rdi
        ret
```
Clang 12.0.0:

```asm
.LCPI8_0:
        .quad   8                       # 0x8
        .quad   16                      # 0x10
        .quad   24                      # 0x18
        .quad   32                      # 0x20
.LCPI8_1:
        .byte   0                       # 0x0
        .byte   8                       # 0x8
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
store64_to_LE(unsigned long, unsigned char*):   # @store64_to_LE(unsigned long, unsigned char*)
        mov     byte ptr [rsi], dil
        vmovq   xmm0, rdi
        vpbroadcastq    ymm0, xmm0
        vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI8_0]
        vextracti128    xmm1, ymm0, 1
        vmovdqa xmm2, xmmword ptr [rip + .LCPI8_1]  # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
        vpshufb xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm2
        vpunpcklwd      xmm0, xmm0, xmm1            # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
        vmovd   dword ptr [rsi + 1], xmm0
        mov     rax, rdi
        shr     rax, 40
        mov     byte ptr [rsi + 5], al
        mov     rax, rdi
        shr     rax, 48
        mov     byte ptr [rsi + 6], al
        shr     rdi, 56
        mov     byte ptr [rsi + 7], dil
        vzeroupper
        ret
.LCPI11_0:
        .quad   56                      # 0x38
        .quad   48                      # 0x30
        .quad   40                      # 0x28
        .quad   32                      # 0x20
.LCPI11_1:
        .byte   0                       # 0x0
        .byte   8                       # 0x8
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
store64_to_BE(unsigned long, unsigned char*):   # @store64_to_BE(unsigned long, unsigned char*)
        mov     rax, rdi
        vmovq   xmm0, rdi
        vpbroadcastq    ymm0, xmm0
        vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI11_0]
        vextracti128    xmm1, ymm0, 1
        vmovdqa xmm2, xmmword ptr [rip + .LCPI11_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
        vpshufb xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm2
        vpunpcklwd      xmm0, xmm0, xmm1            # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
        vmovd   dword ptr [rsi], xmm0
        mov     rcx, rdi
        shr     rcx, 24
        mov     byte ptr [rsi + 4], cl
        mov     rcx, rdi
        shr     rcx, 16
        mov     byte ptr [rsi + 5], cl
        mov     byte ptr [rsi + 6], ah
        mov     byte ptr [rsi + 7], al
        vzeroupper
        ret
```
I consider detecting this sort of pattern very important for writing performant, portable code.