llvm / llvm-project


Rob Pike endianness trick not getting optimized anymore #49708

Open SoapGentoo opened 3 years ago

SoapGentoo commented 3 years ago
Bugzilla Link 50364
Version 12.0
OS Linux
Blocks llvm/llvm-project#48661
CC @alexey-bataev, @efriedma-quic, @LebedevRI, @RKSimon, @zygoloid, @rotateright, @tstellar

Extended Description

I have the following repo with Rob Pike's endianness tricks: https://github.com/SoapGentoo/portable-endianness

It seems Clang 12 broke the 64-bit store routines.
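For reference, the routines follow Rob Pike's byte-at-a-time idiom: every byte is written explicitly with shift-and-truncate stores, so the source never depends on host byte order and the compiler is expected to collapse the whole function into a single wide store (plus a byte swap where needed). A minimal sketch of the shape of the code, assuming the repo's 64-bit routines look roughly like this (the exact source is in the link above):

    #include <stdint.h>

    // Shift-based stores: endianness-agnostic source that should become a
    // single 8-byte store (mov) for LE and a byte-swapped store (movbe) for BE.
    void store64_to_LE(uint64_t v, unsigned char *p) {
        p[0] = (unsigned char)(v);
        p[1] = (unsigned char)(v >> 8);
        p[2] = (unsigned char)(v >> 16);
        p[3] = (unsigned char)(v >> 24);
        p[4] = (unsigned char)(v >> 32);
        p[5] = (unsigned char)(v >> 40);
        p[6] = (unsigned char)(v >> 48);
        p[7] = (unsigned char)(v >> 56);
    }

    void store64_to_BE(uint64_t v, unsigned char *p) {
        p[0] = (unsigned char)(v >> 56);
        p[1] = (unsigned char)(v >> 48);
        p[2] = (unsigned char)(v >> 40);
        p[3] = (unsigned char)(v >> 32);
        p[4] = (unsigned char)(v >> 24);
        p[5] = (unsigned char)(v >> 16);
        p[6] = (unsigned char)(v >> 8);
        p[7] = (unsigned char)(v);
    }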

Clang 11.0.1:

store64_to_LE(unsigned long, unsigned char*):   # @store64_to_LE(unsigned long, unsigned char*)
        mov qword ptr [rsi], rdi
        ret
store64_to_BE(unsigned long, unsigned char*):   # @store64_to_BE(unsigned long, unsigned char*)
        movbe qword ptr [rsi], rdi
        ret

Clang 12.0.0:

.LCPI8_0:
        .quad 8                         # 0x8
        .quad 16                        # 0x10
        .quad 24                        # 0x18
        .quad 32                        # 0x20
.LCPI8_1:
        .byte 0                         # 0x0
        .byte 8                         # 0x8
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
store64_to_LE(unsigned long, unsigned char*):   # @store64_to_LE(unsigned long, unsigned char*)
        mov byte ptr [rsi], dil
        vmovq xmm0, rdi
        vpbroadcastq ymm0, xmm0
        vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI8_0]
        vextracti128 xmm1, ymm0, 1
        vmovdqa xmm2, xmmword ptr [rip + .LCPI8_1]  # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
        vpshufb xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm2
        vpunpcklwd xmm0, xmm0, xmm1     # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
        vmovd dword ptr [rsi + 1], xmm0
        mov rax, rdi
        shr rax, 40
        mov byte ptr [rsi + 5], al
        mov rax, rdi
        shr rax, 48
        mov byte ptr [rsi + 6], al
        shr rdi, 56
        mov byte ptr [rsi + 7], dil
        vzeroupper
        ret
.LCPI11_0:
        .quad 56                        # 0x38
        .quad 48                        # 0x30
        .quad 40                        # 0x28
        .quad 32                        # 0x20
.LCPI11_1:
        .byte 0                         # 0x0
        .byte 8                         # 0x8
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
        .zero 1
store64_to_BE(unsigned long, unsigned char*):   # @store64_to_BE(unsigned long, unsigned char*)
        mov rax, rdi
        vmovq xmm0, rdi
        vpbroadcastq ymm0, xmm0
        vpsrlvq ymm0, ymm0, ymmword ptr [rip + .LCPI11_0]
        vextracti128 xmm1, ymm0, 1
        vmovdqa xmm2, xmmword ptr [rip + .LCPI11_1] # xmm2 = <0,8,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
        vpshufb xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm2
        vpunpcklwd xmm0, xmm0, xmm1     # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
        vmovd dword ptr [rsi], xmm0
        mov rcx, rdi
        shr rcx, 24
        mov byte ptr [rsi + 4], cl
        mov rcx, rdi
        shr rcx, 16
        mov byte ptr [rsi + 5], cl
        mov byte ptr [rsi + 6], ah
        mov byte ptr [rsi + 7], al
        vzeroupper
        ret

I consider detecting this sort of pattern very important for writing performant, portable code.

RKSimon commented 3 years ago

I've gone through and tweaked the AVX2 truncate and non-uniform shift costs to more closely match the 'worst-case scenario' that we're supposed to keep to (we really need some way to automate this...)

This will prevent the regression on pre-AVX512 targets, but on Skylake etc. with better trunc/shift ops the SLP will still produce the partial vectorization nightmare which the load/store combines can't recover from.

So we're probably at the limit of what we can achieve with cost model accuracy alone.

If we vectorized the entire thing, we might be able to handle a store(trunc(lshr(broadcast(),c))) pattern in store combining - but that would require us finally fixing [Bug #31572].

Also, I'm not sure whether this should be a 12.x blocker - this is just a performance regression, not a correctness regression (not sure what the exact policy is for this?).

efriedma-quic commented 3 years ago

I guess you could look at this from two different perspectives:

  1. SLP vectorization is messing up the pattern; we should avoid vectorization here.
  2. SLP vectorization should be improved to vectorize this pattern well. Then we can convert the resulting sequence to use scalar registers if appropriate.

If we're looking at SLP vectorization itself, I see the following issues:

  1. There's a general issue that SLP misses patterns involving zero (here, "val >> 0").
  2. We should recognize that the shifts are equivalent to a shuffle of some sort (see the sketch after this list).
  3. The cost modeling is a bit dubious. See https://godbolt.org/z/3P1oWW6nc.
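To make points 1 and 2 concrete (an illustrative sketch assuming a little-endian host, not code from the thread): truncating "val >> 8*i" to a byte just selects byte i of the value's in-memory representation, so the whole shift/truncate sequence is a byte shuffle (the identity for the LE store, a reversal i.e. a bswap for the BE store), and the i == 0 element has no shift in the source at all, which is the "val >> 0" shape mentioned in point 1:

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    int main() {
        uint64_t val = 0x0807060504030201ULL;
        unsigned char bytes[8];
        memcpy(bytes, &val, sizeof val);   // object representation of val
        for (int i = 0; i < 8; ++i) {
            // Byte extraction by shift+truncate is a "shuffle" of val's bytes
            // on a little-endian host; i == 0 is the shift-free "val >> 0" case.
            assert((unsigned char)(val >> (8 * i)) == bytes[i]);
        }
        return 0;
    }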
LebedevRI commented 3 years ago

> Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?
>
> Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.

I'm sympathetic to the bug report. We could have ended up with this pattern after loop unrolling, where the load was templated over the type size.

I think we should be making a much better effort to match bswap intrinsics, and certainly we shouldn't prevent them from being detected.
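As an illustration of the loop-unrolling scenario mentioned above (a hypothetical example; the function name and shape are mine, not taken from the report): a store templated over the type size turns into exactly the per-byte shift/truncate/store chain from this bug once the loop is fully unrolled, so the pattern can arise even when nobody writes it out by hand.

    #include <stddef.h>
    #include <stdint.h>

    // Hypothetical generic little-endian store, templated over the value type.
    template <typename T>
    void store_le(T v, unsigned char *p) {
        for (size_t i = 0; i < sizeof(T); ++i)
            p[i] = (unsigned char)(v >> (8 * i));  // unrolled: the shift/trunc/store chain
    }

    // For a 64-bit T this should ideally still become a single 8-byte store after unrolling.
    template void store_le<uint64_t>(uint64_t, unsigned char *);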

rotateright commented 3 years ago

Not to dismiss the bug, but I'm curious if using __builtin_bswap64() is an option? Or is that not portable enough to other compilers?

Note that we may have made SLP less likely to avoid this pattern with the fix for bug 50256 ( https://reviews.llvm.org/rG49950cb1f6f6 )...although this isn't quite the same - there are no loads to combine here; this is just truncates, shifts, and stores.
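For reference, the builtin-based alternative being asked about would look roughly like this (a sketch; the function name is mine, and it assumes a little-endian host and a compiler that provides the GCC/Clang __builtin_bswap64 extension, which is exactly the portability concern):

    #include <stdint.h>
    #include <string.h>

    void store64_to_BE_bswap(uint64_t v, unsigned char *p) {
        uint64_t swapped = __builtin_bswap64(v);  // reverse byte order (LE host -> BE bytes)
        memcpy(p, &swapped, sizeof swapped);      // single 8-byte store; movbe/bswap+mov on x86-64
    }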

RKSimon commented 3 years ago

Godbolt: https://godbolt.org/z/4b4dcf9T9

LebedevRI commented 3 years ago

Looks like SLPVectorizer is acting up again, for 32/64-bit cases. https://godbolt.org/z/csrh5x1aM

SoapGentoo commented 3 years ago

https://godbolt.org/z/zfrc65YfW

GCC 11.1 and Clang 11 are fine; Clang 12 is off.

LebedevRI commented 3 years ago

Please can you post standalone reproducers via godbolt links?