llvmbot opened 5 years ago
llvm/llvm-project#40031 is fixed now (https://godbolt.org/z/ern9kc), leaving us with
```asm
count(bool const*, int):
        pxor    xmm0, xmm0
        xor     eax, eax
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_0] # xmm1 = <1,1,1,1,u,u,u,u,u,u,u,u,u,u,u,u>
.LBB0_1:
        movd    xmm2, dword ptr [rdi + rax]  # xmm2 = mem[0],zero,zero,zero
        pxor    xmm2, xmm1
        pmovzxbd xmm2, xmm2                  # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
        paddd   xmm0, xmm2
        add     rax, 4
        cmp     rax, 100
        jne     .LBB0_1
        ... efficient pshufd hsum of dwords
        ret
```
As discussed previously, special-casing small loop counts where paddb can't overflow could be a ~4x speedup, and would allow a more efficient cleanup with psadbw for unsigned byte hsums. Or various other tricks for general-case lengths.
This introduces a new missed optimization vs. clang4.0.1, which used to use pmovzx with a memory source.
```asm
pmovzxbd xmm3, dword ptr [rdi + rax]
```
PMOVZX with a memory source can micro-fuse, saving a front-end uop on Sandybridge-family vs. a movd load plus a separate pmovzxbd xmm,xmm. https://www.uops.info/html-tp/SKL/PMOVZXBD_XMM_M32-Measurements.html shows that its 2 uops can micro-fuse (RETIRE_SLOTS: 1.0) on Skylake, for example. The YMM-destination version can't for some reason (https://www.uops.info/html-instr/VPMOVZXBD_YMM_M64.html), but we should take advantage of the XMM version when we can. (Even with an indexed addressing mode for the non-VEX version; for the VEX 128-bit version, only with a non-indexed addressing mode.)
We fail to compress the XOR constant, still padding the .rodata vector with 12 bytes of zeros and loading it with movdqa instead of movd. So we have the worst of both worlds: extra uops in the loop and a larger constant.
Loading 8 or 16 bytes at a time with movq or movdqu, flipping all of them with one pxor, then punpcklbw / punpckhbw might even be a win. But probably not if we're widening all the way to int: that would take 6 shuffle uops per 4 vectors of PADDD input vs. just 4. 4-byte loads are cheap, micro-fusion can get them through the front-end for free as part of a pmovzx, and pxor is also cheap.
So low/high unpacking against zero is probably only a win if we notice we can use paddw. And if we're going to do any noticing of anything like that, even better to notice we can psadbw -> paddd, effectively using psadbw as an unpack and hsum.
Extended Description
Same code as llvm/llvm-project#40031 , but this bug is about the other major optimizations that are possible, not the silly extra shuffles:
(adapted from: https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a)
At best, once Bug 40685 is fixed, clang / LLVM is probably doing something like
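A hypothetical reconstruction of the elided snippet, written as intrinsics that mirror the asm above (per 4 bools: movd-sized load, pxor, pmovzxbd, paddd). The function name is illustrative; it assumes `sizeof(bool) == 1` and a `len` that's a multiple of 4, and counts the false elements:

```cpp
#include <immintrin.h>
#include <cstring>

// Sketch of the current codegen pattern, counting false elements.
// Assumes sizeof(bool) == 1 and len % 4 == 0.
__attribute__((target("sse4.1")))
int count_current(const bool* p, int len) {
    __m128i sum = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi8(1);        // the .LCPI0_0 constant
    for (int i = 0; i < len; i += 4) {
        int chunk;
        std::memcpy(&chunk, p + i, 4);            // movd-sized load
        __m128i v = _mm_cvtsi32_si128(chunk);
        v = _mm_xor_si128(v, ones);               // pxor: flip 0/1 -> 1/0
        v = _mm_cvtepu8_epi32(v);                 // pmovzxbd: bytes -> dwords
        sum = _mm_add_epi32(sum, v);              // paddd
    }
    // pshufd hsum of dwords
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}
```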
This costs us a shuffle and 2 other ALU instructions per 4 bools, regardless of unrolling.
Other missed optimizations
For large arrays, we only need ~1.25 vector ALU instructions (plus a pure load) and loop overhead per 16 / 32 / 64 bools (1 per vector, with an extra ALU amortized over a minor unroll by 4). See below.
Instead of flipping inside the loop, we can do
We can prove that this can't overflow notcount, because len is small enough and bool->int can only be 0 or 1.
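A hedged sketch of sinking the inversion out of the loop, under the assumption that this is what's meant: sum the raw bool bytes with paddb, hsum them with psadbw, and compute the false count as `len - truecount` at the end. The function name is invented; it assumes `len` is a multiple of 16 and small enough (len/16 <= 255) that byte lanes can't overflow:

```cpp
#include <immintrin.h>

// Sketch: no pxor in the loop; invert once at the end via len - truecount.
// Assumes sizeof(bool) == 1, len % 16 == 0, and len / 16 <= 255.
int count_false_no_flip(const bool* p, int len) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        acc = _mm_add_epi8(acc, v);    // paddb: each byte is 0 or 1, can't overflow
    }
    // psadbw against zero: two partial sums in the low words of each qword half
    __m128i sad = _mm_sad_epu8(acc, _mm_setzero_si128());
    int truecount = _mm_cvtsi128_si32(sad) +
                    _mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
    return len - truecount;            // notcount
}
```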
Byte counters won't overflow for a compile-time-constant array size of 100, so we can simply sum outside the loop.
Probably actually best to vextracti128 + paddb first, then psadbw xmm (if we care about Excavator / Ryzen), but that would make overflow of byte elements possible for sizes half as large. If we're using that hsum of bytes as a canned sequence, probably easiest to have one that works for all sizes up to 255 vectors. It's a minor difference, like 1 extra ALU uop.
32 * 255 = 8160 fits in 16 bits, so it really doesn't matter what SIMD element size we use on the result of psadbw. PADDQ is slower than PADDD/W/B on Atom/Silvermont including Goldmont Plus (non-AVX CPUs only), so we might as well avoid it so the same auto-vectorization pattern is good with just SSE2 / SSE4 on those CPUs.
Signed int overflow is undefined behaviour, but this is counting something different from the source; the C abstract machine never has a sum of the true elements in an int. So we'd better only do it in a way that can't overflow. For unknown array sizes, we can use psadbw inside the inner loop, e.g.
(Without AVX, we'd either need an alignment guarantee or separate movdqu loads, so for large len reaching an alignment boundary could become valuable to avoid front-end bottlenecks.)
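A minimal sketch of psadbw in the inner loop for unknown `len`, with invented names and an illustrative scalar tail. Per the PADDQ note above, it accumulates the psadbw results with paddd, which is safe because each psadbw partial fits in 16 bits and the high dword of each qword half stays zero:

```cpp
#include <immintrin.h>

// Sketch: psadbw every iteration for arbitrary len; flip via len - truecount.
// Assumes sizeof(bool) == 1.
int count_false_psadbw(const bool* p, int len) {
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();
    int i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        acc = _mm_add_epi32(acc, _mm_sad_epu8(v, zero)); // psadbw + paddd
    }
    int truecount = _mm_cvtsi128_si32(acc) +
                    _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    for (; i < len; ++i)               // scalar tail for len % 16 leftovers
        truecount += p[i];
    return len - truecount;
}
```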
Or if we can't / don't want to sink the boolean inversion out of the loop, we can start with a vector of set1_epi8(4) and do
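A hedged guess at the elided snippet (names invented): fold the inversion into the accumulator by starting each unrolled-by-4 group at set1_epi8(4) and subtracting the four raw bool vectors with psubb, so each byte lane directly counts false elements with no pxor in the loop. Assumes `len` is a multiple of 64 and small enough that byte lanes can't overflow:

```cpp
#include <immintrin.h>

// Sketch: per group of 4 vectors, lane = 4 - (trues in lane) = falses in lane.
// Assumes sizeof(bool) == 1, len % 64 == 0, and 4 * (len / 64) <= 255.
int count_false_psubb(const bool* p, int len) {
    const __m128i four = _mm_set1_epi8(4);
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < len; i += 64) {
        __m128i grp = four;                                            // 4 per lane
        grp = _mm_sub_epi8(grp, _mm_loadu_si128((const __m128i*)(p + i)));
        grp = _mm_sub_epi8(grp, _mm_loadu_si128((const __m128i*)(p + i + 16)));
        grp = _mm_sub_epi8(grp, _mm_loadu_si128((const __m128i*)(p + i + 32)));
        grp = _mm_sub_epi8(grp, _mm_loadu_si128((const __m128i*)(p + i + 48)));
        acc = _mm_add_epi8(acc, grp);
    }
    __m128i sad = _mm_sad_epu8(acc, _mm_setzero_si128());              // byte hsum
    return _mm_cvtsi128_si32(sad) + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}
```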
Or other even-less optimal ways of flipping inside the loop before summing.
In the general case of counting compare results, vpcmpeqb / vpsubb is good.
For hot loops we can put off psadbw for up to 255 iterations (254 vpaddb) without overflowing a byte sum, but branch misses for inner-loop trip counts over ~22 on Skylake probably mean that's what we should aim for. Probably with 2 accumulators inside the inner loop to hide latency and maybe allow 2x load+add per clock. More than 2 accumulators probably isn't worth it for a short inner loop: we're breaking that short dep chain every outer-loop iteration anyway, so out-of-order exec can overlap the end of one inner loop with the start of the next and find ILP that way if imperfect scheduling delays the inner loop. (Of course at that point you're probably not running from L1d anymore and we're looking at memory delays.)
Anyway, see https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd for an example of this. (There we get a 0 / -1 vector from the pcmpeqb compare result.)
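A minimal sketch of the pcmpeqb / psubb idiom from that Q&A (names invented): the compare produces 0 / -1 bytes, so subtracting the compare result adds 1 per match to each byte lane, deferring psadbw to the end:

```cpp
#include <immintrin.h>

// Count bytes equal to target. Assumes len % 16 == 0 and len / 16 <= 255
// (no inner/outer loop split in this sketch).
int count_matches(const unsigned char* p, int len, unsigned char target) {
    const __m128i t = _mm_set1_epi8((char)target);
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        // pcmpeqb gives 0 / -1 per lane; psubb turns that into +0 / +1
        acc = _mm_sub_epi8(acc, _mm_cmpeq_epi8(v, t));
    }
    __m128i sad = _mm_sad_epu8(acc, _mm_setzero_si128());  // byte hsum
    return _mm_cvtsi128_si32(sad) + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
}
```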
Even less optimal would be to PXOR + PSADBW for every vector of bytes, into a vector of qword or dword accumulators.
Then you'd have fully-standard auto-vectorization with counter vectors having the same element width as the C variable, and the same contents, so you wouldn't have to prove anything about whether it overflowed or not.
But as a first step, that might be easier to teach llvm about?
Plus it works for summing any uint8_t, without taking advantage of the fact that bools (or compare results) can be added (subtracted) for multiple iterations without overflow.
PSADBW is single-uop on pretty much everything from Merom onwards, even Atom and P4; it's critical for video encoding motion search where it's used with both inputs being variable, not constant zero.
e.g. on Haswell it's 1 uop for p0 only, 5 cycle latency. On SKL it moved to P5 and is down to 3c latency, but competes with shuffles.
KNL has slow VPSADBW YMM, but even its VPSADBW XMM is not bad. (It doesn't have AVX512BW, so it doesn't have VPSADBW ZMM.) KNL is slow for a lot of integer AVX2 YMM instructions that it doesn't have a 512-bit version of.
Horizontal sums of signed bytes (int8_t) without overflow can be done by range-shifting to unsigned with XOR to flip the high bits, then PSADBW, then subtract 16 * 0x80 or whatever at the very end to undo the bias.
This is definitely worth it vs. signed unpacking with pmovsx or worse, punpck + arithmetic shifts to manually do the sign extension.
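The range-shift trick above can be sketched as follows (function name invented): XOR with 0x80 maps int8_t x to the unsigned byte x + 128, psadbw sums the unsigned bytes, and subtracting 16 * 0x80 per vector at the very end undoes the bias:

```cpp
#include <immintrin.h>

// Horizontal sum of signed bytes via range shift + psadbw.
// Assumes len % 16 == 0.
int hsum_i8(const signed char* p, int len) {
    const __m128i bias = _mm_set1_epi8((char)0x80);
    __m128i acc = _mm_setzero_si128();
    int vectors = 0;
    for (int i = 0; i < len; i += 16, ++vectors) {
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        v = _mm_xor_si128(v, bias);    // range-shift: x -> x + 128 (unsigned)
        // paddd is fine: each psadbw partial fits in 16 bits
        acc = _mm_add_epi32(acc, _mm_sad_epu8(v, _mm_setzero_si128()));
    }
    int biased = _mm_cvtsi128_si32(acc) +
                 _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
    return biased - vectors * 16 * 0x80;   // undo the +128 per byte
}
```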