danlark1 opened 3 years ago
Okay, it is even more interesting: removing the first vshrq_n_u8 turns the generated code into the good one:
```c
int MoveMask(uint8x16_t input) {
    // The per-byte vshrq_n_u8(input, 7) is gone; the reinterpret has to
    // stay so the intrinsic types line up.
    uint16x8_t high_bits = vreinterpretq_u16_u8(input);
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}
```
```asm
usra    v0.8h, v0.8h, #7
usra    v0.4s, v0.4s, #14
usra    v0.2d, v0.2d, #28
umov    w0, v0.b[0]
umov    w8, v0.b[8]
bfi     w0, w8, #8, #8
ret
```
And actually there are tests for this in AArch64/arm64-vsra.ll that check it. I believe the problem comes from DAGCombiner::visitAdd, which can prove that the + is the same as an or here:
```cpp
if ((!LegalOperations || TLI.isOperationLegal(ISD::OR, VT)) &&
    DAG.haveNoCommonBitsSet(N0, N1))
  return DAG.getNode(ISD::OR, DL, VT, N0, N1);
```
Yeah, we probably need something like AddLikeOrOp
Let me bump this once again: the new zstd 1.5.0 lost performance with the update because of this:
https://github.com/facebook/zstd/pull/2494#issuecomment-829631487
I can try to fix it on my own; I'll probably just need some guidance.
Those instructions do indeed look redundant. I've put up https://reviews.llvm.org/D98599 to improve the lowering.
The ushr+orr I haven't looked at in any depth, but LLVM is likely being "helpful" by turning the add into an or. That needs to be turned back using something like the IsOrAdd or AddLikeOrOp helpers used in other backends (although known-bits analysis may be needed for the AArch64 shifts to make that work).
Extended Description
```c
int MoveMask(uint8x16_t input) {
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}
```
This generates USHR+ORR for vsraq_n_u16 and vsraq_n_u32 instead of the USRA that GCC emits:
https://gcc.godbolt.org/z/MxP63x
Also, in the Match function there are two redundant ANDs with 0xff: umov w8, v0.b[0] already zero-extends the byte into the register, so the masks are no-ops.
```asm
Match(unsigned char):              // @Match(unsigned char)
        adrp    x8, ctrl
        ldr     q0, [x8, :lo12:ctrl]
        dup     v1.16b, w0
        cmeq    v0.16b, v1.16b, v0.16b
        movi    v1.16b, #1
        and     v0.16b, v0.16b, v1.16b
        ushr    v1.8h, v0.8h, #7
        orr     v0.16b, v1.16b, v0.16b
        ushr    v1.4s, v0.4s, #14
        orr     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v0.2d, #28
        umov    w8, v0.b[0]
        umov    w9, v0.b[8]
        and     x0, x8, #0xff      // Not needed?
        and     x8, x9, #0xff      // Not needed?
        bfi     x0, x8, #8, #8
        ret
```