danlark1 opened 3 years ago
Okay, it is even more interesting: removing the first vshrq_n_u8 turns the generated code into the good one:
```c
int MoveMask(uint8x16_t input) {
    // The per-byte vshrq_n_u8(input, 7) is gone; the reinterpret has to
    // stay so the intrinsic types line up.
    uint16x8_t high_bits = vreinterpretq_u16_u8(input);
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}
```
```asm
usra    v0.8h, v0.8h, #7
usra    v0.4s, v0.4s, #14
usra    v0.2d, v0.2d, #28
umov    w0, v0.b[0]
umov    w8, v0.b[8]
bfi     w0, w8, #8, #8
ret
```
And actually there are tests for this in AArch64/arm64-vsra.ll that check it. I believe the problem comes from DAGCombiner::visitAdd, which can prove that the + is the same as an or here:
```cpp
if ((!LegalOperations || TLI.isOperationLegal(ISD::OR, VT)) &&
    DAG.haveNoCommonBitsSet(N0, N1))
  return DAG.getNode(ISD::OR, DL, VT, N0, N1);
```
Yeah, we probably need something like AddLikeOrOp
Let me bump this once again: the new zstd 1.5.0 lost performance with the update because of this:
https://github.com/facebook/zstd/pull/2494#issuecomment-829631487
I can try to fix it on my own; I'll probably just need some guidance.
Those instructions do indeed look redundant. I've put up https://reviews.llvm.org/D98599 to improve the lowering.
The ushr+orr I haven't looked at in any depth, but LLVM is likely being "helpful" by turning the add into an or. That needs to be turned back using something like the IsOrAdd or AddLikeOrOp helpers used in other backends (although known-bits analysis may be needed for the AArch64 shifts to make that work).
Extended Description
```c
int MoveMask(uint8x16_t input) {
    uint16x8_t high_bits = vreinterpretq_u16_u8(vshrq_n_u8(input, 7));
    uint32x4_t paired16 = vreinterpretq_u32_u16(vsraq_n_u16(high_bits, high_bits, 7));
    uint64x2_t paired32 = vreinterpretq_u64_u32(vsraq_n_u32(paired16, paired16, 14));
    uint8x16_t paired64 = vreinterpretq_u8_u64(vsraq_n_u64(paired32, paired32, 28));
    return vgetq_lane_u8(paired64, 0) | ((int) vgetq_lane_u8(paired64, 8) << 8);
}
```
This generates USHR+ORR for vsraq_n_u16 and vsraq_n_u32 instead of the USRA that GCC emits:
https://gcc.godbolt.org/z/MxP63x
Also, in the Match function there are two redundant ANDs with 0xff: umov w8, v0.b[0] already zero-extends the byte into the register, so the masks are no-ops.
```asm
Match(unsigned char):              // @Match(unsigned char)
        adrp    x8, ctrl
        ldr     q0, [x8, :lo12:ctrl]
        dup     v1.16b, w0
        cmeq    v0.16b, v1.16b, v0.16b
        movi    v1.16b, #1
        and     v0.16b, v0.16b, v1.16b
        ushr    v1.8h, v0.8h, #7
        orr     v0.16b, v1.16b, v0.16b
        ushr    v1.4s, v0.4s, #14
        orr     v0.16b, v1.16b, v0.16b
        usra    v0.2d, v0.2d, #28
        umov    w8, v0.b[0]
        umov    w9, v0.b[8]
        and     x0, x8, #0xff      // Not needed?
        and     x8, x9, #0xff      // Not needed?
        bfi     x0, x8, #8, #8
        ret
```