
Suboptimal codegen for `sdiv` by constants #97909

Open Kmeakin opened 1 month ago

Kmeakin commented 1 month ago

GCC is able to generate shorter code for signed division by non-power of 2 constants than clang: https://godbolt.org/z/axeTh4qb8

AArch64, clang:

sdiv3:                                  // @sdiv3
        mov     w8, #21846                      // =0x5556
        movk    w8, #21845, lsl #16
        smull   x8, w0, w8
        lsr     x9, x8, #63
        lsr     x8, x8, #32
        add     w0, w8, w9
        ret

AArch64, GCC:

sdiv3:
        mov     w1, 21846
        movk    w1, 0x5555, lsl 16
        smull   x1, w0, w1
        lsr     x1, x1, 32
        sub     w0, w1, w0, asr 31
        ret

x86-64, clang:

sdiv3:                                  # @sdiv3
        movsxd  rax, edi
        imul    rax, rax, 1431655766
        mov     rcx, rax
        shr     rcx, 63
        shr     rax, 32
        add     eax, ecx
        ret

x86-64, GCC:

sdiv3:
        movsx   rax, edi
        sar     edi, 31
        imul    rax, rax, 1431655766
        shr     rax, 32
        sub     eax, edi
        ret
llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-x86

Author: Karl Meakin (Kmeakin)

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-aarch64

Author: Karl Meakin (Kmeakin)

davemgreen commented 1 month ago

Maybe related to #97879. @vfdff FYI.

topperc commented 1 month ago

I don't think this is related to #97879.

The issue here is that gcc and llvm use slightly different algorithms: gcc uses the sign of the original dividend for the final shift and subtract, while LLVM uses the sign of the quotient for the final shift and add.

gcc's algorithm requires the subtract to have a different operand order when the divisor is negative. LLVM uses an add, so the operand order doesn't matter (add is commutative), but it requires a different magic constant for the multiply when the divisor is negative.

I think switching to the gcc algorithm doesn't work for vector divides when the vector's divisors have different signs, since we can't change the operand order of the final subtract on a per-element basis.

vfdff commented 1 month ago

Hi @topperc. The sdiv is now expanded into an instruction chain in ISel, which runs later than LoopVectorizer and SLPVectorizer, so wouldn't switching to the gcc algorithm leave vector divides unaffected?