`-O` division is silly compared to `-Osize`

swiftlang / swift

The Swift Programming Language

https://swift.org

Apache License 2.0

67.54k stars 10.35k forks source link

`-O` division is silly compared to `-Osize` #75611

Open oscbyspro opened 2 months ago

oscbyspro commented 2 months ago

Description

No response

Reproduction

Given this function:

func div(_ a: UInt, _ b: UInt) -> UInt {
    a / b
}

Swift 5.10, Godbolt, -O:

output.div(Swift.UInt, Swift.UInt) -> Swift.UInt:
        test    rsi, rsi
        je      .LBB1_5
        mov     rax, rdi
        or      rax, rsi
        shr     rax, 32
        je      .LBB1_2
        mov     rax, rdi
        xor     edx, edx
        div     rsi
        ret
.LBB1_2:
        mov     eax, edi
        xor     edx, edx
        div     esi
        ret
.LBB1_5:
        ud2

Swift 5.10, Godbolt, -Osize:

output.div(Swift.UInt, Swift.UInt) -> Swift.UInt:
        test    rsi, rsi
        je      .LBB1_2
        mov     rax, rdi
        xor     edx, edx
        div     rsi
        ret
.LBB1_2:
        ud2

Expected behavior

I thought -O would be more serious about it.

Environment

x86-64 swiftc 5.10 on https://swift.godbolt.org

Additional information

No response

tbkka commented 2 months ago

Why do you believe the first version is slower? What timing tests have you done? On what processors?

If I'm reading the assembly correctly, the -O version is using a 32-bit division operation when both arguments are less than 32 bits, while the -Osize version always uses the 64-bit division. It makes sense to me that the first version could be faster on processors where 32-bit division is much faster than 64-bit, but I've not done any timing measurements myself.

oscbyspro commented 2 months ago

Feel free to disregard this issue if the -O version performs better.

jckarter commented 2 months ago

It's LLVM that makes this codegen decision (clang will do the same thing), so you could also raise the issue on the LLVM bug tracker if you find the code generation to be less-than-ideal for your use case. From my armchair optimizer's seat, it seems like it would at least be marginally better to do a jne .LBB1_2 and fall into the probably-more-likely 32-bit divisor case rather than jump away to it.