rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org

Bizarre stack usage when working with u128 on T32 targets #117324

Closed ketsuban closed 11 months ago

ketsuban commented 1 year ago

I read a Wikipedia page that mentions a "contrived 32-bit shift" of a 128-bit integer and I thought of Rust's native 128-bit integer support, so I decided to see how efficiently it uses registers on the platform I'm most used to at this point: ARM.

    pub fn rotate(num: u128) -> u128 {
        num.rotate_left(32)
    }

For didactic purposes I'll start with arm-unknown-linux-gnueabi.

    mov r12, r2
    mov r2, r1
    mov r1, r0
    mov r0, r3
    mov r3, r12
    bx lr

The ARM ABI has four registers which functions are allowed to clobber, so it cleanly makes use of all of them. What about when there isn't a fourth scratch register, though? I already have a setup which uses thumbv4t-none-eabi, and I expect it'll push one callee-saved register to the stack and use it as a fourth scratch register instead.

    push {r4, lr}
    movs r4, r2
    movs r2, r1
    movs r1, r0
    movs r0, r3
    movs r3, r4
    ldr r4, [sp, #0x4]
    mov lr, r4
    pop {r4}
    add sp, #0x4
    bx lr

That's less good. It does push r4 like I expected, but there's no reason for it to ever touch lr, and then it has a small fit in the function epilogue.

This is a synthetic problem because I have no reason to ever use a 128-bit integer like this (even for pseudorandom number generation there are algorithms that operate on four 32-bit values individually rather than needing 128-bit operations) but I care about Rust being the best it can be and LLVM is clearly having a time here.
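As a cross-check on both listings, rotating a 128-bit value left by 32 bits amounts to permuting its four 32-bit words, which is why the bodies consist only of register moves. The sketch below assumes a little-endian layout in which word 0 is the least significant word (the one passed in r0); `to_words`, `from_words`, and `rotate_words` are hypothetical helper names introduced purely for illustration.

```rust
/// The function from the issue: rotate a 128-bit integer left by 32 bits.
pub fn rotate(num: u128) -> u128 {
    num.rotate_left(32)
}

/// Hypothetical helper: split a u128 into four 32-bit words,
/// least significant word first (assumed to correspond to r0..r3).
pub fn to_words(num: u128) -> [u32; 4] {
    [
        num as u32,
        (num >> 32) as u32,
        (num >> 64) as u32,
        (num >> 96) as u32,
    ]
}

/// Hypothetical helper: rebuild a u128 from four 32-bit words,
/// least significant word first.
pub fn from_words(w: [u32; 4]) -> u128 {
    (w[0] as u128)
        | ((w[1] as u128) << 32)
        | ((w[2] as u128) << 64)
        | ((w[3] as u128) << 96)
}

/// The same rotation expressed as a word permutation:
/// the old top word wraps around to the bottom and every other word moves up one slot.
/// This mirrors the move sequence in the listings (r0 <- r3, r1 <- r0, r2 <- r1, r3 <- r2).
pub fn rotate_words(w: [u32; 4]) -> [u32; 4] {
    [w[3], w[0], w[1], w[2]]
}
```

For any input `n`, `from_words(rotate_words(to_words(n)))` should equal `rotate(n)`. The permutation needs one temporary to hold a word while the others shift, which is the role r12 plays in the ARM listing and r4 plays in the Thumb listing.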

LunarLambda commented 11 months ago

The reason it pushes lr is because the 32-bit ARM ABI ("AAPCS") dictates the stack always be aligned to 8 bytes, and lr is the highest register accessible to the THUMB push instruction, and commonly needs to be pushed anyway for subroutine calls.
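The alignment arithmetic can be sketched as follows. This is a minimal illustration, assuming 4-byte registers and the 8-byte stack alignment described above; the function and constant names are hypothetical and do not correspond to anything in LLVM.

```rust
/// Size of one pushed register on a 32-bit ARM target.
const WORD_BYTES: u32 = 4;
/// Stack alignment assumed here, per the AAPCS requirement described above.
const STACK_ALIGN_BYTES: u32 = 8;

/// Bytes occupied on the stack by pushing `count` registers.
pub fn push_bytes(count: u32) -> u32 {
    count * WORD_BYTES
}

/// Hypothetical helper: how many extra registers must be pushed so that
/// the total pushed size is a multiple of the stack alignment.
pub fn extra_registers_for_alignment(count: u32) -> u32 {
    let bytes = push_bytes(count);
    // Round up to the next multiple of the alignment.
    let padded = (bytes + STACK_ALIGN_BYTES - 1) / STACK_ALIGN_BYTES * STACK_ALIGN_BYTES;
    (padded - bytes) / WORD_BYTES
}
```

Pushing r4 alone occupies 4 bytes, so `extra_registers_for_alignment(1)` is 1: one more register is needed to reach 8 bytes, and lr is the one chosen in the listing.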

I believe the oddball epilogue is because all 4 scratchable registers (r0-r3) are used for the return value. So it has to restore lr through r4 first, then restore r4.

Normally, if you had a free register, it would do something like pop {rX}; pop {r4}; bx rX, where the saved value of LR gets popped into some free register (since THUMB pop can only pop pc, not lr, which can't be used as a return on ARMv4T due to lack of interworking on pc writes).
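To make the epilogue from the original listing easier to follow, here is a toy model of it in Rust. It is only a sketch: the `Cpu` struct is hypothetical, tracks just r4, lr, and sp, and models the stack as an array of words with sp as a word index, on the assumption that `push {r4, lr}` stores r4 at the lower address.

```rust
/// Hypothetical toy model of the registers involved in the epilogue.
#[derive(Debug, Clone, PartialEq)]
pub struct Cpu {
    pub r4: u32,
    pub lr: u32,
    /// Stack pointer as a word index into `stack`; the stack grows downward.
    pub sp: usize,
    pub stack: [u32; 8],
}

impl Cpu {
    pub fn new(r4: u32, lr: u32) -> Self {
        Cpu { r4, lr, sp: 8, stack: [0; 8] }
    }

    /// push {r4, lr}: the lower-numbered register goes to the lower address.
    pub fn prologue(&mut self) {
        self.sp -= 2;
        self.stack[self.sp] = self.r4;
        self.stack[self.sp + 1] = self.lr;
    }

    /// The epilogue from the listing, one step per instruction.
    pub fn epilogue(&mut self) {
        self.r4 = self.stack[self.sp + 1]; // ldr r4, [sp, #0x4]  (saved lr)
        self.lr = self.r4;                 // mov lr, r4
        self.r4 = self.stack[self.sp];     // pop {r4}            (saved r4)
        self.sp += 1;                      //   pop advances sp by one word
        self.sp += 1;                      // add sp, #0x4        (skip saved lr slot)
    }
}
```

Running `prologue`, overwriting r4 and lr, and then running `epilogue` leaves r4, lr, and sp at their original values in this model, which matches the explanation that lr is restored through r4 before r4 itself is popped.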

So, while LLVM's codegen particularly for v4 THUMB is often suboptimal, I don't think LLVM is really doing anything wrong here, even if it looks very strange.

EDIT: I was going to illustrate this with a godbolt link for C, but it turns out neither GCC nor Clang supports __int128 on ARM targets, and the thumbv4t-none-eabi target can't be used on godbolt because it needs -Z build-std=core. Either way, I'm confident this is simply an architectural limitation and not something LLVM can do anything about.

LunarLambda commented 11 months ago

Also, I think the use of the word "demented" is unnecessary, comes off as hostile, and is arguably ableist. I understand being confused or frustrated about suboptimal (or seemingly suboptimal) codegen but I don't think it necessitates such language.

ketsuban commented 11 months ago

> The reason it pushes lr is because the 32-bit ARM ABI ("AAPCS") dictates the stack always be aligned to 8 bytes

I hate when ABIs force nonsensical codegen. It's a 32-bit platform, the stack shouldn't need greater than 4-byte alignment. Oh well. Guess I'll close this one as "reality is a disappointment".

> Also, I think the use of the word "demented" is unnecessary, comes off as hostile, and is arguably ableist.

It didn't even occur to me that the term had any link with mental health; I'll try to choose my words more carefully. ("Hostile", though?)