rust-lang / rust

Suboptimal codegen for ARM32 targets when performing offset load #125386

Open zesterer opened 1 month ago

zesterer commented 1 month ago

rustc generates suboptimal code on 32-bit ARM targets when performing a load from a base + offset pointer. This seems to be a general issue, rearing its head in a number of programs I've written, including trivial examples.

Since this pattern (loading from a non-constant address that's been offset by an index) is very common in real code, particularly in inner loops, I'd be surprised if it didn't have a non-trivial impact on performance.

Note that LLVM doesn't seem to exhibit this poor behaviour on aarch64 (ARM 64) targets.

unsafe fn read(src: *const u16, n: usize) -> u16 {
    src.byte_add(n).read()
}

produces

read:
        add     r0, r0, r1
        ldrh    r0, [r0]
        bx      lr

I'd expect it to produce

read:
        ldrh    r0, [r0, r1]
        bx      lr

as GCC does. I believe that on many targets (at the very least, armv4) the latter is always faster than the former.
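
As an illustration of the inner-loop case, a minimal sketch follows. The function `sum_strided` and its parameters are hypothetical and not taken from any real program, and the assembly it produces has not been checked here. It only shows the same base + runtime byte offset load as `read`, repeated once per iteration, which is where an extra `add` per load would add up.

```rust
/// Sums `count` u16 values starting at `base`, stepping `stride` bytes each time.
///
/// # Safety
/// For every `i < count`, `base` offset by `i * stride` bytes must be
/// 2-byte aligned and valid for a `u16` read within the same allocation.
#[inline(never)]
pub unsafe fn sum_strided(base: *const u16, stride: usize, count: usize) -> u32 {
    let mut total = 0u32;
    let mut offset = 0usize;
    for _ in 0..count {
        // Same shape as `read`: load a u16 from base + a runtime byte offset.
        let value = unsafe { base.byte_add(offset).read() };
        total += u32::from(value);
        offset += stride;
    }
    total
}
```

Called on a six-element `u16` array with a stride of 4 bytes and a count of 3, this visits every second element.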

Note that this is an issue with LLVM: Clang also exhibits this poor code generation.

Rust (rustc, bad): https://godbolt.org/z/7oPe8crM7
C (Clang, bad): https://godbolt.org/z/4M9E7Kh91
C (GCC, good): https://godbolt.org/z/639cxxKc8

I've not been able to test Rust's new GCC backend since I've not been able to work out how to tell it to generate code for ARM32 targets.

rustc --version --verbose:

rustc 1.79.0-nightly (8b2459c1f 2024-04-09)

RossSmyth commented 1 month ago

> I've not been able to test Rust's new GCC backend since I've not been able to work out how to tell it to generate code for ARM32 targets.

The GCC backend is not a cross-compiler, so you'll have to build it separately for each target you want to compile for.