llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

ARM: -Oz and -O3 avoid post-index immediate offset instructions unnecessarily #63833

Status: Open. johnstiles-google opened this issue 1 year ago.

johnstiles-google commented 1 year ago

Consider the following loop, which copies scalar data into vectors: https://godbolt.org/z/E38feYWPd

Clang is computing addresses with separate add instructions, but this is unnecessary: it could instead use repeated post-index immediate offsets to march the pointer forward through memory. This appears to be safe and does not incur a performance penalty on Apple silicon (Mac ARM) CPUs. I am told it is slower only on the Cortex-A55, a CPU that has never been used in any Apple device. And even where it is slower, it produces smaller code, which is exactly what -Oz is meant to optimize for.

This approach would save two instructions:

    add     x8, x0, w1, uxtw
    add     x11, x0, x1, lsr #32
    ld1r    { v0.4s }, [x8], #4
    ld1r    { v1.4s }, [x8], #4
    ld1r    { v2.4s }, [x8], #4
    ld1r    { v3.4s }, [x8]
    stp     q0, q1, [x11]
    stp     q2, q3, [x11, #32]
    ret
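For readers who do not open the Compiler Explorer link, here is a rough C model of what the listing above computes. It is a hypothetical reconstruction, not the original source: the name `splat_copy`, the `float` element type, and splitting the offsets into two parameters are all assumptions. Judging from the two add instructions, the real function receives both byte offsets packed into `x1` (low 32 bits for the source, high 32 bits for the destination). The pointer bump after each load is the step a post-indexed `[x8], #4` performs for free.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the listing (not the original source).
   src_off and dst_off are byte offsets from base, mirroring
   "add x8, x0, w1, uxtw" and "add x11, x0, x1, lsr #32". */
void splat_copy(unsigned char *base, uint32_t src_off, uint32_t dst_off) {
    assert(base != NULL);
    const unsigned char *src = base + src_off; /* source pointer (x8)      */
    unsigned char *dst = base + dst_off;       /* destination pointer (x11) */
    float vecs[4][4];                          /* stand-ins for q0..q3      */

    for (int v = 0; v < 4; v++) {
        float s;
        memcpy(&s, src, sizeof s);  /* load one 32-bit scalar ...              */
        src += sizeof s;            /* ... then advance 4 bytes (post-index #4) */
        for (int lane = 0; lane < 4; lane++)
            vecs[v][lane] = s;      /* replicate into all four lanes, as ld1r does */
    }
    memcpy(dst, vecs, sizeof vecs); /* 64 contiguous bytes, like the two stp */
}
```

Because every load advances the same pointer by a fixed 4 bytes, the compiler only needs one base register for all four loads, which is what makes the post-indexed form possible.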

For even smaller code, Clang could use ld4r to load all four scalars at once, replicating each one into its own vector register. That would remove three more instructions (the four ld1r become a single ld4r), and no offsets would be needed at all.
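A small C model of the two load forms illustrates why this substitution is valid. The `vec4` type and the function names are made up for illustration and only mimic the register contents; real code would operate on NEON registers. Because the four scalars sit at consecutive addresses, four post-indexed ld1r and one ld4r leave identical values in v0 through v3.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for one 128-bit NEON register viewed as four 32-bit lanes (.4s). */
typedef struct { float lane[4]; } vec4;

/* Model of "ld1r { vN.4s }, [p], #4": load one float, broadcast it to every
   lane, then post-increment the pointer by 4 bytes (one float). */
static vec4 ld1r_post(const float **p) {
    assert(p != NULL && *p != NULL);
    vec4 v;
    for (int i = 0; i < 4; i++)
        v.lane[i] = **p;
    *p += 1;
    return v;
}

/* Model of "ld4r { v0.4s, v1.4s, v2.4s, v3.4s }, [p]": one instruction loads
   four consecutive floats and broadcasts element k into every lane of
   register k. There is no writeback, so p is left unchanged. */
static void ld4r(const float *p, vec4 regs[4]) {
    assert(p != NULL);
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 4; i++)
            regs[k].lane[i] = p[k];
}

/* Returns nonzero if two four-register groups hold identical bits. */
static int same_regs(const vec4 a[4], const vec4 b[4]) {
    return memcmp(a, b, 4 * sizeof(vec4)) == 0;
}
```

The ld4r form also leaves the base pointer untouched, which is why it needs neither add instructions nor post-index offsets for the source address.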

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-aarch64