ARM: -Oz and -O3 avoid post-index immediate offset instructions unnecessarily

Consider the following loop, which copies scalar data into vectors: https://godbolt.org/z/E38feYWPd

Clang generates each load address with a separate add instruction, which is unnecessary. It could instead use repeated post-indexed loads with an immediate offset to march the pointer forward through memory. This is apparently safe and does not incur a performance penalty on the ARM CPUs used in Macs. I am told it has a performance penalty only on the Cortex-A55, a CPU that has never been used in any Apple device. Even if it were slower, it would generate smaller code, which is what -Oz is designed to do.

This approach would save two instructions:

    add     x8, x0, w1, uxtw
    add     x11, x0, x1, lsr #32
    ld1r    { v0.4s }, [x8], #4
    ld1r    { v1.4s }, [x8], #4
    ld1r    { v2.4s }, [x8], #4
    ld1r    { v3.4s }, [x8]
    stp     q0, q1, [x11]
    stp     q2, q3, [x11, #32]
    ret

For even smaller code, Clang could use ld4r to load all four scalars at once. A single ld4r replaces the four ld1r loads, so this is three fewer instructions than the listing above, and it needs no offsets at all.
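
To make the equivalence concrete, here is a minimal portable C sketch that models what the two load sequences do. It is not the code from the Godbolt link and it does not use real NEON intrinsics; the `vec4s` type and the `ld1r_post`, `ld4r`, and `loads_agree` helpers are hypothetical names made up for illustration, and the model assumes the four scalars are contiguous 32-bit values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one 128-bit register viewed as four 32-bit lanes (.4s). */
typedef struct { uint32_t lane[4]; } vec4s;

/* Models `ld1r { vN.4s }, [p], #4`: load one 32-bit element, replicate it
   into every lane, then advance the pointer by 4 bytes (post-index). */
static vec4s ld1r_post(const uint8_t **p) {
    uint32_t x;
    vec4s v;
    memcpy(&x, *p, sizeof x);
    *p += sizeof x;
    for (int l = 0; l < 4; l++) v.lane[l] = x;
    return v;
}

/* Models `ld4r { v0.4s, v1.4s, v2.4s, v3.4s }, [p]`: load four consecutive
   32-bit elements and replicate element r into every lane of register r. */
static void ld4r(const uint8_t *p, vec4s out[4]) {
    for (int r = 0; r < 4; r++) {
        uint32_t x;
        memcpy(&x, p + 4 * r, sizeof x);
        for (int l = 0; l < 4; l++) out[r].lane[l] = x;
    }
}

/* Returns 1 when both load sequences leave identical contents in v0..v3. */
int loads_agree(const uint32_t src[4]) {
    const uint8_t *p = (const uint8_t *)src;
    vec4s a[4], b[4];
    /* The real sequence skips the post-increment on the last load; the extra
       advance here is harmless because p is not read again. */
    for (int r = 0; r < 4; r++) a[r] = ld1r_post(&p);
    ld4r((const uint8_t *)src, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Under these assumptions the two strategies fill the four registers identically, so the choice between them should come down to instruction count and per-core latency rather than semantics.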
