Consider the following loop, which copies scalar data into vectors: https://godbolt.org/z/E38feYWPd
Clang generates each address with a separate add instruction, which is unnecessary. It could instead use post-index addressing with an immediate offset on each load, so that the loads themselves march the pointer forward in memory. This is apparently safe and does not incur a performance penalty on Mac ARM CPUs. I am told the only core where it is slower is the Cortex-A55, which has never been used in any Apple device. Even if it were slower, it would generate smaller code, which is what -Oz is designed to do.
This approach would save two instructions:
For even smaller code, Clang could use ld4r to load all four scalars at once. That gives three fewer instructions and needs no offsets at all.
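As a rough source-level sketch of the two forms, assuming the loop broadcasts four consecutive floats into four 4-lane vectors (the real code is behind the Godbolt link, so the function names and the `out[4][4]` layout here are hypothetical): the first helper issues one replicating load per scalar and steps the pointer between loads, and the second asks for all four at once through the `vld4q_dup_f32` intrinsic, which corresponds to ld4r. A scalar fallback is included only so the sketch also builds on non-AArch64 hosts.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: broadcast src[0..3] into four rows of four lanes,
   one scalar at a time, advancing the pointer after every load. */
void broadcast4_stepwise(const float *src, float out[4][4]) {
    assert(src != NULL);
#if defined(__aarch64__)
    for (int i = 0; i < 4; ++i) {
        /* vld1q_dup_f32 is a load-and-replicate (ld1r); the src++ is the
           pointer step that a post-index load could absorb. */
        vst1q_f32(out[i], vld1q_dup_f32(src++));
    }
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; ++i, ++src)
        for (int lane = 0; lane < 4; ++lane)
            out[i][lane] = *src;
#endif
}

/* Hypothetical helper: same result, but all four scalars are requested
   in a single replicating load, so no pointer stepping is needed. */
void broadcast4_combined(const float *src, float out[4][4]) {
    assert(src != NULL);
#if defined(__aarch64__)
    /* vld4q_dup_f32 corresponds to ld4r: v.val[i] holds src[i] in every lane. */
    float32x4x4_t v = vld4q_dup_f32(src);
    for (int i = 0; i < 4; ++i)
        vst1q_f32(out[i], v.val[i]);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 4; ++i)
        for (int lane = 0; lane < 4; ++lane)
            out[i][lane] = src[i];
#endif
}
```

Both helpers should produce identical rows. Which instructions the compiler actually selects for the stepwise form is exactly what this report is about, so the sketch shows the intended semantics rather than guaranteed codegen.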