[AArch64] Failure to combine 2 `ldr`s into `ldp`

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

29.07k stars 11.98k forks source link

[AArch64] Failure to combine 2 `ldr`s into `ldp` #100807

Open Kmeakin opened 3 months ago

Kmeakin commented 3 months ago

https://godbolt.org/z/jvxE5K9sW

typedef __uint128_t u128;

u128 add3(u128 *x, u128 *y, u128 *z) { return *x + *y + *z; }
u128 mul3(u128 *x, u128 *y, u128 *z) { return *x * *y * *z; }
u128 mul_add(u128 *x, u128 *y, u128 *z) { return *x * *y + *z; }

add3 loads each of x, y, and z with the ldp instruction, but mul3 and mul_add splits the load of y into two ldrs instead of using a single ldp. GCC has no such issue

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Karl Meakin (Kmeakin)

https://godbolt.org/z/jvxE5K9sW ```c typedef __uint128_t u128; u128 add3(u128 *x, u128 *y, u128 *z) { return *x + *y + *z; } u128 mul3(u128 *x, u128 *y, u128 *z) { return *x * *y * *z; } u128 mul_add(u128 *x, u128 *y, u128 *z) { return *x * *y + *z; } ``` `add3` loads each of `x`, `y`, and `z` with the `ldp` instruction, but `mul3` and `mul_add` splits the load of `y` into two `ldr`s instead of using a single `ldp`. GCC has no such issue

tgymnich commented 3 months ago

This does however come with slightly lower register pressure. Is that a tradeoff worth making?

pinskia commented 3 months ago

This does however come with slightly lower register pressure. Is that a tradeoff worth making?

On many aarch64 processors, ldp x* has the same latency as one ldr x*. So it is not just lower register pressure. And if you had 3 load units, you could get slightly better performance with the ldp version. I highly doubt it is that much lower.

efriedma-quic commented 3 months ago

Currently, on AArch64 ldp formation runs after register allocation, so it can fail if the registers overlap with other instructions.

I thought we had some handling for this, but I guess we don't. 32-bit ARM has a dedicated pre-RA pass, but integrating with the scheduler is probably simpler. (See AArch64MacroFusion.cpp.)

The impact on register pressure obviously depends on how far you're moving the instruction... short distances like this example are unlikely to have a significant impact. It might be a concern over longer distances.

pinskia commented 3 months ago

Note GCC starting in GCC 14 has 2 passes to do the LDP/STP fusion, one before right before register allocation (after the scheduling pass before RA) and one after register allocation (after the scheduling for fusion pass but before the main scheduling pass after RA) . GCC 13 and before was done as part of the standard peephole pass which is done right after the scheduling for fusion pass (which is after RA).