Open Kmeakin opened 3 months ago
@llvm/issue-subscribers-backend-aarch64
Author: Karl Meakin (Kmeakin)
This does however come with slightly lower register pressure. Is that a tradeoff worth making?
On many aarch64 processors, ldp x* has the same latency as one ldr x*. So it is not just lower register pressure. And if you had 3 load units, you could get slightly better performance with the ldp version. I highly doubt it is that much lower.
Currently, on AArch64 ldp formation runs after register allocation, so it can fail if the registers overlap with other instructions.
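To make the failure mode above concrete, here is a toy model of one check a post-RA pairing pass has to make. This is not LLVM's actual load/store optimizer: the `Inst` struct, `NO_REG`, and `can_pair` are hypothetical names, and the model ignores memory aliasing, offsets, and everything else a real pass must verify. It only shows how register assignments chosen by the allocator can block a merge.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_REG (-1)

/* Hypothetical, simplified instruction: one register written, up to two read. */
typedef struct {
    int def;    /* register written, or NO_REG */
    int use[2]; /* registers read, or NO_REG */
} Inst;

static bool reads(const Inst *in, int reg) {
    return reg != NO_REG && (in->use[0] == reg || in->use[1] == reg);
}

static bool writes(const Inst *in, int reg) {
    return reg != NO_REG && in->def == reg;
}

/* Loads are modelled as Inst{ dest, { base, NO_REG } }.
 * Returns true if the load at index `second` could be hoisted up next to
 * the load at index `first`, so that the two could become a single ldp.
 * Assumes both loads already use the same base and adjacent offsets. */
bool can_pair(const Inst *code, size_t first, size_t second) {
    int dest = code[second].def;
    int base = code[second].use[0];

    /* ldp needs two distinct destination registers. */
    if (code[first].def == dest)
        return false;

    for (size_t i = first; i < second; ++i) {
        /* If the base is redefined, the second load reads a different address. */
        if (writes(&code[i], base))
            return false;
        /* An instruction in between that reads or writes the second load's
         * destination would see (or produce) the wrong value after hoisting. */
        if (i > first && (reads(&code[i], dest) || writes(&code[i], dest)))
            return false;
    }
    return true;
}
```

In this model, `ldr x8, [x1]; mul x9, x8, x8; ldr x10, [x1, #8]` can be paired, but if the allocator happens to reuse x9 as the destination of the second load, the pairing is rejected. Before register allocation the destinations are still virtual registers, so that particular conflict cannot arise.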
I thought we had some handling for this, but I guess we don't. 32-bit ARM has a dedicated pre-RA pass, but integrating with the scheduler is probably simpler. (See AArch64MacroFusion.cpp.)
The impact on register pressure obviously depends on how far you're moving the instruction... short distances like this example are unlikely to have a significant impact. It might be a concern over longer distances.
Note that, starting with GCC 14, GCC has two passes that do LDP/STP fusion: one right before register allocation (after the pre-RA scheduling pass) and one after register allocation (after the scheduling-for-fusion pass but before the main post-RA scheduling pass). In GCC 13 and earlier, the fusion was done as part of the standard peephole pass, which runs right after the scheduling-for-fusion pass (which is after RA).
https://godbolt.org/z/jvxE5K9sW
add3 loads each of x, y, and z with the ldp instruction, but mul3 and mul_add split the load of y into two ldrs instead of using a single ldp. GCC has no such issue.
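For readers who cannot open the Compiler Explorer link, the following is only a guess at the shape of the reproduction, not the code behind the link. It assumes x, y, and z are pointers to 128-bit integers (which would explain why each load needs two x registers); the signatures and the `u128` typedef are assumptions made for illustration.

```c
/* Hypothetical reconstruction; the real source is at the godbolt link. */
typedef unsigned __int128 u128;

/* Each 128-bit load can be a single ldp of two x registers. */
u128 add3(const u128 *x, const u128 *y, const u128 *z) {
    return *x + *y + *z;
}

u128 mul3(const u128 *x, const u128 *y, const u128 *z) {
    return *x * *y * *z;
}

u128 mul_add(const u128 *x, const u128 *y, const u128 *z) {
    return *x * *y + *z;
}
```

Compiling something like this with clang and gcc at -O2 for an AArch64 target and comparing the loads of y would be one way to check whether the difference described above still shows up.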