Open fa778132-f3d5-4559-b45d-fa683057d467 opened 3 years ago
The big missed optimization here is essentially SLP vectorization: we want to take the four scalar loads, and turn them into a vector load, so we can take advantage of the special fmla encoding.
Not sure why we're forming the ld1r for exactly one of the loads.
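To make the SLP point concrete, here is a rough Python model of the computation, assuming the kernel is a 4-element row times a 4x4 matrix (x0 pointing at the row, x1 at the matrix). The function names and the list-based "registers" are made up for illustration; only the lane arithmetic of the by-element fmla form is modeled.

```python
# Toy model: 4-lane float vectors are plain Python lists.
# All names here are hypothetical, chosen only for this sketch.

def fmla_by_element(acc, vn, vm, lane):
    """Model of `fmla acc.4s, vn.4s, vm.s[lane]`: acc + vn * vm[lane]."""
    return [a + n * vm[lane] for a, n in zip(acc, vn)]


def row_times_matrix_scalar_loads(a, b):
    """Four scalar loads: each a[k] lands in lane 0 of its own register."""
    acc = [0.0] * 4
    for k in range(4):
        s = [a[k], 0.0, 0.0, 0.0]  # like a separate `ldr s<n>` per element
        acc = fmla_by_element(acc, b[k], s, 0)
    return acc


def row_times_matrix_vector_load(a, b):
    """One vector load of a[0..3]; the lane index picks out a[k]."""
    va = list(a)  # like a single `ldr q<n>, [x0]`
    acc = [0.0] * 4
    for k in range(4):
        acc = fmla_by_element(acc, b[k], va, k)
    return acc
```

Both functions perform the same multiply-accumulates; the second just needs one load for the row instead of four, which is what the by-element encoding makes possible.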
The other bit which Florian pointed out is that we're missing one ldr+ldr->ldp fusion because the load/store optimizer is greedy. In practice, I'm not sure this is actually much of a performance loss, but it looks ugly.
One current limitation of the AArch64 load/store optimizer is that it fuses loads/stores eagerly, which sometimes prevents further pairing opportunities.
With the trunk version of LLVM, code generation for this case has improved slightly. However, the load/store optimization still doesn't appear to be applied to all of the loads.
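A small sketch of how eager pairing can lose a pair, in Python. This is only a toy model, not the algorithm in AArch64LoadStoreOptimizer.cpp: it ignores aliasing, register dependencies and scan limits, and the program order of the loads below is an assumption picked for illustration.

```python
# Illustrative model only; function names are hypothetical.

def pair_greedy(offsets, size):
    """Pair each load with the first later load at an adjacent address."""
    remaining = list(offsets)
    pairs, singles = [], []
    while remaining:
        first = remaining.pop(0)
        # Scan forward in program order for the first adjacent load.
        match = next((o for o in remaining if abs(o - first) == size), None)
        if match is None:
            singles.append(first)
        else:
            remaining.remove(match)
            pairs.append((min(first, match), max(first, match)))
    return pairs, singles


def pair_sorted(offsets, size):
    """Sort by address first, then pair neighbours left to right."""
    ordered = sorted(offsets)
    pairs, singles = [], []
    i = 0
    while i < len(ordered):
        if i + 1 < len(ordered) and ordered[i + 1] - ordered[i] == size:
            pairs.append((ordered[i], ordered[i + 1]))
            i += 2
        else:
            singles.append(ordered[i])
            i += 1
    return pairs, singles


# Assumed program order of four 16-byte loads from one base register.
program_order = [16, 32, 48, 0]
print(pair_greedy(program_order, 16))  # one pair, two single loads
print(pair_sorted(program_order, 16))  # two pairs, no single loads
```

With this order the eager version pairs offsets 16 and 32 right away and leaves 48 and 0 as single loads, while looking at all four addresses first gives two pairs.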
ldp q2, q3, [x1, #16]
ldp s1, s0, [x0]
ldr q4, [x1, #48]
fmul.4s v0, v2, v0[0]
ldr q2, [x1]
fmla.4s v0, v2, v1[0]
ldp s1, s2, [x0, #8]
fmla.4s v0, v3, v1[0]
fmla.4s v0, v4, v2[0]
ret
https://ispc.godbolt.org/z/6r3q6MYa1
Is there any possibility that this issue will be addressed in upcoming releases?
Extended Description
On Arm64, LLVM emits more loads than necessary; several of them could be fused.
The following LLVM IR is a reduced test case from https://github.com/ispc/ispc/issues/2052. That bug also contains a larger matrix-multiply test in ispc that needs to be fixed as well.
When compiling this LLVM IR with LLVM 11, we get the following code:
The load/store optimizer does not seem to catch the pattern: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
The expected output (hand-optimized) is: