[AArch64] On Neoverse V2, transform ld4 into ld2 + uzp* sequences

mingmingl-llvm commented 3 months ago

https://godbolt.org/z/K17nh31oG shows that an `ld4` instruction can be transformed into an `ld2 + uzp*` sequence that gives equivalent program output (at least on little-endian systems).

According to the Neoverse V2 SOG (https://developer.arm.com/documentation/109898/latest/), section 4.16: _The bandwidth of the following ASIMD and SVE instructions is limited by decode constraints and it is advisable to avoid them when high performing code is desired._

Would it make sense to do this transformation in the compiler (say, in the instruction selection phase) for Neoverse V2? A microbenchmark shows a measurable throughput increase. On the other hand, the longer sequence may increase instruction cache pressure.
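To make the claimed equivalence concrete, here is a portable scalar sketch. It is not the code behind the Godbolt link. It assumes 32-bit elements in four-lane registers, and every name in it (`vec_t`, `model_ld4`, `model_ld2`, `model_uzp`, `model_ld2_uzp`, `check_ld4_equivalence`) is hypothetical. It models only element order, so it says nothing about endianness or performance.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4 /* assumed: four 32-bit lanes per 128-bit register */

typedef struct { uint32_t lane[LANES]; } vec_t;

/* Model of ld4: memory element 4*i + j lands in lane i of register j. */
void model_ld4(const uint32_t *mem, vec_t out[4]) {
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < 4; ++j)
            out[j].lane[i] = mem[4 * i + j];
}

/* Model of ld2: memory element 2*i + j lands in lane i of register j. */
void model_ld2(const uint32_t *mem, vec_t out[2]) {
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < 2; ++j)
            out[j].lane[i] = mem[2 * i + j];
}

/* Model of uzp1 (odd == 0) and uzp2 (odd == 1): keep the even- or
   odd-indexed elements of the concatenation of a followed by b. */
vec_t model_uzp(vec_t a, vec_t b, int odd) {
    uint32_t cat[2 * LANES];
    vec_t r;
    memcpy(cat, a.lane, sizeof a.lane);
    memcpy(cat + LANES, b.lane, sizeof b.lane);
    for (int i = 0; i < LANES; ++i)
        r.lane[i] = cat[2 * i + odd];
    return r;
}

/* ld4 rewritten as two ld2 loads followed by four uzp shuffles. */
void model_ld2_uzp(const uint32_t *mem, vec_t out[4]) {
    vec_t lo[2], hi[2];
    model_ld2(mem, lo);             /* elements 0..7  */
    model_ld2(mem + 2 * LANES, hi); /* elements 8..15 */
    out[0] = model_uzp(lo[0], hi[0], 0);
    out[2] = model_uzp(lo[0], hi[0], 1);
    out[1] = model_uzp(lo[1], hi[1], 0);
    out[3] = model_uzp(lo[1], hi[1], 1);
}

/* Self-check: both sequences must fill the four registers identically. */
void check_ld4_equivalence(void) {
    uint32_t mem[4 * LANES];
    vec_t ref[4], alt[4];
    for (uint32_t k = 0; k < 4 * LANES; ++k)
        mem[k] = 100 + k;
    model_ld4(mem, ref);
    model_ld2_uzp(mem, alt);
    assert(memcmp(ref, alt, sizeof ref) == 0);
}
```

Calling `check_ld4_equivalence()` from a test harness exercises the mapping. Each `ld2` splits its eight elements by parity, and the `uzp1`/`uzp2` pair then splits by parity again, which together yields the stride-4 deinterleave that `ld4` performs.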

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Mingming Liu (minglotus-6)


mingmingl-llvm commented 3 months ago

cc @davemgreen for thoughts, thanks!

mingmingl-llvm commented 3 months ago

Similarly, the same section of the Neoverse V2 software optimization guide advises against `st4`. I wonder whether it makes general sense for performance to transform `st4` into a series of `st2 + zip*` instructions, as in https://godbolt.org/z/ch4nccf4x?

If the compile options indicate that a function should be optimized for size, the transformation from ld4/st4 to a series of instructions shouldn't take place.
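The store direction can be sketched the same way. This is a portable scalar model under the same assumptions (32-bit elements, four lanes), with hypothetical names throughout. The zip pairing shown is one arrangement that works and may differ from the one in the Godbolt link.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4 /* assumed: four 32-bit lanes per 128-bit register */

typedef struct { uint32_t lane[LANES]; } vec_t;

/* Model of st4: lane i of register j is written to memory element 4*i + j. */
void model_st4(const vec_t in[4], uint32_t *mem) {
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < 4; ++j)
            mem[4 * i + j] = in[j].lane[i];
}

/* Model of st2: lane i of register j is written to memory element 2*i + j. */
void model_st2(const vec_t in[2], uint32_t *mem) {
    for (int i = 0; i < LANES; ++i)
        for (int j = 0; j < 2; ++j)
            mem[2 * i + j] = in[j].lane[i];
}

/* Model of zip1 (high == 0) and zip2 (high == 1): interleave the low or
   high halves of a and b. */
vec_t model_zip(vec_t a, vec_t b, int high) {
    vec_t r;
    int base = high ? LANES / 2 : 0;
    for (int i = 0; i < LANES / 2; ++i) {
        r.lane[2 * i] = a.lane[base + i];
        r.lane[2 * i + 1] = b.lane[base + i];
    }
    return r;
}

/* st4 rewritten as four zip shuffles followed by two st2 stores. */
void model_zip_st2(const vec_t in[4], uint32_t *mem) {
    vec_t lo[2], hi[2];
    lo[0] = model_zip(in[0], in[2], 0);
    lo[1] = model_zip(in[1], in[3], 0);
    hi[0] = model_zip(in[0], in[2], 1);
    hi[1] = model_zip(in[1], in[3], 1);
    model_st2(lo, mem);             /* elements 0..7  */
    model_st2(hi, mem + 2 * LANES); /* elements 8..15 */
}

/* Self-check: both sequences must write identical memory contents. */
void check_st4_equivalence(void) {
    vec_t regs[4];
    uint32_t ref[4 * LANES], alt[4 * LANES];
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < LANES; ++i)
            regs[j].lane[i] = (uint32_t)(10 * j + i);
    model_st4(regs, ref);
    model_zip_st2(regs, alt);
    assert(memcmp(ref, alt, sizeof ref) == 0);
}
```

Note the pairing: registers 0 and 2 are zipped together, as are 1 and 3, so that the interleave done by `st2` restores the 0, 1, 2, 3 order in memory.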

sjoerdmeijer commented 3 months ago

We have also been looking at LD3/LD4s recently. I generally agree with the analysis here. The LLVM-MCA report shows that the store can start earlier. Effectively, what the expanded sequence does is break up dependencies and expose more ILP, so that some of the instructions can start a bit earlier. In the case of the LD4, all dependent stores are stalled until the LD4 completes, and I was wondering whether that is accurate and whether this is what actually happens in hardware. In other words, I am wondering whether some of the destination registers become available earlier. Maybe we could figure this out with some nano-benchmarks.

Expanding one LD4 into two LD2s and four unzips is not really pretty and has some disadvantages, as you also mentioned, but if it is faster...? Maybe we need to benchmark this more to see whether it is really a win.

sjoerdmeijer commented 3 months ago

Adding one more thought here: the examples use intrinsics. There may be a difference between generating LD3/LD4s from intrinsics and recognising and emitting them from plain C/C++. Maybe there is an expectation that LD4 will be emitted if the intrinsic is used, but perhaps less so from source code.

davemgreen commented 3 months ago

I suspect it will depend on the specific case whether things get better or worse. I agree that testing it to be sure is probably best.

The large complex instructions like these are split into multiple uops during decode, similar to the individual instructions (multiple load and zip operations). This puts constraints on how many can be decoded, which I believe is what the SWOG is referring to. Emitting multiple instructions could be even worse if the limit is the number of instructions, not the number of complex decoders.

LD4/ST4 often come together in groups in the same loops though. It might be worth expanding them if there are multiple of them in close proximity.

mingmingl-llvm commented 3 months ago

Thank you both for the comments!

I'd agree that benchmarking/testing is the best way forward to know whether transforming ld4/st4 to instruction sequences is generally useful; I also agree that it will depend on the specific case whether things get better or worse.

Currently there isn't a way to get visibility into whether ld4/st4 are emitted and how many cycles are spent on them. There are ongoing internal efforts to make getting that visibility much easier, and I plan to identify the expensive ld4/st4 instances once it is available. I'll recommend that internal users rewrite the intrinsic calls and benchmark to make the best choice.

> LD4/ST4 often come together in groups in the same loops though. It might be worth expanding them if there are multiple of them in close proximity.

LD4/ST4 also come together in the motivating use case.

> Maybe there is an expectation that LD4 will be emitted if the intrinsic is used, but perhaps less so from source code.

Hmm, I hadn't thought about the source code expectation. I guess the compiler has the freedom to generate an obviously more efficient code sequence (for example, in the DAG combiner) for a given intrinsic function, but less freedom (if any) to trade off {increased ILP, reduced instruction decoding pressure} against lower icache pressure.

davemgreen commented 3 months ago

> LD4/ST4 often come together in groups in the same loops though. It might be worth expanding them if there are multiple of them in close proximity.

I might have been incorrect about whether it matters that there are multiple of them together. It might be enough for a single one to hold up decoding until other operations can continue.

llvm / llvm-project

[AArch64] On Neoverse V2, transform ld4 into ld2 + uzp* sequences #103481