[AArch64] The `umlal` instruction that cannot be executed in parallel?

DianQK commented 2 months ago

The `reassociate` pass changes the instruction order of the following IR:

target datalayout = "e-m:o-i64:64-i48:128-n32:64-S128"
target triple = "arm64-apple-macosx11.0.0"

define <2 x i64> @src(ptr %arg, ptr %arg1, i64 noundef %arg2, <2 x i64> %arg3, <2 x i64> %arg4, <2 x i64> %arg5, <4 x i32> %arg6, <4 x i32> %arg7) {
bb:
  %i = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %i8 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %i9 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i, <2 x i32> %i8)
  %i10 = add <2 x i64> %i9, %arg5
  %i11 = shufflevector <4 x i32> %arg6, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
  %i12 = shufflevector <4 x i32> %arg7, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
  %i13 = tail call <2 x i64> @llvm.aarch64.neon.umull.v2i64(<2 x i32> %i11, <2 x i32> %i12)
  %i14 = add <2 x i64> %i13, %arg5
  ; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i10)
  ; tail call void asm sideeffect alignstack "/* ${0} */", "w,~{cc},~{memory}"(<2 x i64> %i14)
  %i15 = add <2 x i64> %i10, %arg3
  %i16 = add <2 x i64> %i14, %arg4
  %i17 = mul <2 x i64> %i15, %i16
  ret <2 x i64> %i17
}

The changes in the assembly instructions are as follows:

; origin
        umlal2.2d       v2, v3, v4
        umlal.2d        v5, v3, v4
        add.2d  v1, v2, v1
        add.2d  v0, v5, v0
; after reassociate
        add.2d  v0, v2, v0
        add.2d  v1, v2, v1
        umlal.2d        v0, v3, v4
        umlal2.2d       v1, v3, v4
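
To make the two listings easier to compare, here is a minimal Python sketch of the lane arithmetic, not taken from the issue. It assumes the register mapping implied by the IR (`v0` = `%arg3`, `v1` = `%arg4`, `v2` = `%arg5`, `v3` = `%arg6`, `v4` = `%arg7`) and models only the values before the final `mul`; the function names `origin` and `after_reassociate` are illustrative.

```python
MASK64 = (1 << 64) - 1  # each .2d lane is a 64-bit unsigned integer


def umlal(acc, n, m):
    """umlal.2d: acc[i] += n[i] * m[i], reading the low two 32-bit lanes of n and m."""
    return [(acc[i] + n[i] * m[i]) & MASK64 for i in range(2)]


def umlal2(acc, n, m):
    """umlal2.2d: same as umlal, but reading the high two 32-bit lanes of n and m."""
    return [(acc[i] + n[i + 2] * m[i + 2]) & MASK64 for i in range(2)]


def add(a, b):
    """add.2d: lane-wise 64-bit wrapping addition."""
    return [(x + y) & MASK64 for x, y in zip(a, b)]


def origin(arg3, arg4, arg5, arg6, arg7):
    # Multiply-accumulate into arg5 first, then add arg3 / arg4.
    lo = umlal(arg5, arg6, arg7)
    hi = umlal2(arg5, arg6, arg7)
    return add(lo, arg3), add(hi, arg4)


def after_reassociate(arg3, arg4, arg5, arg6, arg7):
    # Add arg5 to arg3 / arg4 first, then multiply-accumulate into the sums.
    lo = umlal(add(arg5, arg3), arg6, arg7)
    hi = umlal2(add(arg5, arg4), arg6, arg7)
    return lo, hi
```

Both functions return the same pair of vectors for any inputs, so the two orders differ only in their dependency chains, which makes this a scheduling question and not a correctness one.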

The altered instruction order performs significantly worse on the Apple M1. (I am not sure whether this is also the case for other ARM processors.) My tentative guess is that the `add` instruction prevents the `umlal` instructions from executing in parallel. Perhaps we need an `llvm.aarch64.neon.umlal.*` intrinsic?

Here's a real example:
Rust: https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/Is.20instruction.20ordering.20something.20to.20file.20issues.20about.3F/near/453056084
C: https://github.com/Cyan4973/xxHash/blob/a57f6cce2698049863af8c25787084ae0489d849/xxhash.h#L5312-L5323
Godbolt: https://llvm.godbolt.org/z/oeKqn19ff

llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-aarch64

DianQK commented 2 months ago

@davemgreen @efriedma-quic Could you take a look at this issue (to see if it's what I suspect it is)? :p

davemgreen commented 2 months ago

It looks like this is probably true of other cores too if the umlal can start executing earlier. We have usually tried to solve issues like this in the MachineCombiner, which can take the latencies and depths of the instructions into account to re-associate the result back. Could the same thing work here?
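
As a rough illustration of why operand depths matter here, the following toy model (an editorial sketch, not MachineCombiner's actual algorithm) computes when the pre-`mul` value of one chain becomes available under each ordering. The latencies are hypothetical placeholders, not measurements of the Apple M1 or any other core, and the model ignores issue width and accumulator forwarding.

```python
# Hypothetical latencies in cycles, chosen only for illustration.
LATENCY = {"umlal": 4, "add": 2}


def finish(op, *operand_ready):
    """An instruction starts once its last operand is ready and completes LATENCY[op] cycles later."""
    return max(operand_ready) + LATENCY[op]


def origin_depth(ready):
    # umlal accumulates into arg5 first; the add with arg3 comes last.
    mla = finish("umlal", ready["arg5"], ready["arg6"], ready["arg7"])
    return finish("add", mla, ready["arg3"])


def reassociated_depth(ready):
    # arg5 + arg3 is computed first; the umlal accumulates into that sum.
    total = finish("add", ready["arg5"], ready["arg3"])
    return finish("umlal", total, ready["arg6"], ready["arg7"])


if __name__ == "__main__":
    late_arg3 = {"arg3": 10, "arg5": 0, "arg6": 0, "arg7": 0}
    print(origin_depth(late_arg3), reassociated_depth(late_arg3))
```

Under these assumed numbers, a late `%arg3` favors the original order because the `umlal` has already finished by the time `%arg3` arrives, while late multiplicands favor the reassociated order. A combiner therefore needs the actual depths of the operands to pick the better form.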

davemgreen commented 2 months ago

https://github.com/llvm/llvm-project/pull/99634 mentions the barriers were not needed in that version, but I can imagine with slightly different code the un-reassociation would still be useful.

DianQK commented 2 months ago

> It looks like this is probably true of other cores too if the umlal can start executing earlier. We have usually tried to solve issues like this in the MachineCombiner, which can take the latencies and depths of the instructions into account to re-associate the result back. Could the same thing work here?

That sounds reasonable. It looks like we need to consider this issue in the loop. Fortunately, the number of instructions is quite small: https://rust.godbolt.org/z/4hTnqcnMz. But I know very little about the details of CPU execution, and so far, I haven't found the reason.

llvm / llvm-project

[AArch64] The `umlal` instruction that cannot be executed in parallel? #100371