Suboptimal lowering of `vget_high_XXX` on AArch64

Maratyszcza commented 2 years ago

Clang/LLVM lowers a = vget_high_XXX(b) NEON intrinsics to EXT Va.16B, Vb.16B, Vb.16B, #8 instead of the DUP Va.1D, Vb.D[1] suggested by ARM NEON intrinsics guide. This lowering has two drawbacks:

On AArch64 cores which split NEON instructions into 64-bit microoperations, EXT Va.16B, Vb.16B, Vb.16B, #8 generates two 64-bit microoperations instead of one.
On all AArch64 cores EXT Va.16B, Vb.16B, Vb.16B, #8 leaves the upper part of the destination register initialized (but unused), thereby preventing its power-gating.

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-aarch64

rubdos commented 2 months ago

I'd like to see whether I can implement this, but my experience with LLVM is zero (apart from compiling and some packaging).

Cc. @kbeyls, @davemgreen, @Tarinn

@kbeyls said on IRC that sometimes EXT might be faster than DUP, but didn't remember when or why.

I came up with this test case in vcombine.ll, which would cover our case in curve25519-dalek.

define <2 x i32> @vget_high32(ptr %A) nounwind {
; CHECK-LABEL: vget_high32
; CHECK-NOT: vst
; CHECK-NOT: vmov
; CHECK-LE-NOT: vld1.64 {d16, d17}, [r0]
; CHECK: vldr  d16, [r0, #8]
; CHECK: dup
    %tmp1 = load <4 x i32>, ptr %A
        %tmp2 = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
        ret <2 x i32> %tmp2
}

davemgreen commented 2 months ago

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead: https://github.com/llvm/llvm-project/blob/a625435d3ef4c7bbfceb44498b9b5a2cbbed838b/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L9463 There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

rubdos commented 2 months ago

Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.

Yes, that's what's meant here, as far as I understand.

I added armv7 with neon as a separate tab too, but I don't observe any neon instructions inserted for that target. https://godbolt.org/z/qP6bs4sq6 I'm unfamiliar with llc, so I'm probably hitting the wrong buttons.

There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:

llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.td

Line 9463 in a625435 def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),

There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?

I'll have to study that file a bit more to understand what exactly needs changing, will continue on Monday! Thanks for the pointers!

davemgreen commented 2 months ago

Yeah - Arm has a different register layout to AArch64, where the Q0 register is made up of D0 and D1 so the extract should be as cheap as just using the second register. (In that case it also looks like it is splitting the load into two register).

llvm / llvm-project

Suboptimal lowering of `vget_high_XXX` on AArch64 #58323