Open Maratyszcza opened 2 years ago
@llvm/issue-subscribers-backend-aarch64
I'd like to see whether I can implement this, but my experience with LLVM is zero (apart from compiling and some packaging).
Cc. @kbeyls, @davemgreen, @Tarinn
@kbeyls said on IRC that sometimes EXT might be faster than DUP, but didn't remember when or why.
I came up with this test case in vcombine.ll
, which would cover our case in curve25519-dalek.
define <2 x i32> @vget_high32(ptr %A) nounwind {
; CHECK-LABEL: vget_high32
; CHECK-NOT: vst
; CHECK-NOT: vmov
; CHECK-LE-NOT: vld1.64 {d16, d17}, [r0]
; CHECK: vldr d16, [r0, #8]
; CHECK: dup
%tmp1 = load <4 x i32>, ptr %A
%tmp2 = shufflevector <4 x i32> %tmp1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
ret <2 x i32> %tmp2
}
Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.
There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead: https://github.com/llvm/llvm-project/blob/a625435d3ef4c7bbfceb44498b9b5a2cbbed838b/llvm/lib/Target/AArch64/AArch64InstrInfo.td#L9463 There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?
Do you mean the DUP which is a MOV, as in the GISel version of https://godbolt.org/z/jxEnWa1jz? If so I think that sounds good.
Yes, that's what's meant here, as far as I understand.
I added armv7 with neon as a separate tab too, but I don't observe any neon instructions inserted for that target. https://godbolt.org/z/qP6bs4sq6
I'm unfamiliar with llc
, so I'm probably hitting the wrong buttons.
There are tablegen patterns for "extract_subvector", that probably want to use DUPi64 as opposed to the EXT instructions. It looks like they are defined in a couple of places at the moment. The ones here look incorrect and a probably unused/dead:
llvm-project/llvm/lib/Target/AArch64/AArch64InstrInfo.td
Line 9463 in a625435 def : Pat<(v8i8 (extract_subvector (v16i8 FPR128:$Rn), (i64 1))),
There are ones in the ExtPat<> multiclass too. Hopefully one of the two can be updated to use the new instruction, and the other removed?
I'll have to study that file a bit more to understand what exactly needs changing, will continue on Monday! Thanks for the pointers!
Yeah - Arm has a different register layout to AArch64, where the Q0 register is made up of D0 and D1 so the extract should be as cheap as just using the second register. (In that case it also looks like it is splitting the load into two register).
Clang/LLVM lowers
a = vget_high_XXX(b)
NEON intrinsics toEXT Va.16B, Vb.16B, Vb.16B, #8
instead of theDUP Va.1D, Vb.D[1]
suggested by ARM NEON intrinsics guide. This lowering has two drawbacks:EXT Va.16B, Vb.16B, Vb.16B, #8
generates two 64-bit microoperations instead of one.EXT Va.16B, Vb.16B, Vb.16B, #8
leaves the upper part of the destination register initialized (but unused), thereby preventing its power-gating.