Open Validark opened 2 months ago
@llvm/issue-subscribers-backend-aarch64
Author: Niles Salter (Validark)
With suitable endian handling we could do worse than using this generic fallback expansion: https://clang.godbolt.org/z/e6eME3nx1
define <16 x i8> @bc(i16 %a0) {
%v0 = insertelement <16 x i16> poison, i16 %a0, i32 0
%v1 = shufflevector <16 x i16> %v0, <16 x i16> poison, <16 x i32> zeroinitializer
%v2 = and <16 x i16> %v1, <i16 1, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 512, i16 1024, i16 2048, i16 4096, i16 8192, i16 16384, i16 32768>
%v3 = icmp ne <16 x i16> %v2, zeroinitializer
%v4 = sext <16 x i1> %v3 to <16 x i8>
ret <16 x i8> %v4
}
Do you think that's better than the tbl+cmtst routine? Dup+cmtst*2+zip+mvnot is probably going to be higher latency?
I was suggesting it as a generic fallback, as most targets suffer from terrible codegen for this pattern; then we could add backend-specific codegen if it's still required.
I think it's probably going to be necessary to add different codegen for each backend anyway. E.g. for powerpc64, your version gives:
bc: # @bc
.quad .Lfunc_begin0
.quad .TOC.@tocbase
.quad 0
.Lfunc_begin0:
std 25, -56(1) # 8-byte Folded Spill
std 26, -48(1) # 8-byte Folded Spill
std 27, -40(1) # 8-byte Folded Spill
std 28, -32(1) # 8-byte Folded Spill
std 29, -24(1) # 8-byte Folded Spill
std 30, -16(1) # 8-byte Folded Spill
clrlwi 5, 4, 31
rlwinm 6, 4, 31, 31, 31
rlwinm 7, 4, 30, 31, 31
rlwinm 8, 4, 29, 31, 31
rlwinm 9, 4, 28, 31, 31
rlwinm 10, 4, 27, 31, 31
rlwinm 11, 4, 26, 31, 31
rlwinm 12, 4, 25, 31, 31
rlwinm 0, 4, 24, 31, 31
rlwinm 30, 4, 23, 31, 31
rlwinm 29, 4, 22, 31, 31
rlwinm 28, 4, 21, 31, 31
rlwinm 27, 4, 20, 31, 31
rlwinm 26, 4, 19, 31, 31
rlwinm 25, 4, 18, 31, 31
rlwinm 4, 4, 17, 31, 31
neg 4, 4
stb 4, 15(3)
neg 4, 25
stb 4, 14(3)
neg 4, 26
stb 4, 13(3)
neg 4, 27
stb 4, 12(3)
neg 4, 28
stb 4, 11(3)
neg 4, 29
stb 4, 10(3)
neg 4, 30
stb 4, 9(3)
neg 4, 0
stb 4, 8(3)
neg 4, 12
stb 4, 7(3)
neg 4, 11
stb 4, 6(3)
neg 4, 10
stb 4, 5(3)
neg 4, 9
stb 4, 4(3)
neg 4, 8
stb 4, 3(3)
neg 4, 7
stb 4, 2(3)
neg 4, 6
stb 4, 1(3)
neg 4, 5
stb 4, 0(3)
ld 30, -16(1) # 8-byte Folded Reload
ld 29, -24(1) # 8-byte Folded Reload
ld 28, -32(1) # 8-byte Folded Reload
ld 27, -40(1) # 8-byte Folded Reload
ld 26, -48(1) # 8-byte Folded Reload
ld 25, -56(1) # 8-byte Folded Reload
blr
.long 0
.quad 0
Probably should be:
bar(float __vector(4)):
.LCF0:
0: addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
lvx 0,0,9
vbpermq 2,2,0
mfvsrd 3,34
extsw 3,3
blr
.long 0
.byte 0,9,0,0,0,0,0,0
.LC0:
.byte 120
.byte 112
.byte 104
.byte 96
.byte 88
.byte 80
.byte 72
.byte 64
.byte 56
.byte 48
.byte 40
.byte 32
.byte 24
.byte 16
.byte 8
.byte 0
See https://github.com/llvm/llvm-project/issues/90554
A lot of the backends are worse off than the ARM backends; the exception is the RISC-V backend, which gives:
.LCPI0_0:
.half 1 # 0x1
.half 2 # 0x2
.half 4 # 0x4
.half 8 # 0x8
.half 16 # 0x10
.half 32 # 0x20
.half 64 # 0x40
.half 128 # 0x80
.half 256 # 0x100
.half 512 # 0x200
.half 1024 # 0x400
.half 2048 # 0x800
.half 4096 # 0x1000
.half 8192 # 0x2000
.half 16384 # 0x4000
.half 32768 # 0x8000
bc: # @bc
lui a1, %hi(.LCPI0_0)
vsetivli zero, 16, e16, mf2, ta, ma
addi a1, a1, %lo(.LCPI0_0)
vle16.v v8, (a1)
vand.vx v8, v8, a0
vmsne.vi v0, v8, 0
vsetvli zero, zero, e8, mf4, ta, ma
vmv.v.i v8, 0
vmerge.vim v8, v8, -1, v0
ret
I haven't worked enough with vector-enabled RISC-V assembly to know whether one could do better, but it certainly seems decent.
I can get LLVM to do a bitcast i16 to <16 x i1> + sext <16 x i1> to <16 x i8> like so:
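(In IR terms, the pattern being described amounts to roughly this minimal sketch; the function name is just for illustration.)
define <16 x i8> @bitcast_sext(i16 %a0) {
  ; reinterpret the 16 mask bits as a vector of booleans
  %m = bitcast i16 %a0 to <16 x i1>
  ; widen each bit to a 0x00/0xff byte
  %r = sext <16 x i1> %m to <16 x i8>
  ret <16 x i8> %r
}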
Unfortunately, LLVM doesn't currently do anything smart for this. Here is the 16-bit version (compiled for Apple M2):
Here is the output for the 16-bit and 64-bit versions compiled for x86-64 Westmere:
Here we have two strategies for accomplishing this task for 16-byte vectors. In the first, we use a pshufb, which in ARM-land could be replaced by tbl, to broadcast the first byte to the first 8 bytes and the second byte to the second 8 bytes. Then we load up a mask and use pand + pcmpeqb to turn each bit into a byte. On ARM we have equivalents of both of those instructions, but we also have cmtst, which can do both of those steps in one.
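In IR terms, that first strategy amounts to roughly the sketch below (the function name is just for illustration; the shufflevector stands in for the pshufb/tbl broadcast, and the and + icmp + sext combination is what cmtst collapses into a single instruction by setting a lane to all ones when the AND of its operands is nonzero):
define <16 x i8> @bits_to_bytes(i16 %a0) {
  ; reinterpret the 16-bit mask as two bytes
  %b = bitcast i16 %a0 to <2 x i8>
  ; pshufb/tbl step: byte 0 into lanes 0-7, byte 1 into lanes 8-15
  %splat = shufflevector <2 x i8> %b, <2 x i8> poison, <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  ; pand step: isolate one bit per lane
  %bit = and <16 x i8> %splat, <i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 128, i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 128>
  ; pcmpeqb-style test, then widen set bits to 0x00/0xff bytes
  %set = icmp ne <16 x i8> %bit, zeroinitializer
  %res = sext <16 x i1> %set to <16 x i8>
  ret <16 x i8> %res
}
On AArch64 the shuffle maps to a tbl with the fixed (0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1) index vector mentioned below, and the and + compare + widen collapses into one cmtst against the bit constants.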
The second strategy does punpcklbw on the 8-byte bitstring, which interleaves it with itself and is equivalent to zip1 on ARM. Then it uses pshuflw xmm, xmm, 80 + pshufd xmm0, xmm0, 80. I don't think we have support for tbl-with-constant on ARM, so I think we have to just use the first strategy.
In size-optimized open code, to produce the second, third, and fourth vector, we might want to reuse the same vector (0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1) to pass into the tbl instruction, so we could take the scalar input, shl it by 16, 32, and 48, and move the results into vector registers separately. Alternatively, one could use ext once they are already in vector registers.
However, it might make more sense for performance to have a vector that holds each of the following, especially in a loop:
That way, we can tbl with each of them in parallel, and run 4 cmtsts in parallel afterwards.