llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[Aarch64] `bitcast i16 to <16 x i1>` + `sext <16 x i1> to <16 x i8>` should not go bit-by-bit #107700

Open Validark opened 2 months ago

Validark commented 2 months ago

I can get LLVM to do a bitcast i16 to <16 x i1> + sext <16 x i1> to <16 x i8> like so:

export fn unmovemask16(x: u16) @Vector(16, u8) {
    return @select(u8, @as(@Vector(16, u1), @bitCast(x)) == @as(@Vector(16, u1), @splat(1)),
        @as(@Vector(16, u8), @splat(0xff)),
        @as(@Vector(16, u8), @splat(0)));
}

export fn unmovemask64(x: u64) @Vector(64, u8) {
    return @select(u8, @as(@Vector(64, u1), @bitCast(x)) == @as(@Vector(64, u1), @splat(1)),
        @as(@Vector(64, u8), @splat(0xff)),
        @as(@Vector(64, u8), @splat(0)));
}
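
In LLVM IR terms, this boils down to roughly the following pattern (a minimal hand-written sketch; the function name is illustrative, not the exact frontend output):

define <16 x i8> @unmovemask16_ir(i16 %x) {
  ; one i1 per bit of the input, then widen each lane to a full byte
  %m = bitcast i16 %x to <16 x i1>
  %r = sext <16 x i1> %m to <16 x i8>
  ret <16 x i8> %r
}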

Unfortunately, LLVM doesn't currently do anything smart for this. Here is the 16-bit version (compiled for Apple M2):

unmovemask_normal: // @unmovemask_normal
  sub sp, sp, #16
  sbfx w8, w0, #1, #1
  sbfx w9, w0, #0, #1
  fmov s0, w9
  mov v0.b[1], w8
  sbfx w8, w0, #2, #1
  mov v0.b[2], w8
  sbfx w8, w0, #3, #1
  mov v0.b[3], w8
  sbfx w8, w0, #4, #1
  mov v0.b[4], w8
  sbfx w8, w0, #5, #1
  mov v0.b[5], w8
  sbfx w8, w0, #6, #1
  mov v0.b[6], w8
  sbfx w8, w0, #7, #1
  mov v0.b[7], w8
  sbfx w8, w0, #8, #1
  mov v0.b[8], w8
  sbfx w8, w0, #9, #1
  mov v0.b[9], w8
  sbfx w8, w0, #10, #1
  mov v0.b[10], w8
  sbfx w8, w0, #11, #1
  mov v0.b[11], w8
  sbfx w8, w0, #12, #1
  mov v0.b[12], w8
  sbfx w8, w0, #13, #1
  mov v0.b[13], w8
  sbfx w8, w0, #14, #1
  mov v0.b[14], w8
  sbfx w8, w0, #15, #1
  mov v0.b[15], w8
  add sp, sp, #16
  ret

Here is the emitted assembly for the 16-bit and 64-bit versions, compiled for x86-64 Westmere:

.LCPI0_0:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
.LCPI0_1:
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
unmovemask16:
        movd    xmm0, edi
        pshufb  xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_1]
        pand    xmm0, xmm1
        pcmpeqb xmm0, xmm1
        ret

.LCPI1_0:
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
unmovemask64:
        movq    xmm3, rdi
        punpcklbw       xmm3, xmm3
        pshuflw xmm0, xmm3, 80
        pshufd  xmm0, xmm0, 80
        movdqa  xmm4, xmmword ptr [rip + .LCPI1_0]
        pand    xmm0, xmm4
        pcmpeqb xmm0, xmm4
        pshuflw xmm1, xmm3, 250
        pshufd  xmm1, xmm1, 80
        pand    xmm1, xmm4
        pcmpeqb xmm1, xmm4
        pshufhw xmm2, xmm3, 80
        pshufd  xmm2, xmm2, 250
        pand    xmm2, xmm4
        pcmpeqb xmm2, xmm4
        pshufhw xmm3, xmm3, 250
        pshufd  xmm3, xmm3, 250
        pand    xmm3, xmm4
        pcmpeqb xmm3, xmm4
        ret

Here we have two strategies for accomplishing this task for 16-byte vectors. In the first, we use a pshufb, which in ARM-land could be replaced by tbl, to broadcast the first byte to the first 8 bytes and the second byte to the second 8 bytes. Then we load up a mask and use pand+pcmpeqb to turn each bit into a byte. On ARM we have equivalents of both of those instructions, but we also have cmtst, which can do both of those steps in one.

The second strategy does punpcklbw on the 8-byte bitstring, which interleaves it with itself (the equivalent of zip1 on ARM). Then it uses pshuflw xmm, xmm, 80 followed by pshufd xmm0, xmm0, 80. I don't think we have support for a tbl-with-constant equivalent on ARM, so I think we have to just use the first strategy.
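
A rough IR-level sketch of that first strategy (hand-written, assuming little-endian byte order; the shufflevector corresponds to pshufb/tbl, and the trailing and + icmp + sext is the part that pand+pcmpeqb covers on x86, or that a single cmtst could cover on AArch64, since cmtst tests whether the masked bits are nonzero, i.e. the icmp ne form used here):

define <16 x i8> @unmovemask16_sketch(i16 %x) {
  %bytes = bitcast i16 %x to <2 x i8>
  ; broadcast byte 0 into lanes 0..7 and byte 1 into lanes 8..15
  %bcast = shufflevector <2 x i8> %bytes, <2 x i8> poison,
             <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0,
                         i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  ; isolate one bit per lane and turn "bit set" into an all-ones byte
  %bits  = and <16 x i8> %bcast, <i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 -128,
                                  i8 1, i8 2, i8 4, i8 8, i8 16, i8 32, i8 64, i8 -128>
  %cmp   = icmp ne <16 x i8> %bits, zeroinitializer
  %res   = sext <16 x i1> %cmp to <16 x i8>
  ret <16 x i8> %res
}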

In size-optimized open code, to produce the second, third, and fourth vector for the 64-bit version, we might want to reuse the same index vector (0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1) to pass into the tbl instruction, so we could take the scalar input, shift it right by 16, 32, and 48, and move each result into a vector register separately. Alternatively, one could use ext once the data is already in vector registers.

However, it might make more sense for performance to keep a separate index vector for each of the following, especially in a loop:

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7

That way, we can tbl with each of them in parallel, and run 4 cmtsts in parallel afterwards.

.LCPI0_0:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
        .byte   1
.LCPI0_1:
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   2
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
        .byte   3
.LCPI0_2:
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   4
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
        .byte   5
.LCPI0_3:
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   6
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
        .byte   7
.LCPI0_4:
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
        .byte   1
        .byte   2
        .byte   4
        .byte   8
        .byte   16
        .byte   32
        .byte   64
        .byte   128
unmovemask64:
        fmov    d0, x0
        adrp    x9, .LCPI0_0
        ldr     q1, [x9, :lo12:.LCPI0_0]
        tbl     v1.16b, { v0.16b }, v1.16b
        adrp    x9, .LCPI0_1
        ldr     q2, [x9, :lo12:.LCPI0_1]
        tbl     v2.16b, { v0.16b }, v2.16b
        adrp    x9, .LCPI0_2
        ldr     q3, [x9, :lo12:.LCPI0_2]
        tbl     v3.16b, { v0.16b }, v3.16b
        adrp    x9, .LCPI0_3
        ldr     q4, [x9, :lo12:.LCPI0_3]
        tbl     v0.16b, { v0.16b }, v4.16b
        adrp    x9, .LCPI0_4
        ldr     q4, [x9, :lo12:.LCPI0_4]
        cmtst   v1.16b, v1.16b, v4.16b
        cmtst   v3.16b, v3.16b, v4.16b
        cmtst   v0.16b, v0.16b, v4.16b
        stp     q3, q0, [x8, #32]
        cmtst   v0.16b, v2.16b, v4.16b
        stp     q1, q0, [x8]
        ret
llvmbot commented 2 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Niles Salter (Validark)

RKSimon commented 2 months ago

With suitable endian handling we could do worse than using this generic fallback expansion: https://clang.godbolt.org/z/e6eME3nx1

define <16 x i8> @bc(i16 %a0) {
  %v0 = insertelement <16 x i16> poison, i16 %a0, i32 0
  %v1 = shufflevector <16 x i16> %v0, <16 x i16> poison, <16 x i32> zeroinitializer
  %v2 = and <16 x i16> %v1, <i16 1, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 512, i16 1024, i16 2048, i16 4096, i16 8192, i16 16384, i16 32768> 
  %v3 = icmp ne <16 x i16> %v2, zeroinitializer
  %v4 = sext <16 x i1> %v3 to <16 x i8>
  ret <16 x i8> %v4
}
Validark commented 2 months ago

With suitable endian handling we could do worse than using this generic fallback expansion: https://clang.godbolt.org/z/e6eME3nx1

define <16 x i8> @bc(i16 %a0) {
  %v0 = insertelement <16 x i16> poison, i16 %a0, i32 0
  %v1 = shufflevector <16 x i16> %v0, <16 x i16> poison, <16 x i32> zeroinitializer
  %v2 = and <16 x i16> %v1, <i16 1, i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256, i16 512, i16 1024, i16 2048, i16 4096, i16 8192, i16 16384, i16 32768> 
  %v3 = icmp ne <16 x i16> %v2, zeroinitializer
  %v4 = sext <16 x i1> %v3 to <16 x i8>
  ret <16 x i8> %v4
}

Do you think that's better than the tbl+cmtst routine? Dup+cmtst*2+zip+mvnot is probably going to be higher latency?

RKSimon commented 2 months ago

I was suggesting it as a generic fallback, since most targets suffer from terrible codegen for this pattern; then we could add backend-specific codegen if it's still required.

Validark commented 2 months ago

I was suggesting it as a generic fallback, since most targets suffer from terrible codegen for this pattern; then we could add backend-specific codegen if it's still required.

I think it's probably going to be necessary to add different codegen for each backend anyway. E.g. for powerpc64, your version gives:

bc: # @bc
  .quad .Lfunc_begin0
  .quad .TOC.@tocbase
  .quad 0
.Lfunc_begin0:
  std 25, -56(1) # 8-byte Folded Spill
  std 26, -48(1) # 8-byte Folded Spill
  std 27, -40(1) # 8-byte Folded Spill
  std 28, -32(1) # 8-byte Folded Spill
  std 29, -24(1) # 8-byte Folded Spill
  std 30, -16(1) # 8-byte Folded Spill
  clrlwi 5, 4, 31
  rlwinm 6, 4, 31, 31, 31
  rlwinm 7, 4, 30, 31, 31
  rlwinm 8, 4, 29, 31, 31
  rlwinm 9, 4, 28, 31, 31
  rlwinm 10, 4, 27, 31, 31
  rlwinm 11, 4, 26, 31, 31
  rlwinm 12, 4, 25, 31, 31
  rlwinm 0, 4, 24, 31, 31
  rlwinm 30, 4, 23, 31, 31
  rlwinm 29, 4, 22, 31, 31
  rlwinm 28, 4, 21, 31, 31
  rlwinm 27, 4, 20, 31, 31
  rlwinm 26, 4, 19, 31, 31
  rlwinm 25, 4, 18, 31, 31
  rlwinm 4, 4, 17, 31, 31
  neg 4, 4
  stb 4, 15(3)
  neg 4, 25
  stb 4, 14(3)
  neg 4, 26
  stb 4, 13(3)
  neg 4, 27
  stb 4, 12(3)
  neg 4, 28
  stb 4, 11(3)
  neg 4, 29
  stb 4, 10(3)
  neg 4, 30
  stb 4, 9(3)
  neg 4, 0
  stb 4, 8(3)
  neg 4, 12
  stb 4, 7(3)
  neg 4, 11
  stb 4, 6(3)
  neg 4, 10
  stb 4, 5(3)
  neg 4, 9
  stb 4, 4(3)
  neg 4, 8
  stb 4, 3(3)
  neg 4, 7
  stb 4, 2(3)
  neg 4, 6
  stb 4, 1(3)
  neg 4, 5
  stb 4, 0(3)
  ld 30, -16(1) # 8-byte Folded Reload
  ld 29, -24(1) # 8-byte Folded Reload
  ld 28, -32(1) # 8-byte Folded Reload
  ld 27, -40(1) # 8-byte Folded Reload
  ld 26, -48(1) # 8-byte Folded Reload
  ld 25, -56(1) # 8-byte Folded Reload
  blr
  .long 0
  .quad 0

Probably should be:

bar(float __vector(4)):
.LCF0:
0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        addis 9,2,.LC0@toc@ha
        addi 9,9,.LC0@toc@l
        lvx 0,0,9
        vbpermq 2,2,0
        mfvsrd 3,34
        extsw 3,3
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
.LC0:
        .byte   120
        .byte   112
        .byte   104
        .byte   96
        .byte   88
        .byte   80
        .byte   72
        .byte   64
        .byte   56
        .byte   48
        .byte   40
        .byte   32
        .byte   24
        .byte   16
        .byte   8
        .byte   0

See https://github.com/llvm/llvm-project/issues/90554

A lot of the backends are worse off than the ARM backends; the RISC-V backend is the exception. It gives:

.LCPI0_0:
  .half 1 # 0x1
  .half 2 # 0x2
  .half 4 # 0x4
  .half 8 # 0x8
  .half 16 # 0x10
  .half 32 # 0x20
  .half 64 # 0x40
  .half 128 # 0x80
  .half 256 # 0x100
  .half 512 # 0x200
  .half 1024 # 0x400
  .half 2048 # 0x800
  .half 4096 # 0x1000
  .half 8192 # 0x2000
  .half 16384 # 0x4000
  .half 32768 # 0x8000
bc: # @bc
  lui a1, %hi(.LCPI0_0)
  vsetivli zero, 16, e16, mf2, ta, ma
  addi a1, a1, %lo(.LCPI0_0)
  vle16.v v8, (a1)
  vand.vx v8, v8, a0
  vmsne.vi v0, v8, 0
  vsetvli zero, zero, e8, mf4, ta, ma
  vmv.v.i v8, 0
  vmerge.vim v8, v8, -1, v0
  ret

I haven't worked enough with vector-enabled RISC-V assembly to know whether one could do better, but it certainly seems decent.