llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Improvements to buildvector codegen #62365

Open preames opened 1 year ago

preames commented 1 year ago

Looking at the examples below, we have a couple of possibilities for improving generic buildvector codegen. Please take the following as a list of ideas; not all of them may work out. Note that I'm also talking about the generic case with no repeated elements, etc.

For vectors with power-of-two lengths less than or equal to 64 bits, we can do shift/or on the scalar side plus a single scalar-to-vector move. This may require a VTYPE toggle, but that's likely cheaper than a series of inserts.

For vectors with power-of-two lengths greater than 64 bits, we can group the elements into 64-bit chunks. This reduces the number of vector instructions and I-to-V moves, at the cost of extra scalar work.

We should be able to use either vslide1up or vslide1down. If we can exploit the undefined tail property, we should be able to do this without individual VL toggles between inserts. Note that this requires undefined tail, not simply tail agnostic. Combined with the above, we should have one vsetvli + VLEN/64 inserts.

Note that the case where VLEN=128 is particularly important, as it is the minimum guaranteed by the V extension and thus what SLP is able to target by default.

$ cat buildvector.ll
define <2 x i32> @buildvec_2xi32(i32 %a, i32 %b) {
  %v1 = insertelement <2 x i32> poison, i32 %a, i32 0
  %v2 = insertelement <2 x i32> %v1, i32 %b, i32 1
  ret <2 x i32> %v2
}

define <4 x i32> @buildvec_4xi32(i32 %a, i32 %b, i32 %c, i32 %d) {
  %v1 = insertelement <4 x i32> poison, i32 %a, i32 0
  %v2 = insertelement <4 x i32> %v1, i32 %b, i32 1
  %v3 = insertelement <4 x i32> %v2, i32 %c, i32 2
  %v4 = insertelement <4 x i32> %v3, i32 %d, i32 3
  ret <4 x i32> %v4
}
$ ./opt -S buildvector.ll -O3 | ./llc -mtriple=riscv64 -mattr=+v
    .text
    .attribute  4, 16
    .attribute  5, "rv64i2p1_f2p2_d2p2_v1p0_zicsr2p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
    .file   "buildvector.ll"
    .globl  buildvec_2xi32                  # -- Begin function buildvec_2xi32
    .p2align    2
    .type   buildvec_2xi32,@function
    .variant_cc buildvec_2xi32
buildvec_2xi32:                         # @buildvec_2xi32
# %bb.0:
    vsetivli    zero, 2, e32, mf2, ta, ma
    vmv.v.x v8, a1
    vsetvli zero, zero, e32, mf2, tu, ma
    vmv.s.x v8, a0
    ret
.Lfunc_end0:
    .size   buildvec_2xi32, .Lfunc_end0-buildvec_2xi32
                                        # -- End function
    .globl  buildvec_4xi32                  # -- Begin function buildvec_4xi32
    .p2align    2
    .type   buildvec_4xi32,@function
    .variant_cc buildvec_4xi32
buildvec_4xi32:                         # @buildvec_4xi32
# %bb.0:
    addi    sp, sp, -16
    sw  a3, 12(sp)
    sw  a2, 8(sp)
    sw  a1, 4(sp)
    sw  a0, 0(sp)
    mv  a0, sp
    vsetivli    zero, 4, e32, m1, ta, ma
    vle32.v v8, (a0)
    addi    sp, sp, 16
    ret
.Lfunc_end1:
    .size   buildvec_4xi32, .Lfunc_end1-buildvec_4xi32
                                        # -- End function
    .section    ".note.GNU-stack","",@progbits
llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-risc-v

preames commented 1 year ago

https://reviews.llvm.org/D149263 for the vslide1down part of this.

preames commented 1 year ago

First patch has landed, a second to improve undef sub-sequences is now posted: https://reviews.llvm.org/D149658

Current codegen for the buildvec_4xi32 case is:

buildvec_4xi32:                         # @buildvec_4xi32
# %bb.0:
    vsetivli    zero, 4, e32, m1, ta, ma
    vslide1down.vx  v8, v8, a0
    vslide1down.vx  v8, v8, a1
    vslide1down.vx  v8, v8, a2
    vslide1down.vx  v8, v8, a3
    ret

Codegen for the 2xi32 case hasn't yet changed.