Open preames opened 1 year ago
@llvm/issue-subscribers-backend-risc-v
https://reviews.llvm.org/D149263 for the vslide1down part of this.
First patch has landed; a second to improve undef sub-sequences is now posted: https://reviews.llvm.org/D149658
Current codegen for the buildvec_4xi32 case is:
buildvec_4xi32: # @buildvec_4xi32
# %bb.0:
vsetivli zero, 4, e32, m1, ta, ma
vslide1down.vx v8, v8, a0
vslide1down.vx v8, v8, a1
vslide1down.vx v8, v8, a2
vslide1down.vx v8, v8, a3
ret
Codegen for the 2xi32 case hasn't yet changed.
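For anyone not familiar with the slide idiom, here is a rough scalar model in C of what the vslide1down.vx sequence above computes. This is only a sketch of the unmasked e32 semantics as I understand them; `slide1down_e32` and `buildvec_by_slides` are hypothetical names for illustration, not anything in LLVM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an unmasked "vslide1down.vx vd, vs2, rs1" with e32 elements:
 * vd[i] = vs2[i + 1] for i < vl - 1, and vd[vl - 1] = rs1.
 * The in-place form models vd == vs2, as in the listing above. */
static void slide1down_e32(uint32_t *v, size_t vl, uint32_t x) {
    assert(vl > 0);
    for (size_t i = 0; i + 1 < vl; ++i)
        v[i] = v[i + 1];   /* every element moves down one position */
    v[vl - 1] = x;         /* the scalar enters at the top */
}

/* Hypothetical helper: build a vl-element vector by sliding in each scalar
 * in order. The initial contents of v do not matter, because all of them
 * are shifted out after vl slides. */
static void buildvec_by_slides(uint32_t *v, size_t vl, const uint32_t *elts) {
    for (size_t i = 0; i < vl; ++i)
        slide1down_e32(v, vl, elts[i]);
}
```

After vl slides, element i of the result holds the i-th scalar, which is why the source register can start out undefined.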
Looking at the examples below, we've got a couple of possibilities for ways to improve generic buildvector codegen. Please take the following as a list of ideas; not all of these may work out. Note that I'm also talking about the generic case with no repeated elements, etc.
For vectors with power-of-two lengths and a total size of at most 64 bits, we can do shift/or on the scalar side plus a single scalar-to-vector move. This may require a VTYPE toggle, but that's likely cheaper than a series of inserts.
For vectors with power-of-two lengths and a total size greater than 64 bits, we can group the elements into 64-bit chunks. This reduces the number of vector instructions and I-to-V moves, at the cost of extra scalar work.
We should be able to use either vslide1up or vslide1down. If we can exploit the undefined tail property, we should be able to do this without individual VL toggles between inserts. Note that this requires undefined tail, not simply tail agnostic. Combined with the above, we should have one vsetvli + VLEN/64 inserts.
Note that the case where VLEN=128 is particularly important - as it is the minimum guaranteed by V, and thus what SLP is able to target by default.
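To make the scalar-side packing concrete, here is a minimal C sketch of the shift/or step for both the at-most-64-bit case and the 64-bit-chunk case. It assumes 64-bit scalar registers (RV64) and that element 0 lands in the least significant bits of the packed value; `pack_i16` and `pack_i32_chunks` are hypothetical names for illustration, not a proposed lowering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack up to four 16-bit elements into one 64-bit scalar with shift/or.
 * Element 0 goes in the least significant bits, so a single scalar-to-vector
 * move of the result would place the elements in index order. */
static uint64_t pack_i16(const uint16_t *elts, size_t n) {
    assert(n <= 4);
    uint64_t packed = 0;
    for (size_t i = 0; i < n; ++i)
        packed |= (uint64_t)elts[i] << (16 * i);
    return packed;
}

/* Group an even number of 32-bit elements into 64-bit chunks, two elements
 * per chunk. Returns the number of chunks, i.e. the number of 64-bit inserts
 * that would remain on the vector side. */
static size_t pack_i32_chunks(const uint32_t *elts, size_t n, uint64_t *chunks) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        chunks[i / 2] = (uint64_t)elts[i] | ((uint64_t)elts[i + 1] << 32);
    return n / 2;
}
```

For the 4xi32 case this gives two chunks, so two e64 inserts instead of four e32 ones, which matches the VLEN/64 count for VLEN=128.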