riscvarchive / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Rotate and shift vector mask bits #919

Closed camel-cdr closed 1 year ago

camel-cdr commented 1 year ago

I ran into the problem of needing to shift mask bits up or down by one for quite a few vectorized algorithms I've been working on.

My current approach is to expand every mask bit to a full byte, use vslide*up/down, and convert back to a mask register, but this is very wasteful.
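To make the workaround concrete, here is a minimal Python model of the expand/slide/compare sequence. It is not RVV code: the function names are hypothetical, and a Python list stands in for a vector of byte elements. It only illustrates that the three steps together amount to a one-bit shift of the mask.

```python
def mask_to_bytes(mask, vl):
    """Expand each of the low vl mask bits to a full byte (0 or 1)."""
    return [(mask >> i) & 1 for i in range(vl)]

def slide1up(elems, fill=0):
    """Model of sliding elements up by one; element 0 receives `fill`."""
    return ([fill] + elems)[:len(elems)]

def bytes_to_mask(elems):
    """Compare each byte against zero to rebuild the mask bits."""
    mask = 0
    for i, e in enumerate(elems):
        if e != 0:
            mask |= 1 << i
    return mask

def shift_mask_up_via_bytes(mask, vl):
    """Three vector-style steps that together shift the mask up by one bit."""
    return bytes_to_mask(slide1up(mask_to_bytes(mask, vl)))
```

The cost the comment complains about is visible in the model: a vl-bit mask is blown up to vl bytes just to move every bit by one position.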

I was wondering if it would be possible to add bit shift and bit rotate instructions for mask registers, maybe for v0 only? That is, instructions that work across element boundaries, but only on mask registers.

This would help with the problem I described above, and would also reduce register pressure: you could, e.g., store eight masks for LMUL=1 registers in a single mask register and use a potential vmrotl by 8*idx to rotate the bits of the mask you want to use into place.
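A small Python model of the packing idea, under stated assumptions: each mask is 8 bits wide (as it would be for eight byte elements), eight of them are packed into one 64-bit value, and mask idx sits at bit offset 8*idx. With that layout the selected mask is brought to the low bits by a rotate right; a rotate left would do the same job with the masks stored in the opposite order. The helper names are hypothetical, and no such rotate instruction exists in the current spec.

```python
def rotr(x, k, width):
    """Rotate the low `width` bits of x right by k positions."""
    full = (1 << width) - 1
    x &= full
    k %= width
    return ((x >> k) | (x << (width - k))) & full

def pack_masks(masks, bits_per_mask=8):
    """Pack several small masks into one wide value, mask idx at offset idx*bits."""
    packed = 0
    low = (1 << bits_per_mask) - 1
    for idx, m in enumerate(masks):
        packed |= (m & low) << (idx * bits_per_mask)
    return packed

def select_mask(packed, idx, bits_per_mask=8, n_masks=8):
    """Rotate the wanted mask into the low bits, where a masked op would read it."""
    width = bits_per_mask * n_masks
    rotated = rotr(packed, idx * bits_per_mask, width)
    return rotated & ((1 << bits_per_mask) - 1)
```

In this model one packed value replaces eight separate mask registers, at the price of one rotate before each use.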

This can already be done by using vslideup + vslidedown + vmerge, but that's a lot more expensive (a possible future rotate elements instruction could also help with this).
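The slide-based rotation can be sketched on a list of elements as follows. This is a Python model rather than RVV code; the helpers are hypothetical and only loosely mirror the behaviour of the instructions named above (slide up keeps the low destination elements, slide down fills with zero past the end of the source).

```python
def slideup(dest, src, offset):
    """dest[i] = src[i - offset] for i >= offset; lower elements of dest are kept."""
    return [dest[i] if i < offset else src[i - offset] for i in range(len(src))]

def slidedown(src, offset):
    """result[i] = src[i + offset], with zero once the source index runs off the end."""
    n = len(src)
    return [src[i + offset] if i + offset < n else 0 for i in range(n)]

def merge(select, if_set, if_clear):
    """Per-element select, in the spirit of a masked merge."""
    return [a if s else b for s, a, b in zip(select, if_set, if_clear)]

def rotate_up(elems, k):
    """Rotate a whole vector of elements up by k using two slides and a merge."""
    n = len(elems)
    if n == 0:
        return []
    k %= n
    up = slideup([0] * n, elems, k)        # elements k..n-1 are in place
    down = slidedown(elems, n - k)         # wrapped-around elements land in 0..k-1
    low = [i < k for i in range(n)]        # mask selecting the wrapped part
    return merge(low, down, up)
```

Three full-width vector operations (plus building the select mask) are needed for what a single rotate would express directly.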

I just want to put this idea out there, and see if other people could benefit from such an addition to future versions of the spec. I also might have missed a better way than described above to already do this in the current spec.

aswaterman commented 1 year ago

I've heard other potential use cases for mask shifts. (Mask rotates are more expensive if the rotation is taken with respect to vl rather than VLMAX, since the shifted-off bits need to land in an arbitrary bit position.)
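The difference between the two rotation widths can be illustrated with a short Python model. The function names are hypothetical, and leaving the bits at and above vl unchanged is an assumption made for the sketch, not a statement about how such an instruction would treat the tail.

```python
def rotl_bits(x, k, width):
    """Rotate the low `width` bits of x left by k positions."""
    if width == 0:
        return 0
    full = (1 << width) - 1
    x &= full
    k %= width
    return ((x << k) | (x >> (width - k))) & full

def rotate_mask_vlmax(mask, k, vlmax):
    """Rotation over the full register width: the wrap point is fixed."""
    return rotl_bits(mask, k, vlmax)

def rotate_mask_vl(mask, k, vl, vlmax):
    """Rotation over only the low vl bits: the wrap point moves with vl."""
    body_bits = (1 << vl) - 1
    tail = mask & ((1 << vlmax) - 1) & ~body_bits   # assumed left unchanged
    return tail | rotl_bits(mask & body_bits, k, vl)
```

With vl = 5 and VLMAX = 16, rotating 0b10001 left by one gives 0b00011 in the vl-relative version (bit 4 wraps to bit 0) but 0b100010 in the VLMAX-relative version, which shows why the wrapped bits must be steered to a position that depends on vl.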

If you can constrain VL <= 64 (which in portable code might require an extra min instruction to constrain AVL at the head of the strip-mine loop), then you can type-pun the mask as a single 64b element and right-shift it:

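vsetivli x0, 1, e64, m1, tu, ma   # vl = 1, SEW = 64: view the mask as one 64-bit element
vsrl.vi v0, v0, 1                 # logical right shift of that element by one bit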

The extra setvls aren't ideal, but on most implementations they're cheap. Constraining VL to 64 also isn't ideal, but for apps processors with relatively small VLEN, it won't hurt.
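As a rough Python model of the suggestion above (the names are hypothetical, and the strip-mine loop is simplified): the requested length is clamped to 64 at the head of each iteration, and the mask is then shifted as if it were a single 64-bit integer.

```python
MASK64 = (1 << 64) - 1

def clamp_avl(avl, limit=64):
    """The extra min at the head of the loop, so that vl never exceeds 64."""
    return min(avl, limit)

def shift_mask_down_one(mask):
    """Model of the right shift with the mask viewed as one 64-bit element."""
    return (mask & MASK64) >> 1

def strip_mine_lengths(n):
    """Chunk lengths of a strip-mined loop over n elements with vl <= 64.

    Assumes the hardware grants the full clamped length each time; a real
    implementation with fewer elements per register would return less.
    """
    lengths = []
    while n > 0:
        vl = clamp_avl(n)
        lengths.append(vl)
        n -= vl
    return lengths
```

Because every active mask bit fits in the one 64-bit element, a plain integer shift moves all of them at once; bits shifted in at the top are zero.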

nick-knight commented 1 year ago

I think we agree that anything is possible using left/right shifts, up/down slides, and bitwise logic.

Additionally, special cases, like shifting by one bit, could leverage add-with-carry or subtract-with-borrow (to avoid slides).
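The arithmetic behind the add-with-carry idea can be sketched in Python. This is the familiar scalar multi-word idiom (each word is added to itself plus the carry from the word below), shown here as an assumption about what is meant; the comment above does not spell out a vector instruction sequence, and the sketch does not address how the carries would be passed between vector elements.

```python
def shift_left_one_adc(words, width=64):
    """Shift a little-endian multi-word value left by one bit using x + x + carry.

    Returns the shifted words and the final carry-out (the bit shifted off the top).
    """
    full = (1 << width) - 1
    out = []
    carry = 0
    for w in words:
        total = (w & full) + (w & full) + carry   # doubling is a one-bit left shift
        out.append(total & full)
        carry = total >> width                    # old top bit becomes the next carry-in
    return out, carry
```

The same reasoning applies in the other direction for subtract-with-borrow; only the left-shift case is shown.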

I think it would be helpful to provide concrete use-cases, to evaluate these existing possibilities.