riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

POST V1.0 - support .vvm variants for slide1up/slide1down #358

Open David-Horner opened 4 years ago

David-Horner commented 4 years ago

This variant is a simple, low-cost extension that is consistent with the merge instructions:

Description: This .vvm variant uses the vs1 vector register group as the source for the non-masked elements, replacing vd for this purpose. All tail elements are copied over from vs1. As with the .vx variants, the destination cannot overlap either source.

The .vv non-masked variant should be reserved, as it is equivalent to the non-masked .vx variant.

It is a natural extension to the instruction and it parallels the .vvm variant of merge.

Its cost is then comparable to that of other variants of base instructions.

To be consistent with the slideup/slidedown instructions, when the relevant mask bit is set (v0[0] for slide1up and v0[vl-1] for slide1down), binary zero is the replacement value of the "shifted out element".
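The description above can be sketched as a small behavioural model. This is one reading of the proposal, not a specification: the function names, the list-based register representation, and the treatment of v0 as a list of 0/1 values are all assumptions made for illustration.

```python
def vslide1up_vvm(vs2, vs1, v0, vl):
    """Hypothetical model of the proposed vslide1up.vvm vd, vs2, vs1, v0."""
    # Inactive body elements and all tail elements are taken from vs1.
    vd = list(vs1)
    for i in range(vl):
        if v0[i]:
            # Active element: slide up by one; element 0 receives binary zero.
            vd[i] = vs2[i - 1] if i > 0 else 0
    return vd


def vslide1down_vvm(vs2, vs1, v0, vl):
    """Hypothetical model of the proposed vslide1down.vvm vd, vs2, vs1, v0."""
    vd = list(vs1)
    for i in range(vl):
        if v0[i]:
            # Active element: slide down by one; element vl-1 receives binary zero.
            vd[i] = vs2[i + 1] if i < vl - 1 else 0
    return vd


vs2 = [10, 11, 12, 13, 14, 15]
vs1 = [20, 21, 22, 23, 24, 25]
v0 = [1, 0, 1, 1, 0, 0]
vl = 4  # elements 4 and 5 are tail elements

up = vslide1up_vvm(vs2, vs1, v0, vl)
down = vslide1down_vvm(vs2, vs1, v0, vl)
print(up)    # [0, 21, 11, 12, 24, 25]
print(down)  # [11, 21, 13, 0, 24, 25]
```

In both results, positions 4 and 5 carry the vs1 values, reflecting the rule that tail elements are copied from vs1.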

Equivalences:

vslide1up.vvm vd, vs2, vs1, v0 replaces the two-instruction sequence:

    vmv.v.v vd, vs1

    vslide1up.vx vd, vs2, x0, v0.t

Similarly, vslide1down.vvm is equivalent to the corresponding two-instruction sequence.
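A rough way to check the claimed equivalence is to model both forms and compare results. The sketch below assumes that the masked slide leaves inactive elements undisturbed and that the vector move leaves tail elements undisturbed. All function names are illustrative, and the single-instruction model is only one reading of the proposal.

```python
def vmv_v_v(vd, vs1, vl):
    """Vector move: copy the first vl elements (tail assumed undisturbed)."""
    return list(vs1[:vl]) + list(vd[vl:])


def vslide1up_vx_masked(vd, vs2, x, v0, vl):
    """Masked slide1up with scalar x (inactive elements assumed undisturbed)."""
    out = list(vd)
    for i in range(vl):
        if v0[i]:
            out[i] = x if i == 0 else vs2[i - 1]
    return out


def vslide1up_vvm(vs2, vs1, v0, vl):
    """Hypothetical single-instruction form proposed in this issue."""
    vd = list(vs1)  # inactive and tail elements come from vs1
    for i in range(vl):
        if v0[i]:
            vd[i] = 0 if i == 0 else vs2[i - 1]
    return vd


vs2 = [1, 2, 3, 4, 5, 6]
vs1 = [-1, -2, -3, -4, -5, -6]
old_vd = [99] * 6
v0 = [1, 1, 0, 1, 0, 1]
vl = 5

# Two-instruction sequence: move vs1 into vd, then masked slide with x0 (zero).
two_step = vslide1up_vx_masked(vmv_v_v(old_vd, vs1, vl), vs2, 0, v0, vl)
one_step = vslide1up_vvm(vs2, vs1, v0, vl)

body_matches = two_step[:vl] == one_step[:vl]
print(body_matches)  # True
```

Under these assumptions the two forms agree on the first vl elements. The tail can differ: the modelled two-instruction sequence keeps the old vd tail, while the proposed variant copies the tail from vs1.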

Addresses design restriction:

To allow slide1up/slide1down to be restartable, the destination cannot overlap the source vector register groups. As a result, without the .vvm variant, a vmv or equivalent is required to address this restriction when it arises.

Known application:

The DUPH operation in the FFTW3 library can be implemented in a single instruction with a mask of (0,1,0,1,....)

    vslide1up.vvm  vd, vs1, vs1, v0

Similarly, the DUPL operation can be implemented in a single instruction (with a complement mask in v0).
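To make the pair-duplication effect concrete, the sketch below applies a hypothetical model of the proposed variants to one source vector used as both vs2 and vs1. The model, the function name, and the convention that element 0 is listed first in the mask are assumptions, not taken from the spec.

```python
def vslide1_vvm(vs2, vs1, v0, vl, up):
    """Hypothetical model of vslide1up.vvm (up=True) or vslide1down.vvm (up=False)."""
    vd = list(vs1)  # inactive and tail elements come from vs1
    for i in range(vl):
        if not v0[i]:
            continue
        if up:
            vd[i] = vs2[i - 1] if i > 0 else 0
        else:
            vd[i] = vs2[i + 1] if i < vl - 1 else 0
    return vd


a = [10, 11, 20, 21, 30, 31, 40, 41]  # four (even, odd) element pairs
vl = len(a)
odd_mask = [0, 1] * 4   # (0,1,0,1,...) with element 0 first
even_mask = [1, 0] * 4  # the complement mask

# Slide up into the odd positions: each even element is duplicated.
dup_even = vslide1_vvm(a, a, odd_mask, vl, up=True)
# Slide down into the even positions: each odd element is duplicated.
dup_odd = vslide1_vvm(a, a, even_mask, vl, up=False)

print(dup_even)  # [10, 10, 20, 20, 30, 30, 40, 40]
print(dup_odd)   # [11, 11, 21, 21, 31, 31, 41, 41]
```

In this model, the slide1up form duplicates the even-indexed element of each pair and the slide1down form duplicates the odd-indexed one. How these map onto FFTW's DUPL and DUPH depends on the element-ordering convention, which the post does not spell out.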

Alternatives:

The two-instruction equivalence above could be optimized via fusion or chaining. The RISC-V synchronous interrupt requirement imposes considerable constraints on vector implementation designs. In particular, register renaming, re-buffering or deferring interrupts is required.

If the V synchronous processing were relaxed, numerous chaining and, especially, fusing opportunities would be available to in-order non-speculative implementations. It would be my preference to prescribe such relaxation when it is only visible from interrupt context.

However, even then, the slide1 .vvm case would still weigh toward inclusion, because it needs minimal additional gates and would thus be available to even the simplest implementations.

kasanovic commented 4 years ago

I can see the appeal, but OTOH this requires a mask register to be set up and might only make sense for even-odd interleaving, which could end up being handled with EDIV?

David-Horner commented 4 years ago

This does not need to be decided before V1.0. Deferring also defers the consideration of reserving the non-mask variant as a duplicate of another instruction.