riscvarchive / riscv-v-spec

Working draft of the proposed RISC-V V vector extension

Vector instructions should support scalars from element 0 of a vector register (as well as from x & f registers) #671

Open tony-cole opened 3 years ago

tony-cole commented 3 years ago

There are a few vector instructions that take scalars from x or f registers, but not from element 0 of a vector register.

My suggestion is to have the vector instructions take a scalar operand from element 0 of a vector register as well as from the x and f registers.

This may not seem important at the moment, with SEW up to 64 bits, but in the future I think it will provide optimal performance whenever the x or f registers have fewer bits than SEW.

For instance, I am trying to implement a vector rotate by 1, as follows (https://godbolt.org/z/d4fcojTPo):

#include <riscv_vector.h>

vint64m8_t vrotate1down_vx_i64m8 (vint64m8_t op1, int vl)
{
    int64_t scalar;

    scalar = vmv_x_s_i64m8_i64(op1);
    op1 = vslide1down_vx_i64m8(op1, scalar, vl);

    return op1;
}

which, with LLVM targeting 32-bit RISC-V, correctly compiles to:

vrotate1down_vx_i64m8(__rvv_int64m8_t, int): # @vrotate1down_vx_i64m8(__rvv_int64m8_t, int)
        addi    a1, zero, 32
        vsetivli        a2, 1, e64,m8,ta,mu
        vsrl.vx v16, v8, a1
        vmv.x.s a1, v16
        vmv.x.s a2, v8
        slli    a0, a0, 1
        vsetvli a0, a0, e32,m8,ta,mu
        vslide1down.vx  v8, v8, a2
        vslide1down.vx  v8, v8, a1
        ret

Note the use of a vector shift, two vector-to-scalar moves, two x registers, and two vslide1down instructions. This would be 2x worse if SEW were 128 bits, 4x worse if SEW were 256 bits, etc.

It would be much faster as a single instruction taking the scalar from element 0 of a vector register, e.g.:

        vslide1down.vs  v8, v8, v8

where 's' (in the .vs) is the scalar (element 0) from a vector register, as in vredsum.vs, with a pseudo-instruction name of vrotate1down.vs.

There are many other instructions that could benefit from this by keeping the data in the VPU.
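
To make the intended semantics concrete, here is a small scalar reference model of the rotate-down-by-one (an illustrative sketch only; tail elements beyond vl are assumed to be left untouched):

#include <stdint.h>

/* Illustrative scalar model of the rotate-down-by-one: element 0 moves to
   the last active position and every other active element moves down one.
   Elements at indices >= vl are left alone. */
static void rotate1down_ref(int64_t *v, int vl)
{
    if (vl <= 0)
        return;
    int64_t elem0 = v[0];      /* the "scalar" read from element 0 */
    for (int i = 0; i + 1 < vl; i++)
        v[i] = v[i + 1];       /* slide everything down by one */
    v[vl - 1] = elem0;         /* element 0 wraps around to the top */
}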

nick-knight commented 3 years ago

I don't have the bandwidth right now for a solid game of code golf, but I think an alternative might be to splat the zeroth element (vrgather.vi) and then do a tail-undisturbed vslidedown.vi, with uimm = 1 and length VL-1, into the splatted vector.

Obviously this would require a tail-undisturbed slide (possibly less performant on some implementations), the extra instruction to splat, and the extra register pressure to hold the splatted element. Plus, there's the practical difficulty of accessing tail-undisturbed behavior from the (current) RVV intrinsics API...

If you really care about efficient vector rotations, a specialized instruction could be best. That's a better conversation to have over at #672.

nick-knight commented 3 years ago

To be clear, this is related to a more general RVV programming technique: a splat followed by .vv-type instructions. This is the "canonical" way (my opinion) of, say, scaling a vector by the result of a reduction-sum while avoiding data movement between the scalar and vector register files (and avoiding the loss of decoupling between the scalar and vector pipelines).
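
As a rough, untested sketch of that pattern (using the older intrinsics naming that appears elsewhere in this thread, and assuming the reduction result is already sitting in element 0 of an LMUL=1 register):

#include <riscv_vector.h>

/* Sketch only: 'red' is assumed to hold a reduction result (e.g. from
   vredsum.vs) in its element 0. The value is splatted and consumed with an
   ordinary .vv multiply, so it never leaves the vector register file. */
vint64m8_t scale_by_element0_i64m8 (vint64m8_t vec, vint64m1_t red, int vl)
{
    /* Widen the LMUL=1 register to LMUL=8 without moving any data. */
    vint64m8_t red8 = vlmul_ext_v_i64m1_i64m8(red);

    /* Splat element 0 across the active elements (a vrgather.vi/.vx). */
    vint64m8_t splat = vrgather_vx_i64m8(red8, 0, vl);

    /* Plain .vv arithmetic against the splatted value. */
    return vmul_vv_i64m8(vec, splat, vl);
}

The splat-plus-.vv pair stands in for what would otherwise be a vmv.x.s followed by a .vx instruction.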

tony-cole commented 3 years ago

Thanks, I like the code golf analogy.

The rotation was the example that brought this issue to light. I have the function shown above and will try to think of better ways of doing the same thing, but if anyone else wants to play a fun game of code golf (or ping-pong), please let me know a better way of doing vrotate1down_vx_i64m8().

My main point was the omission of the ability to use scalars directly from element 0 of a vector register (as well as from the x and f registers). As you put it, not "losing decoupling between scalar and vector pipelines" would surely be a performance benefit?

nick-knight commented 3 years ago

I think we are in agreement. Your proposal is (fairly) clear: to augment existing vector instructions with .vs-type variants. I was pointing out a mechanism for simulating this behavior in current RVV, one that avoids loss of decoupling with a modest increase in register pressure and arithmetic. The question is whether the performance benefits of your proposal (vs. this mechanism) outweigh the costs of adding these instructions (assuming the encoding space can even admit them). This latter question I do not have the expertise to address.

tony-cole commented 3 years ago

Yes, we are in agreement and I don't have the expertise either to answer your question:

The question is whether the performance benefits of your proposal (vs. this mechanism) outweigh the costs of adding these instructions (assuming the encoding space can even admit them). This latter question I do not have the expertise to address.

As an aside to this issue: thanks for the reminder about tail-undisturbed. I've not used that feature yet, but you gave me an idea for a better solution to my rotate function:

See: https://godbolt.org/z/j8e66GPrq

#include <riscv_vector.h>

vint64m8_t vrotate1down_vs_i64m8 (vint64m8_t vec, int vl)
{
    vint64m8_t vecTmp;

    /* Copy element 0 of vec to vecTmp */
    vecTmp = vmv_v_v_i64m8(vec, 1);

    /* Slide element 0 up to top position in vecTmp */
    vecTmp = vslideup_vx_i64m8(vecTmp, vecTmp, vl - 1, vl);

    /* Slide vec down by 1 and merge with vecTmp using the Tail Undisturbed feature */
    vecTmp = vslidedown_vx_i64m8(vecTmp, vec, 1, vl - 1);

    return vecTmp;
}

which, with LLVM targeting 32-bit RISC-V, compiles to:

vrotate1down_vs_i64m8(__rvv_int64m8_t, int): # @vrotate1down_vs_i64m8(__rvv_int64m8_t, int)
        vsetivli        a1, 1, e64,m8,ta,mu
        vmv.v.v v24, v8
        addi    a1, a0, -1
        vsetvli a0, a0, e64,m8,tu,mu
        vmv8r.v v16, v24
        vslideup.vx     v16, v24, a1
        vsetvli a0, a1, e64,m8,tu,mu
        vslidedown.vi   v16, v8, 1
        vmv8r.v v8, v16
        ret

I have not tested this yet, but it keeps the scalar element 0 in the VPU and does not require multiple vslide1down instructions to "load" the scalar into the vector top position.

Looks like the compiler has generated a load of unnecessary vector moves, but that's another issue...

aswaterman commented 3 years ago

Encoding space is a major concern. We've already exhausted the available funct3 patterns, so orthogonal .vs support cannot be added without moving to another major opcode. This is a feature we should consider for a follow-on vector extension that uses 64-bit instructions. In the meantime, vrgather.vi is the way to go.

tony-cole commented 3 years ago

Thanks for the vrgather.vi hint, I had not properly looked at its immediate/scalar functionality.

I assume vrgather.vi is generally faster than vslideup.vx?

For completeness, here is a more optimal solution that uses the Tail Undisturbed policy (also see: https://github.com/riscv/riscv-v-spec/issues/664#issuecomment-836810691):

#include <riscv_vector.h>

vint64m8_t vrotate1down_v_i64m8 (vint64m8_t vec, int vl)
{
    vint64m8_t vecTmp;

    /* Splat element 0 to all active elements in vecTmp */
    vecTmp = vrgather_vx_i64m8(vec, 0, vl);

    /* Slide vec down by 1 and merge with vecTmp using Tail Undisturbed */
    vecTmp = vslidedown_vx_i64m8(vecTmp, vec, 1, vl - 1);

    return vecTmp;
}

vrotate1down_v_i64m8(__rvv_int64m8_t, int): # @vrotate1down_v_i64m8(__rvv_int64m8_t, int)
        vsetvli a1, a0, e64,m8,ta,mu
        vrgather.vi     v16, v8, 0
        addi    a0, a0, -1
        vsetvli a0, a0, e64,m8,tu,mu
        vslidedown.vi   v16, v8, 1
        vmv8r.v v8, v16
        ret

vint64m8_t vslide1down_vs_i64m8 (vint64m8_t vec, vint64m1_t scalar, int vl)
{
    vint64m8_t vecTmp;

    /* 'Cast'-extend scalar into vecTmp */
    vecTmp = vlmul_ext_v_i64m1_i64m8 (scalar);

    /* Splat scalar element 0 to all active elements in vecTmp */
    vecTmp = vrgather_vx_i64m8(vecTmp, 0, vl);

    /* Slide vec down by 1 and merge with vecTmp using Tail Undisturbed */
    vecTmp = vslidedown_vx_i64m8(vecTmp, vec, 1, vl - 1);

    return vecTmp;
}

vslide1down_vs_i64m8(__rvv_int64m8_t, __rvv_int64m1_t, int): # @vslide1down_vs_i64m8(__rvv_int64m8_t, __rvv_int64m1_t, int)
        vsetvli a1, a0, e64,m8,ta,mu
        vrgather.vi     v24, v16, 0
        addi    a0, a0, -1
        vsetvli a0, a0, e64,m8,tu,mu
        vslidedown.vi   v24, v8, 1
        vmv8r.v v8, v24
        ret

https://godbolt.org/z/9sv8zPPj8
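
For anyone who wants to sanity-check the rotate on hardware or an emulator, a minimal test sketch could look like the following (untested; the vsetvl_e64m8/vle64_v_i64m8/vse64_v_i64m8 calls follow the older intrinsics naming used above and are my assumption for the load/store boilerplate):

#include <stdio.h>
#include <stdint.h>
#include <riscv_vector.h>

/* Untested sketch: load 8 elements, rotate them down by one with the
   vrotate1down_v_i64m8() above, store them back and print the result.
   Expected output: 11 12 13 14 15 16 17 10 (assuming vl comes back as 8). */
int main(void)
{
    int64_t in[8]  = {10, 11, 12, 13, 14, 15, 16, 17};
    int64_t out[8] = {0};

    size_t vl = vsetvl_e64m8(8);
    vint64m8_t v = vle64_v_i64m8(in, vl);
    v = vrotate1down_v_i64m8(v, (int)vl);
    vse64_v_i64m8(out, v, vl);

    for (size_t i = 0; i < vl; i++)
        printf("%lld ", (long long)out[i]);
    printf("\n");
    return 0;
}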

So, using vrgather.vi to splat the selected element to all elements is the best alternative for this issue. Doing things with vrgather.vi (instead of this issue's proposed support for scalars) will cost a little more CPU/VPU time and another vector register, which may cause additional register spills/restores.

aswaterman commented 3 years ago

@tony-cole On the implementations I've worked on, vrgather.vi and vrgather.vx are fast operations (comparable to doing a vadd.vi).

I still agree there's merit to your preferred solution; it just runs up against encoding-space limitations. I do suspect that follow-on extensions to the vector extension, with the benefit of longer instructions, will add something along the lines of what you're proposing.

kasanovic commented 3 years ago

Previous versions of the spec had this model, with scalars residing in element 0 of the vector registers (and the vector registers overlaid onto the FP registers). For various reasons, including coping with the existing FP ABI and making more registers available, we moved to the current model, but due to encoding space (and overall implementation complexity) we could not continue to also accommodate scalars at element 0. As Andrew says, this is a possible component of future extensions, but likely only with a longer instruction encoding.