riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

drop tail undisturbed/mask agnostic #664

Closed solomatnikov closed 3 years ago

solomatnikov commented 3 years ago

Is there a reason to keep tail undisturbed?

Tail agnostic permits both behaviors: leaving the tail undisturbed (natural for in-order cores) and overwriting it with all ones (natural for out-of-order cores). Therefore tail agnostic is good enough for all core implementations.

1) If out-of-order cores are required to implement tail undisturbed, then every vector load has to be treated as a masked vector load, i.e. it has to depend on (read) the previous value of the destination vector register / physical register, because in case of an exception the rest of the vector register, starting from vstart, must remain undisturbed. This can have a performance impact on vector loads and adds non-trivial design and verification complexity.

2) Some instructions always operate as tail agnostic, so dropping tail undisturbed will make the ISA more consistent.

3) An additional mode has a non-trivial verification cost.

Similar arguments apply to dropping mask agnostic.

nick-knight commented 3 years ago

I'm currently using tail-undisturbed as part of a matrix repacking and zero-padding routine. In particular, I'm using tail-undisturbed loads into zero-initialized registers to simulate tail-zeroing.

I'm not arguing that my niche application justifies the tradeoffs you discuss, just mentioning that there is an application.
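
For concreteness, a minimal sketch of the zero-padding trick described above (element width, LMUL, and register choices are assumptions, not taken from the actual routine):

vsetvli t0, x0, e32, m1, ta, mu   # vl = VLMAX for this vtype
vmv.v.i v8, 0                     # zero the whole destination register
vsetvli t0, a1, e32, m1, tu, mu   # a1 = short row length; tu keeps elements >= vl
vle32.v v8, (a0)                  # load the short row; the tail stays zero-padded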

solomatnikov commented 3 years ago

@nick-knight this can be done with just a couple of instructions and mask undisturbed:

vsetvli t1, x0, vtypei
vmv.v.i v1, 0
vid.v v0
vmslt.vx v0, v0, t2 # t2 is the number of non-zero elements
# use v0 as a mask

nick-knight commented 3 years ago

Sure. At the cost of keeping around an extra mask register. My code doesn't need masking otherwise, so this increases register pressure in the inner loop. And I suspect on some implementations, masked forms may be less performant than unmasked forms (I don't have data to support this). I'm just saying, it's a tradeoff.

topperc commented 3 years ago

LLVM uses tail undisturbed to implement insertion of a small vector into a larger vector. Using the offset and the VL you can control where the insertion starts and ends.
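
Roughly, such an insertion might lower to a sequence like the following (a hedged sketch, not actual LLVM output; SEW/LMUL and register choices are assumptions):

# insert an n-element vector v16 into v8 starting at element index `offset`
vsetvli zero, a2, e32, m1, tu, mu # a2 = offset + n; tu keeps elements at and above offset + n
vslideup.vx v8, v16, a1           # a1 = offset: v8[offset + i] = v16[i] for i < n;
                                  # elements below offset are never written by vslideup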

solomatnikov commented 3 years ago

LLVM uses tail undisturbed to implement insertion of a small vector into a larger vector. Using the offset and the VL you can control where the insertion starts and ends.

Hmm, this looks like a hack.

vslide with mask should do it.
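
A sketch of what that masked-slide alternative could look like (registers and SEW/LMUL assumed). Because vslideup never writes elements below the offset, a single "index < offset + n" mask together with mask-undisturbed is enough to protect everything outside the insertion window:

vsetvli t1, x0, e32, m1, ta, mu   # vl = VLMAX; mu = mask-undisturbed
vid.v v2                          # v2[i] = i
vmsltu.vx v0, v2, a2              # a2 = offset + n: active only below the window's end
vslideup.vx v8, v16, a1, v0.t     # a1 = offset; masked-off elements (index >= offset + n)
                                  # keep their old values, and elements below offset are
                                  # never written by vslideup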

kito-cheng commented 3 years ago

IIRC @aswaterman told us vslide is slow? Or maybe it's very implementation dependent?

David-Horner commented 3 years ago

All the performance characteristics are implementation dependent. However, if tail undisturbed is one of the two fill options for virtually all instructions, it is highly likely to be optimized, whereas slide is just one of many instructions and thus provides less impetus to optimize it / commit many transistors to it.

solomatnikov commented 3 years ago

IIRC @aswaterman told us vslide is slow? Or maybe it's very implementation dependent?

Hacky use of tail undisturbed is likely to have even more performance problems in a high-performance OoO implementation.

aswaterman commented 3 years ago

The following heuristic should apply to any high-quality implementation: VSLIDE will be no slower than using VSE followed by a dependent VLE.

On some implementations, VSLIDE will perform similarly to simple ALU ops like VADD. On other implementations, it will be noticeably slower than VADD, depending on the slide distance. (I'd expect < 2x overhead in practice.)

On some implementations, the tail-undisturbed operations will have noticeable overhead, as @solomatnikov mentioned. This overhead generally shouldn't be extreme (often <= 1 cycle per instruction in practice) but will add up.

So, this might be -mtune territory. Both approaches should get decent performance on any implementation, but one or the other will be preferable for specific implementations.

David-Horner commented 3 years ago

On 2021-04-15 11:32 p.m., Alex Solomatnikov wrote:

IIRC @aswaterman told us vslide is slow? Or maybe it's very implementation dependent?

Hacky use of tail undisturbed is likely to have even more performance problems in a high-performance OoO implementation.

Essentially by definition, the actual use of tail undisturbed [whether hacky or not] is expected to have a performance impact on OoO and non-OoO massive vector implementations.

Agnostic is the means to allow efficient implementation across all micro-archs.

If agnostic provides no optimization opportunity for a given u-arch, then that u-arch can simply implement it as undisturbed with no performance overhead.

It was an allowance specifically because it was anticipated that YMMV across multiple implementations.

Your concern is certainly more nuanced than this explanation gives it credit for.

However, the software folk suggested early on that undisturbed would be useful, and it thus became the default standard. That is the basis for the tradeoff.

We now recommend that agnostic be the software standard and undisturbed be used only when needed to improve algorithms and hence performance.

This "hack" is thus intentional and quite in keeping with the hacking nature of toolchain and OS developers, many of whom take pride in squeezing out performance from even obscure ISA features.

solomatnikov commented 3 years ago

@nick-knight @topperc @kito-cheng could you post code sequences with actual values of vl and lmul? It would be useful for discussion and performance analysis.

rofirrim commented 3 years ago

A data point that may not be very conclusive, in case it helps.

Reductions benefit from tail undisturbed when using vl to avoid having epilog code. We can easily accumulate partial reductions (using vl) into a vector that has been initialised (under vl=VLMAX) with the neutral element. That said, this is the only scenario I've seen where it is very convenient (at least for the programmer) to have this behaviour. Almost every other code I've seen seems to do fine with tail agnostic (as in the code does not care about the tail).

That said, one could argue that this particular algorithm is not well suited to being implemented without epilog code, and that we could do better with a first VLMAX-only loop (so the tail policy is not relevant) followed by epilog code that does a final accumulation (of the partial vector reduced from the previous VLMAX loop and the values from the epilog) using tail-agnostic and mask-undisturbed, followed by the reduction itself (v{,f}red<op>.vs).

Dot products are essentially reductions and they seem important in many applications these days.
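
A minimal sketch of the strip-mined reduction pattern described above, using an integer sum for brevity (register names and e32/m1 are assumptions):

    vsetvli t0, x0, e32, m1, ta, mu
    vmv.v.i v8, 0                     # accumulator initialised with the neutral element
loop:
    vsetvli t0, a1, e32, m1, tu, mu   # tu: elements >= vl keep their earlier partial sums
    vle32.v v16, (a0)
    vadd.vv v8, v8, v16               # accumulate only the first vl elements
    slli t1, t0, 2                    # advance pointer by vl * 4 bytes
    add a0, a0, t1
    sub a1, a1, t0                    # decrement remaining element count
    bnez a1, loop
    vsetvli t0, x0, e32, m1, ta, mu   # back to VLMAX for the horizontal step
    vmv.s.x v9, x0                    # scalar accumulator = 0
    vredsum.vs v9, v8, v9             # final reduction over all VLMAX partial sums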

David-Horner commented 3 years ago

Undisturbed allows for avoiding register spills. Software can retain data in the "tail" of a register to be used in subsequent operations without memory ops. This may be its most beneficial use, and a sufficiently positive use for it to have justified including it [or so thought the TG]. However, software folk are ingenious and other important variations and usages of undisturbed are undoubtedly discoverable.
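
As a sketch of that idea (registers and widths assumed): if elements vl..VLMAX-1 of a register already hold live data from earlier work, a tail-undisturbed operation can reuse the low part of the register without first spilling the tail to memory and reloading it afterwards:

vsetvli t0, a1, e32, m1, tu, mu   # a1 < VLMAX; tu: elements >= vl are left untouched
vle32.v v8, (a0)                  # overwrite only elements 0..vl-1 of v8; the live data
                                  # parked in the tail survives without a vse32/vle32 pair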

nick-knight commented 3 years ago

If I'm not mistaken, we just saw another use-case of tail-undisturbed in the conversation at #671: implementing circular shift. As in other examples, it could also be simulated by using masking and mask-undisturbed, at the extra cost of initializing and maintaining the mask.

tony-cole commented 3 years ago

I have some uses for Tail-Undisturbed which produce fairly optimal code (no setting up and use of an additional mask register, thus reducing register spills). I think tu is a useful feature, as it allows me to work on part of a vector in situ:

https://godbolt.org/z/cdcavjTs1

Note: these have not been tested yet.

#include <riscv_vector.h>

/* Examples of using Tail-Undisturbed to perform operations which would otherwise require
   additional setup, vector registers and CPU/VPU time */

vint64m8_t vrotatedown_vx_i64m8 (vint64m8_t src, size_t offset, size_t vl)
{
    vint64m8_t tmp;
    size_t     offset_diff = vl - offset;

    /* Slide elements 0..(offset - 1) of src up to top position in tmp (offset_diff) */
    tmp = vslideup_vx_i64m8(tmp, src, offset_diff, vl);

    /* THIS NEEDS Tail Undisturbed for it to work.
       Fortunately tu is automatically set in the vsetvli instruction */

    /* Slide down src by offset and merge with tmp using Tail Undisturbed */
    tmp = vslidedown_vx_i64m8(tmp, src, offset, offset_diff);

    return tmp;
}

vrotatedown_vx_i64m8(__rvv_int64m8_t, unsigned int, unsigned int): # @vrotatedown_vx_i64m8(__rvv_int64m8_t, unsigned int, unsigned int)
        sub     a2, a1, a0
        vsetvli a1, a1, e64,m8,ta,mu
        vslideup.vx     v16, v8, a2
        vsetvli a1, a2, e64,m8,tu,mu
        vslidedown.vx   v16, v8, a0
        vmv8r.v v8, v16
        ret

/* A more general version of vrotatedown_vx is to slide 2 src operands.
   This performs a rotate when src1 = src2 */
vint64m8_t vslidedown_vv_i64m8 (vint64m8_t src1, vint64m8_t src2, size_t offset, size_t vl)
{
    vint64m8_t tmp;
    size_t     offset_diff = vl - offset;

    /* Slide elements 0..(offset - 1) of src2 up to top position in tmp (offset_diff) */
    tmp = vslideup_vx_i64m8(tmp, src2, offset_diff, vl);

    /* THIS NEEDS Tail Undisturbed for it to work.
       Fortunately tu is automatically set in the vsetvli instruction */

    /* Slide down src1 by offset and merge with tmp using Tail Undisturbed */
    tmp = vslidedown_vx_i64m8(tmp, src1, offset, offset_diff);

    return tmp;
}

vslidedown_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int): # @vslidedown_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int)
        sub     a2, a1, a0
        vsetvli a1, a1, e64,m8,ta,mu
        vslideup.vx     v24, v16, a2
        vsetvli a1, a2, e64,m8,tu,mu
        vslidedown.vx   v24, v8, a0
        vmv8r.v v8, v24
        ret

vint64m8_t vmerge_vv_i64m8 (vint64m8_t src_low, vint64m8_t src_high, size_t offset, size_t vl)
{
    vint64m8_t tmp;

    tmp = src_high;

    /* THIS NEEDS Tail Undisturbed for it to work.
       I can't find a way to set tu in vsetvli instruction with the current intrinsics,
       so THIS FUNCTION WILL NOT WORK when using vmv_v_v_i64m8(), 
       using vslidedown_vx_i64m8() instead which automatically sets tu */

    /* Merge src_low elements 0..(offset - 1) with tmp (src_high) elements offset..(vl - 1) using Tail Undisturbed */
    // tmp = vmv_v_v_i64m8(src_low, offset); ta is set, but we need tu
    tmp = vslidedown_vx_i64m8(tmp, src_low, 0, offset);

    return tmp;
}

vmerge_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int): # @vmerge_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int)
        vsetvli a0, a0, e64,m8,tu,mu
        vslidedown.vi   v16, v8, 0
        vmv8r.v v8, v16
        ret

Just need a way to set tu with the intrinsics...

kasanovic commented 3 years ago

We're keeping undisturbed.

solomatnikov commented 3 years ago

According to a compiler engineer, tail undisturbed is completely useless for a vectorizing compiler.

rofirrim commented 3 years ago

According to a compiler engineer, tail undisturbed is completely useless for a vectorizing compiler.

"Completely useless" is, in my opinion, a bit of an extreme stance: I'd say that the common approaches to vectorization do not benefit from an undisturbed policy in the tail.

However, as I've stated above, reductions do benefit from it. Think of the partial reductions being accumulated in a register that is fully initialised before the loop (with vlmax): the last iterations may not accumulate into all the elements, but you still want to preserve their earlier values (the earlier partial values). After the loop, the final reduction is done horizontally over this vector (using a v{,f}red<op>.vs instruction, again with vlmax). If we do not have undisturbed, partial accumulation is a bit more involved, as it requires doing the partial accumulation and then merging it into the partial-reductions vector under a mask of vid < vl. (Perhaps at the CPU implementation level that would not be a big deal?)
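
A sketch of one way that alternative could look per strip (using a masked add in place of an explicit merge; registers and e32/m1 are assumptions, and the pointer/counter bookkeeping is omitted):

vsetvli t0, a1, e32, m1, ta, mu    # t0 = vl for this strip
vle32.v v16, (a0)                  # load vl new elements (the tail of v16 is agnostic)
vsetvli t1, x0, e32, m1, ta, mu    # switch to VLMAX so the whole accumulator is covered
vid.v v24                          # could be hoisted out of the loop
vmsltu.vx v0, v24, t0              # mask: element index < vl of this strip
vadd.vv v8, v8, v16, v0.t          # masked, mask-undisturbed: elements >= vl keep their
                                   # earlier partial sums; the agnostic tail of v16 never
                                   # affects the result because those elements are inactive

Even with the vid hoisted out of the loop, the extra vsetvli and the compare remain in the strip-mined body, which is the overhead being weighed against tail-undisturbed.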

Beyond vectorization, others have pointed out that there are uses for tail undisturbed.

(Now if you ask me: we could get rid of the mask policy and we could hardcode it to mask undisturbed. I don't think anyone has presented a use case for mask agnostic)

kasanovic commented 3 years ago

Not all vector code is generated by a vectorizing compiler.

tony-cole commented 3 years ago

I am relieved that Tail Undisturbed will remain in the specification, as I use it a lot (in most of my code, actually). But I was wondering why this issue exists in the first place, and how code using Tail Agnostic can be as fast (and compact) as code using Tail Undisturbed for, say, the following vector max code (as an example):

https://godbolt.org/z/d9xMTj3rd

#include <float.h>
#include <riscv_vector.h>

void max_flt( const float *p_src, uint32_t len, float *p_res )
{
   float   new_val;
   float   max_val = FLT_MIN;

   while ( len > 0 )
   {
       new_val = *p_src++;

       if ( max_val < new_val )
       {
           max_val = new_val;
       }

       len --;
   }

   *p_res = max_val;
}

Also, what does the output of a vectorizing compiler look like for the above code? How does it do it?

Here is my Tail Undisturbed version and assembly output: https://godbolt.org/z/xdaf1EMr8

#include <float.h>
#include <riscv_vector.h>

/* Find the maximum value in the vector *p_src of length len, using RISC-V Vector strip-mining */
void max_flt( const float *p_src, uint32_t len, float *p_res )
{
    uint32_t        len_ctr;    /* The remaining number of elements to process */
    vfloat32m8_t    v_new_vals; /* New values read from memory */
    vfloat32m8_t    v_max_vals; /* Current maximum values */
    vfloat32m1_t    v_max_val;  /* Final max value */
    size_t          vl;

    len_ctr = len;

    v_max_vals = vfmv_v_f_f32m8( FLT_MIN, len_ctr ); /* Set the max vector to the minimum value */
    v_max_val  = vfmv_v_f_f32m1( FLT_MIN,       1 ); /* Set the final reduction result to the minimum value */

    while ( len_ctr > 0 )
    {
        vl = vsetvl_e32m8( len_ctr );

        /* Load the next values from memory */
        v_new_vals = vle32_v_f32m8( p_src, len_ctr );

        /* Find the maximum of the current and new values *** REQUIRES TAIL UNDISTURBED *** */
        v_max_vals = vfmax_vv_f32m8( v_max_vals, v_new_vals, len_ctr );

        /* Update the pointer and remaining length */
        p_src   += vl;
        len_ctr -= vl;
    }

    /* Reduce the max vector to the scalar result */
    v_max_val = vfredmax_vs_f32m8_f32m1( v_max_val, v_max_vals, v_max_val, len );

    /* Save the scalar max result */
    vse32_v_f32m1( p_res, v_max_val, 1 );
}

Using LLVM, this compiles to:

.LCPI0_0:
        .word   0x00800000                      # float 1.17549435E-38
max_flt(float const*, unsigned int, float*):                       # @max_flt(float const*, unsigned int, float*)
        lui     a3, %hi(.LCPI0_0)
        flw     ft0, %lo(.LCPI0_0)(a3)
        vsetvli zero, a1, e32, m8, ta, mu
        vfmv.v.f        v8, ft0
        vsetivli        zero, 1, e32, m1, ta, mu
        vfmv.v.f        v25, ft0
        beqz    a1, .LBB0_3
        mv      a3, a1
.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vsetvli a4, a3, e32, m8, ta, mu  ##### This should be tu TAIL UNDISTURBED (currently no way to specify in LLVM)
        vle32.v v16, (a0)
        vfmax.vv        v8, v8, v16
        slli    a5, a4, 2
        sub     a3, a3, a4
        add     a0, a0, a5
        bnez    a3, .LBB0_2
.LBB0_3:
        vsetvli zero, a1, e32, m8, tu, mu
        vfredmax.vs     v25, v8, v25
        vsetivli        zero, 1, e32, m1, ta, mu
        vse32.v v25, (a2)
        ret

If someone can use the vector intrinsics and port this function using Tail Agnostic (& LLVM) in the most optimal way possible, then we can compare and discuss the issues so I (and others) can better understand the implications on different platforms.

nick-knight commented 3 years ago

I think this is a great use-case for tail undisturbed. The point @solomatnikov was raising with this GitHub issue is that tail-undisturbed use-cases, like yours, can be simulated using tail-agnostic and mask-undisturbed. Here is the prototypical code sequence he suggested earlier in this thread:

vsetvli t1, x0, vtypei
vmv.v.i v1, 0
vid.v v0
vmslt.vx v0, v0, t2 # t2 is the number of non-zero elements
# use v0 as a mask

Integrating this into a strip-mined loop like yours will require adding some "boilerplate" code reminiscent of fringe-case handling in the SIMD ISAs of yore. In other words, I claim that a tail-undisturbed implementation like yours leads to more compact code than a tail-agnostic implementation.
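
As a rough illustration (a hedged sketch, not compiler output; it borrows the register assignments from the LLVM listing above and assumes e32/m8), the inner loop of the earlier max_flt example might become something like:

.LBB0_2:
        vsetvli a4, a3, e32, m8, ta, mu   # a4 = vl for this strip
        vle32.v v16, (a0)
        vsetvli a6, zero, e32, m8, ta, mu # switch to VLMAX so v8's tail is covered
        vid.v v24                         # could be hoisted out of the loop
        vmsltu.vx v0, v24, a4             # mask: element index < vl
        vfmax.vv v8, v8, v16, v0.t        # mask-undisturbed: elements >= vl keep the
                                          # partial maxima already accumulated in v8
        slli a5, a4, 2
        sub a3, a3, a4
        add a0, a0, a5
        bnez a3, .LBB0_2

Relative to the tail-undisturbed loop, this adds an extra vsetvli and a compare per iteration (the vid can be hoisted), and it ties up v0 plus a scratch register group for the index vector: the "boilerplate" referred to above.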

I feel this addresses your question of program compactness. You also asked about speed. That is implementation-dependent. For example, an implementation could execute tail-undisturbed instructions much more slowly than tail-agnostic mask-undisturbed instructions, plus the aforementioned boilerplate/overhead. (I'm not claiming this is a useful design point, just that it's possible.)

A third question you asked concerns the autovectorization capabilities of current compilers. I'm afraid I can't address this one.

And in case anyone would like context on comments in your code snippets concerning limitations of the RVV intrinsics API, there is extensive discussion over at https://github.com/riscv/rvv-intrinsic-doc/issues/27 (my summary: https://github.com/riscv/rvv-intrinsic-doc/issues/27#issuecomment-870713856).

tony-cole commented 3 years ago

@nick-knight thank you. I have another question. You said:

You also asked about speed. That is implementation-dependent. For example, an implementation could execute tail-undisturbed instructions much more slowly than tail-agnostic mask-undisturbed instructions, plus the aforementioned boilerplate/overhead.

Would there be a difference in speed between Tail Agnostic and Tail Undisturbed when the vector length vl = VLMAX? I would assume they would be the same speed.

If they are the same speed, then my code will be faster for all iterations except the last, when vl < VLMAX.

I also don't see why Mask Undisturbed would be faster than Tail Undisturbed when vl = VLMAX; I would assume they would be the same speed.

Anyway, this is a simple example. In more complex code I have used masking (undisturbed) and require Tail Undisturbed as well when vl < VLMAX. Setting up the mask to also cater for Tail Undisturbed would add extra code and registers, making it more complex, bigger, slower on vector units that don't have agnostic speed-ups, and less elegant.

That's the point: code designed for Tail Agnostic will be slower on vector units that don't have register renaming or other agnostic speed-ups. So as programmers we will have to produce two performance versions, something I think the Vector design team is against.

nick-knight commented 3 years ago

I'm not clear what you are getting at here: are you hoping for an architectural guarantee that the relative speed of tail-undisturbed must be sufficiently close to that of tail-agnostic? Currently the RISC-V ISA says nothing, anywhere, about relative instruction speeds, and I don't see this changing. The architecture describes the functionality, the microarchitecture decides the speeds, and implementors risk uncompetitive products by introducing performance cliffs. From my (cynical) software perspective, it's best to write as much tail-undisturbed software as possible, driving the market forces to ultimately reduce performance cliffs.

Pragmatically, my strategy is to write tail-undisturbed code where it is appropriate (like your example), but to revisit this decision when tuning for a particular microarchitecture. In that context, I'm already dealing with several "performance versions". For example, if the cost of vfredmax is sufficiently large relative to vfmax --- which depends on the implementation --- I may want to consider targeting LMUL < 8. Additionally, I am considering a long list of variations for cases of small application vectors (len), where "small" is understood relative to things like VLEN and datapath width(s). These are all decisions that a vectorizing compiler could (in principle) automate, but the RVV intrinsics (or assembly) programmer has to address explicitly.

kasanovic commented 3 years ago

Just to add to this closed issue discussion, the conversation seems to be missing the obvious point that in-order implementations can implement tail-undisturbed at full speed with no cost. If we were to force these smaller implementations to execute longer code sequences to synthesize tail-undisturbed when useful, they would be slower (not to mention the code-size growth).

Of course, assumption throughout is that software should only use tu when it is actually needed.

To reduce implementation complexity, renamed machines can synthesize tail-undisturbed using the underlying microarchitecture needed for mask-undisturbed (e.g., execute tail-undisturbed as a full-length mask-undisturbed uop with an internal mask generated from vl - basically generating the suggested code sequence on the fly). This still adds some complexity versus only supporting tail-agnostic, but should not perform worse than executing the equivalent macro code sequence. If tu is not commonly used in code, then this slow implementation would be fine. If people actually use it (as I suspect they will), then further optimizations will be found to be necessary, but then having to run common instructions fast is not exactly a bad thing.

solomatnikov commented 3 years ago

Did the x86 AAA instruction have use (and abuse) cases? - Yes: https://www.hugi.scene.org/online/coding/hugi%2017%20-%20coaax.htm

Did it make the code that used it more compact? - Yes.

Did it reduce the number of registers in the code that used it? - Most likely.

Why is it not in the RISC-V ISA? - Because it doesn't make sense to include an instruction just because there are some marginal, rare use cases (with obvious and efficient workarounds).

Another example is VAX: "The INDEX instruction is used to calculate the address of an array element while at the same time checking to see that the index fits in the array bounds. This is clearly an important function to accurately detect errors in high-level languages statements."

Why is the INDEX instruction not in the RISC-V ISA? - Because it doesn't make sense if you believe the arguments from this paper: D. Patterson, D. Ditzel, "The Case for the Reduced Instruction Set Computer".

I argue that the complexity cost of having tail undisturbed in addition to tail agnostic is significantly worse than AAA and INDEX because it affects most of RVV instructions, which means verification space explosion.

It also contradicts the open source spirit by raising the complexity barrier and preventing a variety of implementations.

Another downside: confused SW people will use (and abuse) tail undisturbed in all kinds of unnecessary ways (as evident from this thread). And then they will be even more confused by subtle, hard to predict/hard to explain performance issues of tail undisturbed in high performance RVV implementations.

solomatnikov commented 3 years ago

Also, with renaming and tail undisturbed, some widening instructions need to read 5 vector registers instead of at most 4 without tail undisturbed.

Register file ports are expensive, which means some implementations might choose to execute such instructions at half throughput in tail undisturbed mode.