Closed solomatnikov closed 3 years ago
I'm currently using tail-undisturbed as part of a matrix repacking and zero-padding routine. In particular, I'm using tail-undisturbed loads into zero-initialized registers to simulate tail-zeroing.
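To illustrate the technique (a scalar model, not actual RVV code — the buffer sizes and function name are hypothetical): the destination "register" is zeroed once up front, and every subsequent tail-undisturbed load of vl < VLMAX elements leaves the zero tail intact, giving tail-zeroing for free.

```c
#include <stddef.h>

/* Scalar model of tail-zeroing via a tail-undisturbed load.
 * dst plays the role of a vector register of vlmax elements. */
void load_with_zero_tail(double *dst, const double *src, size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vlmax; i++)
        dst[i] = 0.0;        /* zero-initialize the whole "register" once */
    for (size_t i = 0; i < vl; i++)
        dst[i] = src[i];     /* tail-undisturbed load: elements vl..vlmax-1 stay 0 */
}
```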
I'm not arguing that my niche application justifies the tradeoffs you discuss, just mentioning that there is an application.
@nick-knight this can be done with just a couple of instructions and mask undisturbed:
vsetvli t1, x0, vtypei
vmv.v.i v1, 0
vid.v v0
vmslt.vx v0, v0, t2 # t2 is the number of non-zero elements
# use v0 as a mask
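For clarity, here is a scalar model (illustrative only) of what that sequence achieves: vid.v followed by vmslt.vx builds a mask selecting elements 0..t2-1, and a masked, mask-undisturbed operation then leaves elements at index t2 and above untouched, simulating tail-undisturbed behaviour.

```c
#include <stddef.h>

/* Scalar model of the vid.v + vmslt.vx + masked-op sequence. */
void masked_tail_sim(int *dst, const int *src, size_t t2, size_t vlmax)
{
    for (size_t i = 0; i < vlmax; i++) {
        int active = (i < t2);   /* vid.v then vmslt.vx v0, v0, t2 */
        if (active)
            dst[i] = src[i];     /* mask-undisturbed: inactive elements keep old values */
    }
}
```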
Sure. At the cost of keeping around an extra mask register. My code doesn't need masking otherwise, so this increases register pressure in the inner loop. And I suspect on some implementations, masked forms may be less performant than unmasked forms (I don't have data to support this). I'm just saying, it's a tradeoff.
LLVM uses tail undisturbed to implement insertion of a small vector into a larger vector. Using the offset and the VL you can control where the insertion starts and ends.
Hmm, this looks like a hack.
vslide with mask should do it.
IIRC @aswaterman told us vslide is slow? Or maybe it's very implementation dependent?
All the performance characteristics are implementation dependent. However, if tail undisturbed is one of the two fill options for virtually all instructions, it is highly likely to be optimized. Whereas slide is just one of many instructions and thus has less impetus to optimize/commit many transistors.
IIRC @aswaterman told us vslide is slow? Or maybe it's very implementation dependent?
Hacky use of tail undisturbed is likely to have even more performance problems in high-performance OoO implementations.
The following heuristic should apply to any high-quality implementation: VSLIDE will be no slower than using VSE followed by dependent VLE.
On some implementations, VSLIDE will perform similar to simple ALU ops like VADD. On other implementations, it will be noticeably slower than VADD, depending on the slide distance. (I'd expect < 2x overhead in practice.)
On some implementations, the tail-undisturbed operations will have noticeable overhead, as @solomatnikov mentioned. This overhead generally shouldn't be extreme (often <= 1 cycle per instruction in practice) but will add up.
So, this might be -mtune territory. Both approaches should get decent performance on any implementation, but one or the other will be preferable for specific implementations.
Essentially by definition, the actual use of tail undisturbed [whether hacky or not] is expected to have performance impact on OoO and non-OoO massive vector implementations.
Agnostic is the means to allow efficient implementation across all micro-archs.
If the facility provides no opportunity for optimization to the u-arch, then it can operate as undisturbed with no performance overhead.
It was an allowance specifically because it was anticipated that YMMV across multiple implementations.
Your concern is certainly more nuanced than this explanation gives it credit for.
However, the software folk suggested early on that undisturbed would be useful, and it thus became the default standard. That is the basis for the tradeoff.
We now recommend that agnostic be the software standard and undisturbed be used only when needed to improve algorithms and hence performance.
This "hack" is thus intentional and quite in keeping with the hacking nature of toolchain and OS developers, many of whom take pride in squeezing out performance from even obscure ISA features.
@nick-knight @topperc @kito-cheng could you post code sequences with actual values of vl and lmul? It would be useful for discussion and performance analysis.
A data point that may not be very conclusive, in case it helps.
Reductions benefit from tail undisturbed when using vl to avoid epilog code. We can easily accumulate partial reductions (using vl) into a vector that has been initialised (under vl=VLMAX) with the neutral element. That said, this is the only scenario I've seen where it is very convenient (at least for the programmer) to have this behaviour. Almost all other code I've seen seems to do fine with tail agnostic (as in the code does not care about the tail).
That said, one could argue that this particular algorithm is not suited to be implemented without epilog code, and we could do better with a first VLMAX loop (so the tail policy is not relevant) followed by epilog code that does a final accumulation (of the partial vector reduced from the VLMAX loop and the values from the epilog) using tail-agnostic and mask-undisturbed, followed by the reduction itself (v{,f}red<op>.vs).
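The strip-mined reduction described above can be sketched as a scalar model (illustrative only; the accumulator array stands in for a vector register, and vlmax <= 16 is an assumption of the sketch). Each partial accumulation with vl < vlmax relies on tail-undisturbed semantics to preserve earlier partial sums in the tail.

```c
#include <stddef.h>

/* Scalar model of a reduction using tail-undisturbed partial accumulation. */
double reduce_sum(const double *src, size_t n, size_t vlmax)
{
    double acc[16] = {0};            /* neutral element for addition; assumes vlmax <= 16 */
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < vlmax) ? n - i : vlmax;
        for (size_t j = 0; j < vl; j++)
            acc[j] += src[i + j];    /* tail-undisturbed: acc[vl..vlmax-1] keep prior partials */
        i += vl;
    }
    double sum = 0.0;                /* final horizontal reduction (v{,f}red<op>.vs analogue) */
    for (size_t j = 0; j < vlmax; j++)
        sum += acc[j];
    return sum;
}
```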
Dot products are essentially reductions and they seem important in many applications these days.
Undisturbed allows for avoiding register spills. Software can retain data in the "tail" of a register to be used in subsequent operations without memory ops. This may be its most beneficial use, and a sufficiently positive use for it to have justified including it [or so thought the TG]. However, software folk are ingenious and other important variations and usages of undisturbed are undoubtedly discoverable.
If I'm not mistaken, we just saw another use-case of tail-undisturbed in the conversation at #671: implementing circular shift. As in other examples, it could also be simulated by using masking and mask-undisturbed, at the extra cost of initializing and maintaining the mask.
I have some uses for Tail-Undisturbed which produce fairly optimal code (no setting up and use of an additional mask register, thus reducing register spills). I think tu is a useful feature, as it allows me to work on part of a vector in situ:
https://godbolt.org/z/cdcavjTs1
Note: these have not been tested yet.
#include <riscv_vector.h>
/* Examples of using Tail-Undisturbed to perform operations which otherwise would require
   additional setup, vector registers and CPU/VPU time to perform */
vint64m8_t vrotatedown_vx_i64m8 (vint64m8_t src, size_t offset, size_t vl)
{
    vint64m8_t tmp;
    size_t offset_diff = vl - offset;
    /* Slide elements 0..(offset - 1) of src up to the top positions in tmp (starting at offset_diff) */
    tmp = vslideup_vx_i64m8(tmp, src, offset_diff, vl);
    /* THIS NEEDS Tail Undisturbed for it to work.
       Fortunately tu is automatically set in the vsetvli instruction. */
    /* Slide down src by offset and merge with tmp using Tail Undisturbed */
    tmp = vslidedown_vx_i64m8(tmp, src, offset, offset_diff);
    return tmp;
}
vrotatedown_vx_i64m8(__rvv_int64m8_t, unsigned int, unsigned int): # @vrotatedown_vx_i64m8(__rvv_int64m8_t, unsigned int, unsigned int)
sub a2, a1, a0
vsetvli a1, a1, e64,m8,ta,mu
vslideup.vx v16, v8, a2
vsetvli a1, a2, e64,m8,tu,mu
vslidedown.vx v16, v8, a0
vmv8r.v v8, v16
ret
/* A more general version of vrotatedown_vx is to slide 2 src operands.
   This performs a rotate when src1 == src2 */
vint64m8_t vslidedown_vv_i64m8 (vint64m8_t src1, vint64m8_t src2, size_t offset, size_t vl)
{
    vint64m8_t tmp;
    size_t offset_diff = vl - offset;
    /* Slide elements 0..(offset - 1) of src2 up to the top positions in tmp (starting at offset_diff) */
    tmp = vslideup_vx_i64m8(tmp, src2, offset_diff, vl);
    /* THIS NEEDS Tail Undisturbed for it to work.
       Fortunately tu is automatically set in the vsetvli instruction. */
    /* Slide down src1 by offset and merge with tmp using Tail Undisturbed */
    tmp = vslidedown_vx_i64m8(tmp, src1, offset, offset_diff);
    return tmp;
}
vslidedown_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int): # @vslidedown_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int)
sub a2, a1, a0
vsetvli a1, a1, e64,m8,ta,mu
vslideup.vx v24, v16, a2
vsetvli a1, a2, e64,m8,tu,mu
vslidedown.vx v24, v8, a0
vmv8r.v v8, v24
ret
vint64m8_t vmerge_vv_i64m8 (vint64m8_t src_low, vint64m8_t src_high, size_t offset, size_t vl)
{
    vint64m8_t tmp;
    tmp = src_high;
    /* THIS NEEDS Tail Undisturbed for it to work.
       I can't find a way to set tu in the vsetvli instruction with the current intrinsics,
       so THIS FUNCTION WILL NOT WORK when using vmv_v_v_i64m8();
       using vslidedown_vx_i64m8() instead, which automatically sets tu. */
    /* Merge src_low elements 0..(offset - 1) with tmp (src_high) elements offset..(vl - 1) using Tail Undisturbed */
    // tmp = vmv_v_v_i64m8(src_low, offset); ta is set, but we need tu
    tmp = vslidedown_vx_i64m8(tmp, src_low, 0, offset);
    return tmp;
}
vmerge_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int): # @vmerge_vv_i64m8(__rvv_int64m8_t, __rvv_int64m8_t, unsigned int, unsigned int)
vsetvli a0, a0, e64,m8,tu,mu
vslidedown.vi v16, v8, 0
vmv8r.v v8, v16
ret
Just need a way to set tu with the intrinsics...
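As a sanity check on the rotate sequence above, here is a scalar reference model (untested assumption, mirroring the intrinsic code): the slideup places src[0..offset-1] at positions vl-offset..vl-1, and the tail-undisturbed slidedown fills positions 0..vl-offset-1 with src[offset..vl-1], so element i of the result is src[(i + offset) % vl].

```c
#include <stddef.h>

/* Scalar reference model of vrotatedown_vx_i64m8 above. */
void rotatedown_ref(long long *dst, const long long *src, size_t offset, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        dst[i] = src[(i + offset) % vl];   /* rotate-down by offset */
}
```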
We're keeping undisturbed.
According to a compiler engineer, tail undisturbed is completely useless for a vectorizing compiler.
Completely useless is in my opinion a bit of an extreme stance: I'd say that the common approaches to vectorization do not benefit from the undisturbed policy in the tail.
However, as I've stated above, reductions do benefit from it. Think of partial reductions being accumulated in a register fully initialised before the loop (with vlmax) and then partially accumulated into that vector; the last iterations may not accumulate all the elements, but you still want to preserve their earlier partial values. After the loop, the final reduction is done horizontally over this vector (using a v{,f}red<op>.vs instruction, again with vlmax). If we do not have undisturbed, partial accumulation is a bit more involved, as it requires doing the partial accumulation and then merging it with the partial-reductions vector under a mask vid < vl. (Perhaps at the CPU implementation level that would not be a big deal?)
Beyond vectorization, others have pointed there are uses for tail undisturbed.
(Now if you ask me: we could get rid of the mask policy and we could hardcode it to mask undisturbed. I don't think anyone has presented a use case for mask agnostic)
Not all vector code is generated by a vectorizing compiler.
I am relieved that Tail Undisturbed will remain in the specification as I use it a lot (actually most of my code). But I was wondering why this issue exists in the first place and how, using Tail Agnostic, code can be as fast (and compact) as using Tail Undisturbed for, say, the following vector max code (as an example):
https://godbolt.org/z/d9xMTj3rd
#include <float.h>
#include <riscv_vector.h>
void max_flt( const float *p_src, uint32_t len, float *p_res )
{
float new_val;
float max_val = FLT_MIN;
while ( len > 0 )
{
new_val = *p_src++;
if ( max_val < new_val )
{
max_val = new_val;
}
len --;
}
*p_res = max_val;
}
Also, what does the output of a vectorizing compiler look like for the above code? How does it do it?
Here is my Tail Undisturbed version and assembly output: https://godbolt.org/z/xdaf1EMr8
#include <float.h>
#include <riscv_vector.h>
/* Find the maximum value in the vector *p_src of length len, using RISC-V Vector strip-mining */
void max_flt( const float *p_src, uint32_t len, float *p_res )
{
uint32_t len_ctr; /* The remaining number of elements to process */
vfloat32m8_t v_new_vals; /* New values read from memory */
vfloat32m8_t v_max_vals; /* Current maximum values */
vfloat32m1_t v_max_val; /* Final max value */
size_t vl;
len_ctr = len;
v_max_vals = vfmv_v_f_f32m8( FLT_MIN, len_ctr ); /* Set the max vector to FLT_MIN (note: FLT_MIN is the smallest positive float; -FLT_MAX would be needed to handle negative inputs) */
v_max_val = vfmv_v_f_f32m1( FLT_MIN, 1 ); /* Set the final reduction result likewise */
while ( len_ctr > 0 )
{
vl = vsetvl_e32m8( len_ctr );
/* Load the next values from memory */
v_new_vals = vle32_v_f32m8( p_src, len_ctr );
/* Find the maximum of the current and new values *** REQUIRES TAIL UNDISTURBED *** */
v_max_vals = vfmax_vv_f32m8( v_max_vals, v_new_vals, len_ctr );
/* Update the pointer and remaining length */
p_src += vl;
len_ctr -= vl;
}
/* Reduce the max vector to the scalar result */
v_max_val = vfredmax_vs_f32m8_f32m1( v_max_val, v_max_vals, v_max_val, len );
/* Save the scalar max result */
vse32_v_f32m1( p_res, v_max_val, 1 );
}
Using LLVM, this compiles to:
.LCPI0_0:
.word 0x00800000 # float 1.17549435E-38
max_flt(float const*, unsigned int, float*): # @max_flt(float const*, unsigned int, float*)
lui a3, %hi(.LCPI0_0)
flw ft0, %lo(.LCPI0_0)(a3)
vsetvli zero, a1, e32, m8, ta, mu
vfmv.v.f v8, ft0
vsetivli zero, 1, e32, m1, ta, mu
vfmv.v.f v25, ft0
beqz a1, .LBB0_3
mv a3, a1
.LBB0_2: # =>This Inner Loop Header: Depth=1
vsetvli a4, a3, e32, m8, ta, mu ##### This should be tu TAIL UNDISTURBED (currently no way to specify in LLVM)
vle32.v v16, (a0)
vfmax.vv v8, v8, v16
slli a5, a4, 2
sub a3, a3, a4
add a0, a0, a5
bnez a3, .LBB0_2
.LBB0_3:
vsetvli zero, a1, e32, m8, tu, mu
vfredmax.vs v25, v8, v25
vsetivli zero, 1, e32, m1, ta, mu
vse32.v v25, (a2)
ret
If someone can use the vector intrinsics and port this function using Tail Agnostic (& LLVM) in the most optimal way possible, then we can compare and discuss the issues so I (and others) can better understand the implications on different platforms.
I think this is a great use-case for tail undisturbed. The point @solomatnikov was raising with this Github Issue is that tail-undisturbed use-cases, like yours, can be simulated using tail-agnostic and mask-undisturbed. Here is the prototypical code sequence he suggested earlier in this thread:
vsetvli t1, x0, vtypei
vmv.v.i v1, 0
vid.v v0
vmslt.vx v0, v0, t2 # t2 is the number of non-zero elements
# use v0 as a mask
Integrating this into a strip-mined loop like yours will require adding some "boilerplate" code reminiscent of fringe-case handling in the SIMD ISAs of yore. In other words, I claim that a tail-undisturbed implementation like yours leads to more compact code than a tail-agnostic implementation.
I feel this addresses your question of program compactness. You also asked about speed. That is implementation-dependent. For example, an implementation could execute tail-undisturbed instructions much more slowly than tail-agnostic mask-undisturbed instructions, plus the aforementioned boilerplate/overhead. (I'm not claiming this is a useful design point, just that it's possible.)
A third question you asked concerns the autovectorization capabilities of current compilers. I'm afraid I can't address this one.
And in case anyone would like context on comments in your code snippets concerning limitations of the RVV intrinsics API, there is extensive discussion over at https://github.com/riscv/rvv-intrinsic-doc/issues/27 (my summary: https://github.com/riscv/rvv-intrinsic-doc/issues/27#issuecomment-870713856).
@nick-knight thank you. I have another question. You said:
You also asked about speed. That is implementation-dependent. For example, an implementation could execute tail-undisturbed instructions much more slowly than tail-agnostic mask-undisturbed instructions, plus the aforementioned boilerplate/overhead.
Would there be a difference in speed between Tail Agnostic and Undisturbed when the vector length vl = VLMAX? I would assume they would be the same speed?
If they are the same speed then my code will be faster for all iterations except the last when vl < VLMAX.
I also don't see why Mask Undisturbed will be faster than Tail Undisturbed when vl = VLMAX? I would assume they would be the same speed.
Anyway, this is a simple example. In more complex code I have used masking (undisturbed) and require Tail Undisturbed as well when vl < VLMAX. Setting up the mask to also cater for Tail Undisturbed will add extra code and registers, making it more complex, bigger, slower on vector units that don't have agnostic speed-ups, heavier on registers, and less elegant.
That's the point: code designed around Tail Agnostic will be slower on vector units that don't have register renaming or other agnostic speed-ups. So as programmers we will have to produce two performance versions, something I think the vector design team is against.
I'm not clear what you are getting at here: are you hoping for an architectural guarantee that the relative speed of tail-undisturbed must be sufficiently close to that of tail-agnostic? Currently the RISC-V ISA says nothing, anywhere, about relative instruction speeds, and I don't see this changing. The architecture describes the functionality, the microarchitecture decides the speeds, and implementors risk uncompetitive products by introducing performance cliffs. From my (cynical) software perspective, it's best to write as much tail-undisturbed software as possible, driving the market forces to ultimately reduce performance cliffs.
Pragmatically, my strategy is to write tail-undisturbed code where it is appropriate (like your example), but to revisit this decision when tuning for a particular microarchitecture. In that context, I'm already dealing with several "performance versions". For example, if the cost of vfredmax is sufficiently large relative to vfmax --- which depends on the implementation --- I may want to consider targeting LMUL < 8. Additionally, I am considering a long list of variations for cases of small application vectors (len), where "small" is understood relative to things like VLEN and datapath width(s). These are all decisions that a vectorizing compiler could (in principle) automate, but the RVV intrinsics (or assembly) programmer has to address explicitly.
Just to add to this closed issue discussion, the conversation seems to be missing the obvious point that in-order implementations can implement tail-undisturbed at full speed with no cost. If we were to force these smaller implementations to execute longer code sequences to synthesize tail-undisturbed when useful, they would be slower (not to mention code size growth issue).
Of course, assumption throughout is that software should only use tu when it is actually needed.
To reduce implementation complexity, renamed machines can synthesize tail-undisturbed using the underlying microarchitecture needed for mask-undisturbed (e.g., execute tail-undisturbed as a full-length mask-undisturbed uop with an internal mask generated from vl - basically generating the suggested code sequence on the fly). This still adds some complexity versus only supporting tail-agnostic, but should not perform worse than executing the equivalent macro code sequence. If tu is not commonly used in code, then this slow implementation would be fine. If people actually use it (as I suspect they will), then further optimizations will be found to be necessary, but then having to run common instructions fast is not exactly a bad thing.
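The synthesis described above can be sketched as a scalar model (illustrative only; the element-wise add is a stand-in for any tail-undisturbed vector op): the uop runs over all vlmax elements under an internal mask derived from vl, and mask-undisturbed semantics preserve the tail.

```c
#include <stddef.h>

/* Scalar model: tail-undisturbed op executed as a full-length
 * mask-undisturbed uop with mask[i] = (i < vl). */
void tu_as_masked_uop(int *dst, const int *a, const int *b, size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vlmax; i++) {
        if (i < vl)                 /* internal mask generated from vl */
            dst[i] = a[i] + b[i];   /* active element: compute */
        /* inactive elements (the tail) retain their old values, per mask-undisturbed */
    }
}
```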
Did the x86 AAA instruction have use (and abuse) cases? - Yes, https://www.hugi.scene.org/online/coding/hugi%2017%20-%20coaax.htm
Did it make the code that used it more compact? - Yes.
Did it reduce the number of registers in the code that used it? - Most likely.
Why is it not in the RISC-V ISA? - Because it doesn't make sense to include an instruction just because there are some marginal, rare use cases (with obvious and efficient workarounds).
Another example is VAX: "The INDEX instruction is used to calculate the address of an array element while at the same time checking to see that the index fits in the array bounds. This is clearly an important function to accurately detect errors in high-level languages statements."
Why is the INDEX instruction not in the RISC-V ISA? - Because it doesn't make sense if you believe the arguments from this paper: D. Patterson, D. Ditzel, "The Case for the Reduced Instruction Set Computer".
I argue that the complexity cost of having tail undisturbed in addition to tail agnostic is significantly worse than AAA and INDEX, because it affects most RVV instructions, which means verification space explosion.
It also contradicts the open source spirit by raising the complexity barrier and preventing a variety of implementations.
Another downside: confused SW people will use (and abuse) tail undisturbed in all kinds of unnecessary ways (as evident from this thread). And then they will be even more confused by the subtle, hard-to-predict, hard-to-explain performance issues of tail undisturbed in high-performance RVV implementations.
Also, with renaming and tail undisturbed some widening instructions need to read 5 vector registers instead of at most 4 without tail undisturbed.
Register file ports are expensive, which means some implementations might choose to execute such instructions at half throughput in tail undisturbed mode.
Is there a reason to keep tail undisturbed?
Tail agnostic subsumes both tail undisturbed (natural for in-order cores) and tail overwrite-with-ones (natural for out-of-order cores); therefore tail agnostic is good enough for all core implementations.
1) If tail undisturbed is required to be implemented in out-of-order cores, then every vector load has to be treated as a masked vector load, i.e. it should depend on/read the previous vector register value/physical register, because in case of an exception the rest of the vector register, starting from vstart, must remain undisturbed. This can have a performance impact on vector loads and adds non-trivial design and verification complexity.
2) Some instructions always operate as tail agnostic, so dropping tail undisturbed would make the ISA more consistent.
3) Additional mode has a non-trivial verification cost.
Similar arguments for dropping mask agnostic.