riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International
953 stars 271 forks source link

Count leading zeros and shifts by signed amounts #691

Open dsharletg opened 3 years ago

dsharletg commented 3 years ago

Hi all,

I have been reviewing the riscv-v spec version 0.10. There are a few small missing features/instructions that I have found useful when implementing fixed point arithmetic on other targets:

Vector count leading zeros (clz) is useful to compute

floor(log2(x)) = sizeof(x) * 8 - 1 - clz(x)

Shifts by signed amounts are useful because they compute functions like:

floor(x / 2^y) = shift_right(x, y)
floor(x * 2^y) = shift_left(x, y)
round(x / 2^y) = rounding_shift_right(x, y)
round(x * 2^y) = rounding_shift_left(x, y)

Where the above hold for positive and negative y. The main reason these are useful isn't necessarily to reduce cycles, but to reduce program size/complexity/register pressure.

Together, these are useful for implementing many good fixed point approximations to expensive operations. For example:

These ideas are used to implement a variety of helper functions here: https://github.com/halide/Halide/blob/00bfad7ed53ce87f20b41594f100451f3043d0cf/apps/hannk/halide/common_halide.cpp#L63-L226 The code is Halide, but hopefully fairly readable.

In that implementation, I used cubic polynomials, which give a max relative error of ~0.05% for both log2 and exp2. I took a quick survey of some generated code on AArch64, and found that signed shifts reduce the number of instructions by ~20% in some cases using those helpers. This didn't cause any stack spills, so all of the extra instructions are arithmetic. The impact would be even worse if the added register pressure caused stack spills. (I determined this by just explicitly writing the source code assuming that shifts must always have unsigned RHSes and examining the generated code.)

There is a comment on line 171 mentioning these are slow on x86. Not coincidentally, x86 lacks both vector clz(x) and shifts by signed amounts :)

dsharletg commented 2 years ago

I think that it would really help in the future to take a look at this before finalizing the v1.0 spec. I suggest at least for now just changing this wording in section 11.6 Vector Single-Width Shift Instructions:

Only the low lg2(SEW) bits of the shift-amount value are used to control the shift amount.

To something indicating that non-zero bits above lg2(SEW) are undefined behavior, and so later could be changed to interpret negative values as shifts of the opposite direction.