Count leading zeros and shifts by signed amounts

Hi all,

I have been reviewing the riscv-v spec version 0.10. There are a few small missing features/instructions that I have found useful when implementing fixed point arithmetic on other targets:

Vector count leading zeros: clz(x).
Shifts by signed amounts, particularly the rounding variants. AFAICT this doesn’t need new instructions, only a change in how the existing shift amount operand is interpreted, from unsigned to signed.

Vector count leading zeros (clz) is useful for computing the integer log2 of a nonzero x:

floor(log2(x)) = sizeof(x) * 8 - 1 - clz(x)
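
As a minimal scalar sketch of the identity above (the helper names and the default 32-bit width are assumptions for illustration, not anything from the spec):

```python
def clz(x: int, bits: int = 32) -> int:
    """Count leading zeros of an unsigned `bits`-wide value.

    Assumption: clz(0) returns `bits`, one common convention.
    """
    if not 0 <= x < (1 << bits):
        raise ValueError("x must fit in an unsigned integer of the given width")
    return bits - x.bit_length()


def floor_log2(x: int, bits: int = 32) -> int:
    """floor(log2(x)) for x > 0, i.e. sizeof(x) * 8 - 1 - clz(x)."""
    if x <= 0:
        raise ValueError("floor_log2 is only defined for x > 0")
    return bits - 1 - clz(x, bits)
```

For example, floor_log2(1) is 0 and floor_log2(3) is 1.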

Shifts by signed amounts are useful because they compute functions like:

floor(x / 2^y) = shift_right(x, y)
floor(x * 2^y) = shift_left(x, y)
round(x / 2^y) = rounding_shift_right(x, y)
round(x * 2^y) = rounding_shift_left(x, y)

The above hold for both positive and negative y, so a negative y shifts in the opposite direction. The main reason these are useful isn't necessarily to reduce cycles, but to reduce program size, complexity, and register pressure.
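
To make the intended semantics concrete, here is a rough scalar model of the four operations. It assumes ties round toward +infinity in the rounding variants and ignores overflow and saturation (Python integers are unbounded), so it is a sketch of the math rather than of any particular instruction:

```python
def shift_right(x: int, y: int) -> int:
    """floor(x / 2**y); a negative y shifts left instead."""
    return x >> y if y >= 0 else x << -y


def shift_left(x: int, y: int) -> int:
    """floor(x * 2**y); a negative y shifts right instead."""
    return shift_right(x, -y)


def rounding_shift_right(x: int, y: int) -> int:
    """round(x / 2**y), with ties rounded toward +infinity (an assumption)."""
    if y <= 0:
        return x << -y          # left shifts are exact, nothing to round
    return (x + (1 << (y - 1))) >> y


def rounding_shift_left(x: int, y: int) -> int:
    """round(x * 2**y), with ties rounded toward +infinity (an assumption)."""
    return rounding_shift_right(x, -y)
```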

Together, these are useful for implementing many good fixed point approximations to expensive operations. For example:

Approximating log2(x/2^N)*2^M with a polynomial can be done with a clz, a few signed shifts, and a few vsmul operations (2 per polynomial degree above 1).
Approximating exp2(x/2^N)*2^M with a polynomial is very similar (but doesn’t require clz).
Combining the above, we can approximate a*(b/c) = a*exp2(log2(b) - log2(c)) very smoothly (which is sometimes more important than accuracy), without using division. This can be useful even when the degree of the polynomial is 1 (i.e. no vsmul required).
Similarly, 2^N/sqrt(x) = 2^N*exp2(-log2(x)/2) and sqrt(x) = exp2(log2(x)/2).
Functions that appear frequently in machine learning can be nicely approximated, e.g. tanh/logistic, softmax intermediates, etc.
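
The degree-1 case can be sketched in a few lines. This is not the Halide code linked below, just an illustration under some assumptions: scalar Python integers stand in for vector lanes, log2 values are held with q fractional bits, and all function names are made up for this example. Note that the shift amounts (q - e and i + q_out - q) can be either sign, which is where signed shifts come in.

```python
def shift_left(x: int, y: int) -> int:
    """floor(x * 2**y); a negative y shifts right."""
    return x << y if y >= 0 else x >> -y


def approx_log2(x: int, q: int = 16) -> int:
    """Approximate log2(x) * 2**q for x > 0, linear on each octave."""
    e = x.bit_length() - 1                 # floor(log2(x)), i.e. bits - 1 - clz(x)
    frac = shift_left(x - (1 << e), q - e)  # (x / 2**e - 1) with q fractional bits
    return (e << q) + frac


def approx_exp2(v: int, q: int = 16, q_out: int = 0) -> int:
    """Approximate 2**(v / 2**q) * 2**q_out, linear on each octave."""
    i = v >> q                             # integer part (floor, may be negative)
    f = v - (i << q)                       # fractional part in [0, 2**q)
    return shift_left((1 << q) + f, i + q_out - q)


def approx_mul_div(a: int, b: int, c: int, q: int = 16) -> int:
    """Approximate a * (b / c) for b, c > 0 without a division."""
    ratio = approx_exp2(approx_log2(b, q) - approx_log2(c, q), q, q_out=q)
    return (a * ratio + (1 << (q - 1))) >> q   # rounding shift right by q


def approx_sqrt(x: int, q: int = 16) -> int:
    """Approximate sqrt(x) for x > 0 as exp2(log2(x) / 2)."""
    return approx_exp2(approx_log2(x, q) >> 1, q)
```

With these definitions approx_mul_div(100, 5, 3) gives 175 against a true value of about 166.7, so the degree-1 version is coarse; higher-degree polynomials for the fractional part would tighten this at the cost of the extra multiplies mentioned above.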

These ideas are used to implement a variety of helper functions here: https://github.com/halide/Halide/blob/00bfad7ed53ce87f20b41594f100451f3043d0cf/apps/hannk/halide/common_halide.cpp#L63-L226 The code is Halide, but hopefully fairly readable.

In that implementation, I used cubic polynomials, which give a max relative error of ~0.05% for both log2 and exp2. I took a quick survey of some generated code on AArch64, and found that signed shifts reduce the number of instructions by ~20% in some cases using those helpers. This didn't cause any stack spills, so all of the extra instructions are arithmetic. The impact would be even worse if the added register pressure caused stack spills. (I determined this by just explicitly writing the source code assuming that shifts must always have unsigned RHSes and examining the generated code.)
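
For a sense of where the extra instructions come from, here is one possible way to emulate a signed shift when only shifts by non-negative amounts are available. This is a hypothetical lane-wise sketch, not necessarily what the helpers or the compiler actually produced:

```python
def signed_shift_right_emulated(xs, ys):
    """Lane-wise floor(x / 2**y) using only shifts by non-negative amounts."""
    out = []
    for x, y in zip(xs, ys):
        right_amt = max(y, 0)    # extra op: clamp
        left_amt = max(-y, 0)    # extra ops: negate and clamp
        # At most one of the two amounts is nonzero, so applying both shifts
        # is equivalent to selecting between them, without a per-lane branch.
        out.append((x >> right_amt) << left_amt)
    return out
```

A single signed shift becomes a negate, two clamps, and two shifts, plus the temporaries needed to hold the intermediate amounts.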

There is a comment on line 171 mentioning these are slow on x86. Not coincidentally, x86 lacks both vector clz(x) and shifts by signed amounts :)

riscv / riscv-v-spec

Count leading zeros and shifts by signed amounts #691