Suggestion for signed to unsigned saturation vnclip variant (code attached)

ZPedro commented 5 years ago

This issue suggests adding a vnclip variant that performs signed-to-unsigned saturation. The motivation is signal processing applications where both underflow and overflow need to be guarded against, and where the canonical output format is unsigned. The best workaround I could come up with is (assuming 16->8 bit narrowing and an immediate scale factor):

vsadd.vx v8, v8, x31 # (-128)<<scale
vnclip.vi v4, v8, scale
vxor.vx v4, v4, x30 # 0x80

This workaround ties up two scalar registers to hold constants, and costs one or two additional instructions per vnclip (in some cases, adding (-128)<<scale can be folded into an earlier addition).
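To make the equivalence explicit, here is a minimal scalar sketch of the three-instruction sequence above, checked against a plain clamp to 0..255. It is a model under stated assumptions, not spec text: the rounding mode is taken to be round-down (truncation), scale is limited to 0..8 so that (-128)<<scale still fits in 16 bits, and the function names are made up for illustration.

```python
def sat(value, lo, hi):
    """Clamp value to the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def workaround_bias_xor(x, scale):
    """Model one 16-bit element going through vsadd / vnclip / vxor.

    Assumptions: round-down rounding mode, 0 <= scale <= 8.
    """
    assert -32768 <= x <= 32767 and 0 <= scale <= 8
    biased = sat(x + (-128 << scale), -32768, 32767)  # vsadd.vx v8, v8, x31
    narrowed = sat(biased >> scale, -128, 127)        # vnclip.vi v4, v8, scale
    return (narrowed & 0xFF) ^ 0x80                   # vxor.vx v4, v4, x30


def reference_clip(x, scale):
    """What a signed-to-unsigned narrowing clip would compute directly."""
    return sat(x >> scale, 0, 255)


print(workaround_bias_xor(300, 1))  # 150
print(workaround_bias_xor(-1, 0))   # 0
print(workaround_bias_xor(300, 0))  # 255
```

The bias shifts the useful range 0..255 onto the signed range -128..127 before the signed clip, and the final XOR with 0x80 shifts it back.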

This suggestion is the main takeaway from an attempt to implement planar YUV 420 to array-of-structures RGB conversion using the vector extension. I came up with two versions (both attached): one that relies on segmented loads, and one that relies on vrgather to undo chroma subsampling, the latter of which can conceptually scale to a subsampling factor of 3 (or another non-power-of-two factor). Coefficients are taken from Poynton.
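For readers without the attachments at hand, the idea of undoing subsampling with a gather can be sketched as follows. This is only an illustration of the general technique, not the attached code: it covers horizontal replication only, models vrgather on Python lists, and the helper names are hypothetical.

```python
def vrgather(source, indices):
    """Model of a register gather: out[i] = source[indices[i]].

    Out-of-range indices yield 0; the list length stands in for VLMAX here.
    """
    return [source[i] if i < len(source) else 0 for i in indices]


def upsample_indices(vl, factor):
    """Index vector that repeats each chroma sample `factor` times."""
    return [i // factor for i in range(vl)]


chroma = [10, 20, 30, 40]
print(vrgather(chroma, upsample_indices(8, 2)))
# [10, 10, 20, 20, 30, 30, 40, 40]
print(vrgather(chroma[:3], upsample_indices(9, 3)))
# [10, 10, 10, 20, 20, 20, 30, 30, 30]
```

Because the index vector is just i divided by the factor, nothing in this scheme depends on the factor being a power of two.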

Writing this code brought additional insights, but those will have to be elaborated in a separate issue.

YUV420p2RGBX8888-vlseg2b.txt

YUV420p2RGBX8888-vrgather.txt

kasanovic commented 4 years ago

I could see adding vnclipsu and vnclipus to handle both possible changes in signedness, but this would chew up some more encoding space and probably require existing encodings to move.

kasanovic commented 4 years ago

If the vxsat flag is not important, then

vmax.vi v8, v8, 0 # Clip negative to zero
vnclipu.vi v4, v8, scale

will perform scaled clip of signed to unsigned without using scalar registers.
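A small scalar sketch of this two-instruction sequence shows both the result and where the vxsat caveat comes from. It assumes round-down rounding and a 16->8 bit narrowing with scale in 0..15, and it models vxsat as a simple boolean returned alongside the value; the function names are illustrative only.

```python
def sat(value, lo, hi):
    """Clamp value to [lo, hi]; also report whether clamping happened."""
    clamped = max(lo, min(hi, value))
    return clamped, clamped != value


def workaround_max_clipu(x, scale):
    """Model one 16-bit element going through max-with-zero, then vnclipu.

    Assumptions: round-down rounding mode, 0 <= scale <= 15.
    Returns (result, vxsat).
    """
    assert -32768 <= x <= 32767 and 0 <= scale <= 15
    non_negative = max(x, 0)                  # plain integer max: never sets vxsat
    return sat(non_negative >> scale, 0, 255) # vnclipu.vi v4, v8, scale


print(workaround_max_clipu(300, 0))  # (255, True)
print(workaround_max_clipu(300, 1))  # (150, False)
print(workaround_max_clipu(-5, 0))   # (0, False)
```

The last line is the trade-off: a negative input is clipped to zero by the max, so the saturation on the low side is never reported through vxsat.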

ZPedro commented 4 years ago

Good one; while I could see some applications (e.g. studio transcoding) caring about vxsat, the applications I have worked on definitely don't. I am still not used to thinking with readily available min and max instructions (however, I believe you mean vmax.vx v8, v8, x0).

I trust you will find an appropriate balance between encoding space usage, implementation constraints, and software needs; for reference, in the vlseg2b variant of planar YUV 420 to array-of-structures RGB conversion, the main loop dynamically executes 65 non-vsetvli vector instructions, regardless of the workaround used, while the loop would only require 53 if vnclipsu.vi were available.

kasanovic commented 4 years ago

Yes, should have been vmax.vx v8, v8, x0. The immediate max/min forms were not included because, given the small immediate field available, they are not very useful except for the value zero, which is supported through x0.

ZPedro commented 4 years ago

I updated my code for v0.9 of the spec:

Switched to vwmul and friends following removal of vwsmacc; results in a net decrease in instruction count, but a lot of those were vmv.v.i vx, 0, so in terms of meaningful instructions the net result is rather an increase.
Switched to new EEW load and store instructions. No impact.
Adopted modern vsetvli method for keeping vl as it was.
Miscellaneous fixes.

YUV420p2RGBX8888-vlseg2b.txt

YUV420p2RGBX8888-vrgather.txt

riscvarchive / riscv-v-spec

Suggestion for signed to unsigned saturation vnclip variant (code attached) #287