Open ZPedro opened 5 years ago
I could see adding vnclipsu and vnclipus to handle both possible changes in signedness, but this would chew up some more encoding space and probably require existing encodings to move.
If the vxsat flag is not important, then

    vmax.vi v8, v8, 0          # Clip negative to zero
    vnclipu.vi v4, v8, scale

will perform scaled clip of signed to unsigned without using scalar registers.
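To make the arithmetic of that max-then-clip sequence concrete, here is a minimal per-element sketch in Python. It is only a scalar model, not vector code: the function names are made up for illustration, 16->8 bit narrowing is assumed, and the rounding mode is assumed to be round-to-nearest-up (the real instruction follows vxrm).

```python
def round_shift(x, shift):
    """Arithmetic right shift with round-to-nearest-up (assumed rounding mode)."""
    if shift == 0:
        return x
    return (x + (1 << (shift - 1))) >> shift


def clip_signed_to_u8(x, shift):
    """Reference behaviour: scale a signed 16-bit value, then saturate to 0..255."""
    return min(max(round_shift(x, shift), 0), 255)


def max_then_unsigned_clip(x, shift):
    """Per-element model of the two-instruction sequence (hypothetical helper)."""
    y = max(x, 0)  # max with zero: negative inputs become 0
    # y is non-negative, so reading it as unsigned 16-bit does not change its value;
    # the unsigned narrowing clip then only has to saturate on the high side.
    return min(round_shift(y, shift), 255)
```

Under these assumptions both functions agree for every signed 16-bit input, which is why the sequence can stand in for a signed-to-unsigned clip whenever vxsat does not matter.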
Good one; while I could see some applications (e.g. studio transcoding) caring about vxsat, the applications I have worked on definitely don't. I am still not used to thinking with readily available min and max instructions (however, I believe you mean vmax.vx v8, v8, x0).
I trust you will find an appropriate balance between encoding space usage, implementation constraints, and software needs. For reference, in the vlseg2b variant of planar YUV 420 to array-of-structures RGB conversion, the main loop dynamically executes 65 non-vsetvli vector instructions regardless of the workaround used, whereas it would require only 53 if vnclipsu.vi were available.
Yes, that should have been vmax.vx v8, v8, x0. The immediate forms of max/min were not included because, given the small immediate field available, they are not very useful except for a zero value, which is already supported through x0.
I updated my code for v0.9 of the spec:
This issue suggests adding a vnclip variant that performs signed-to-unsigned saturation. The impetus is signal processing applications where both underflow and overflow need to be guarded against, and where the canonical output format is unsigned. The best workaround I could figure out is (assuming 16->8 bit narrowing and an immediate scale factor):
This workaround ties up two scalar registers to hold constants, and requires one or two additional instructions per vnclip (in some cases, adding (-128)<<scale can be folded into an earlier addition).
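The instruction sequence itself is not reproduced here, so the following Python sketch is only a guess at the arithmetic such a workaround relies on: bias the input by (-128)<<scale, do a signed narrowing clip, then flip the top bit of the 8-bit result. The two constants (the bias and 0x80), the function names, and the round-to-nearest-up rounding are all assumptions, and the biased value is assumed not to wrap in 16 bits.

```python
def round_shift(x, shift):
    """Arithmetic right shift with round-to-nearest-up (assumed rounding mode)."""
    if shift == 0:
        return x
    return (x + (1 << (shift - 1))) >> shift


def clip_signed_to_u8(x, shift):
    """Reference behaviour: scale a signed 16-bit value, then saturate to 0..255."""
    return min(max(round_shift(x, shift), 0), 255)


def biased_signed_clip(x, shift):
    """Per-element model of a bias-based workaround (hypothetical reconstruction)."""
    bias = -(128 << shift)  # first assumed constant: (-128)<<scale
    z = x + bias
    if not -32768 <= z <= 32767:
        raise ValueError("biased value would wrap in 16 bits")
    s = min(max(round_shift(z, shift), -128), 127)  # signed narrowing clip
    # Second assumed constant: XOR with 0x80 maps -128..127 onto 0..255.
    return (s & 0xFF) ^ 0x80
```

Because the bias is a multiple of 1<<scale, it passes through the rounding shift unchanged as -128, so the signed saturation bounds -128..127 line up exactly with 0..255 once the top bit is flipped.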
This suggestion is the main takeaway from an attempt to implement planar YUV 420 to array-of-structures RGB conversion using the vector extension. I came up with two versions (both attached): one that relies on segmented loads, and one that relies on vrgather to undo chroma subsampling, the latter of which can conceptually scale to a subsampling factor of 3 (or other non-power of two). Coefficients are taken from Poynton.
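As a rough illustration of why a gather generalizes to a subsampling factor of 3, here is a small Python model of undoing horizontal chroma subsampling with an index vector. It is not taken from the attached files; the helper names are hypothetical, and the gather is modelled simply as out[i] = src[idx[i]].

```python
def upsample_indices(vl, factor):
    """Index vector that repeats each source element `factor` times."""
    return [i // factor for i in range(vl)]


def gather(src, idx):
    """Element-wise model of a register gather: out[i] = src[idx[i]]."""
    return [src[j] for j in idx]


chroma = [10, 20, 30, 40]
by_two = gather(chroma, upsample_indices(8, 2))
by_three = gather(chroma, upsample_indices(6, 3))
print(by_two)    # [10, 10, 20, 20, 30, 30, 40, 40]
print(by_three)  # [10, 10, 10, 20, 20, 20]
```

Only the index vector depends on the factor, which is what lets the same approach cover non-power-of-two subsampling.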
Writing this code brought additional insights, but those will have to be elaborated in a separate issue.
YUV420p2RGBX8888-vlseg2b.txt
YUV420p2RGBX8888-vrgather.txt