riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International
953 stars 271 forks source link

Ambiguity in vcompress definition #796

Open kdockser opened 2 years ago

kdockser commented 2 years ago

6.5 Vector Compress Instruction has the following text:

The vector mask register specified by vs1 indicates which of the first vl elements of vector register group vs2 should be extracted and packed into contiguous elements at the beginning of vector register vd.

Presumably vl is used to specify how many elements in vd are updated. However, this sentence can easily be misinterpreted as saying that the first vl elements of vector register group vs2 are the only candidates to be moved to vd. I propose that the sentence be rewritten as something along the lines of:

The first vl true elements in the vector mask register specified by vs1 indicates which of the elements of vector register group vs2 should be extracted and packed into contiguous elements at the beginning of vector register vd.

aswaterman commented 2 years ago

The original text is being literal when it says "which of the first vl elements of vector register group vs2 should be extracted and packed": at most vl elements will be copied, and possibly fewer. (I'm not opposed to rewriting this sentence to be clearer, of course, but I don't agree that it's ambiguous.)

kdockser commented 2 years ago

According to section 3.5 Vector length Register vl

The vl register holds an unsigned integer specifying the number of elements to be updated with results from a vector instruction

Are you saying that vcompress doesn't follow this rule and in this case vl applies to how many of the sources are being evaluated to have TRUE masks?

aswaterman commented 2 years ago

Yes. The vcompress definition slyly indicates this is overridden by indicating that the set of tail elements is different for vcompress than for other instructions ("The remaining elements of vd are treated as tail elements").

kdockser commented 2 years ago

So, for example, if vl=5 and the first 5 masks are all F and the remainder are all T, everything is treated as being part of the Tail? If so, that doesn't seem very useful when the most likely use case is compressing the elements with the first 5 T masks.

aswaterman commented 2 years ago

That's a correct interpretation of the spec.

If the application vector length is actually 5, then pulling from elements 5..VLMAX-1 is not actually terribly useful, since they'll just contain garbage.

You can more or less obtain the operation you described by setting vl to max, then ignoring destination elements past element 4.

nick-knight commented 2 years ago

According to section 3.5 Vector length Register vl

The vl register holds an unsigned integer specifying the number of elements to be updated with results from a vector instruction

Are you saying that vcompress doesn't follow this rule and in this case vl applies to how many of the sources are being evaluated to have TRUE masks?

Other instructions also override the original definition of vl --- reductions, for example. The meanings of "active", "body", etc., may also change in these cases.

kdockser commented 2 years ago

Thanks Nick. While I agree that other instructions override the standard behavior, they do so with a clear statement indicating this deviation. In this case, the deviation is not clearly called out. It would help to have an explanation as to why there is a deviation (e.g., what is the use case). Also, SAIL code, or at least pseudo code would be invaluable to clarifying the intended behavior.

This instruction is more like vrgather than vadd in that it is collecting valid data which may lie beyond vl. In my example of vl=5, this vrgather could be followed by a store instruction that is intended to store the 5 compressed elements to contiguous memory. X86, for example, has a single instruction that compresses the elements based on the flags and then writes (just) them to memory; it would be nice if RISC-V could do this in two instructions without having to change the vl (or update the masks) between the two.