riscvarchive / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

There is no vrscatter instruction in the spec. #790

Open Zissi-Lei opened 2 years ago

Zissi-Lei commented 2 years ago

Hi, I'm reading the rvv-1.0 spec and found that there is a vrgather instruction but no corresponding vrscatter instruction. Was it left out for a particular reason? I'd like to know why. Thanks for your time!

nick-knight commented 2 years ago

Suppose you want to scatter data [A B C D] to destination indices [1 3 5 7], as follows:

index:  7 6 5 4 3 2 1 0
before: x x x x x x x x
after:  D x C x B x A x

To do so, we can use masked vrgather.vv, with the following input operands:

data:  x x x x D C B A
mask:  1 0 1 0 1 0 1 0
index: 3 x 2 x 1 x 0 x

(Here, x means "do not care".) The challenge then becomes constructing these mask and (source) index vector operands from the destination indices ([1 3 5 7]).
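A minimal Python sketch of the example above, modelling the semantics only and not real vector code. The function name `masked_vrgather` and the list-based register model are assumptions made for illustration. Lists are written with element 0 first, the reverse of the diagrams above, and inactive destination elements are left undisturbed.

```python
def masked_vrgather(dest, data, index, mask):
    """Toy model of a masked, mask-undisturbed vrgather.vv (illustration only)."""
    vlmax = len(data)
    out = list(dest)
    for i, active in enumerate(mask):
        if active:
            j = index[i]
            # vrgather writes 0 when the source index is out of range.
            out[i] = data[j] if j < vlmax else 0
    return out


X = "x"  # "do not care"
dest = [X] * 8
data = ["A", "B", "C", "D", X, X, X, X]
mask = [0, 1, 0, 1, 0, 1, 0, 1]
index = [X, 0, X, 1, X, 2, X, 3]  # only read where the mask is set

after = masked_vrgather(dest, data, index, mask)
print(after)  # ['x', 'A', 'x', 'B', 'x', 'C', 'x', 'D']
```

Read right to left, the result matches the "after" row: D x C x B x A x.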

For example, if the mask is available and the destination indices form an increasing sequence, like in this case, we can use viota.m to construct the corresponding source indices from the (source) mask vector, as discussed in the context of "vdecompress".
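As a rough sketch of the viota.m idea, again a Python model rather than vector code: `viota` here is a hypothetical helper that writes to each element the number of set mask bits in the elements before it. For an increasing list of destination indices, the values at the active positions are exactly the source indices the gather needs.

```python
def viota(mask):
    """Toy model of viota.m: out[i] = number of set mask bits below element i."""
    out, count = [], 0
    for bit in mask:
        out.append(count)
        count += bit
    return out


vl = 8
dest_indices = [1, 3, 5, 7]  # must be strictly increasing for this trick
data = ["A", "B", "C", "D", "x", "x", "x", "x"]

# Build the mask from the destination indices (element 0 first).
mask = [1 if i in dest_indices else 0 for i in range(vl)]
source_index = viota(mask)

# Masked gather: active elements read data[source_index[i]], others stay "x".
after = [data[source_index[i]] if mask[i] else "x" for i in range(vl)]

print(source_index)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(after)         # ['x', 'A', 'x', 'B', 'x', 'C', 'x', 'D']
```

The values viota produces at inactive positions are ignored, which is why the index operand can hold "do not care" entries there.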

In other cases, I suspect it will be best to use an indexed store, perhaps preceded by a unit-stride store (for the "undisturbed" elements) followed by a unit-stride load. If using the memory system is not an option, then in the worst case you might end up constructing mask and index one element at a time, with slides and bit-twiddling; I haven't thought through the details.
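The memory-based fallback can be sketched the same way. This is a Python model under stated assumptions: `scatter_via_memory` is a hypothetical name, the list `buf` stands in for a scratch buffer in memory, and indices are element indices (real indexed stores take byte offsets). The loop writes in element order, so with duplicate destination indices the last write wins, as it would with an ordered indexed store.

```python
def scatter_via_memory(dest, data, dest_indices):
    """Toy model: unit-stride store, indexed store, unit-stride load."""
    buf = list(dest)                  # unit-stride store of the old destination
    for k, d in enumerate(dest_indices):
        buf[d] = data[k]              # indexed store of the k-th data element
    return list(buf)                  # unit-stride load of the merged result


dest = ["x"] * 8
data = ["A", "B", "C", "D"]

# Destination indices need not be increasing here.
after = scatter_via_memory(dest, data, [5, 1, 7, 3])
print(after)  # ['x', 'B', 'x', 'D', 'x', 'A', 'x', 'C']
```

Unlike the viota-based construction, this variant places no ordering requirement on the destination indices.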

howjmay commented 12 months ago

Hi

@nick-knight, I am curious about the performance. Doesn't a unit-stride store followed by a unit-stride load have higher overhead than a vrgather? The store and the load each have to go through memory, which I would expect to cost far more than doing the work in registers.

nick-knight commented 12 months ago

“Performance” is a consequence of the implementation, not the interface (ISA). This repo only concerns the interface. Nowhere in this repo does it discuss “overhead”, “runtime”, “cycles”, etc. Your question should be directed at the hardware engineers who are implementing the vector processor you are targeting.

If you were on my engineering team, I’d tell you to implement both variants, benchmark them, and report back to me which one was better.

I hope this makes sense!

howjmay commented 12 months ago

Thank you. I was wondering whether there is a general guideline for an efficient implementation, but as you said, it shouldn't be a topic for this repo.