Open Zissi-Lei opened 2 years ago
Suppose you want to scatter data [A B C D] to destination indices [1 3 5 7], as follows:
index:  7 6 5 4 3 2 1 0
before: x x x x x x x x
after:  D x C x B x A x
To do so, we can use masked vrgather.vv, with the following input operands:
data:  x x x x D C B A
mask:  1 0 1 0 1 0 1 0
index: 3 x 2 x 1 x 0 x
(Here, x means "do not care".) The challenge then becomes constructing these mask and (source) index vector operands from the destination indices ([1 3 5 7]).
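The operands above can be checked with a small scalar model. This is only a sketch: `masked_vrgather_vv` is a hypothetical helper, not an intrinsic, and it assumes a mask-undisturbed policy so that inactive elements keep their old value. Python lists put element 0 first, so the vectors appear reversed relative to the diagrams, and the "do not care" index entries are arbitrarily set to 0.

```python
def masked_vrgather_vv(vd, vs2, vs1, mask):
    """Scalar model (hypothetical helper) of a masked vrgather.vv.

    Active elements read vs2[vs1[i]]; out-of-range indices yield 0.
    Inactive elements keep the old vd value (mask-undisturbed is assumed).
    """
    vlmax = len(vd)
    out = list(vd)
    for i in range(vlmax):
        if mask[i]:
            idx = vs1[i]
            out[i] = vs2[idx] if idx < vlmax else 0
    return out


X = "x"  # "do not care" / undisturbed placeholder

# Element 0 first, i.e. reversed relative to the diagrams in the text.
before = [X] * 8
data = ["A", "B", "C", "D", X, X, X, X]
mask = [0, 1, 0, 1, 0, 1, 0, 1]
index = [0, 0, 0, 1, 0, 2, 0, 3]  # don't-care slots filled with 0

after = masked_vrgather_vv(before, data, index, mask)
print(after)  # ['x', 'A', 'x', 'B', 'x', 'C', 'x', 'D']
```

Read from element 7 down to element 0, the result is D x C x B x A x, matching the "after" row.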
For example, if the mask is available and the destination indices form an increasing sequence, as in this case, we can use viota.m to construct the corresponding source indices from the (source) mask vector, as discussed in the context of "vdecompress".
In other cases, I suspect it will be best to use an indexed store, perhaps preceded by a unit-stride store (for the "undisturbed" elements) and followed by a unit-stride load. If using the memory system is not an option, then in the worst case you might end up constructing mask and index one element at a time, with slides and bit-twiddling; I haven't thought through the details.
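The two strategies can be compared in a scalar model. This is a sketch under assumptions, not RVV code: all function names are hypothetical, the mask is taken as already available, mask-undisturbed behaviour is assumed, and a Python list stands in for memory. Lists put element 0 first.

```python
def viota_m(mask):
    """Scalar model of viota.m: element i = number of set mask bits below i."""
    out, count = [], 0
    for bit in mask:
        out.append(count)
        count += 1 if bit else 0
    return out


def masked_gather(vd, vs2, vs1, mask):
    """Scalar model of masked vrgather.vv (mask-undisturbed assumed)."""
    n = len(vd)
    return [
        (vs2[vs1[i]] if vs1[i] < n else 0) if mask[i] else vd[i]
        for i in range(n)
    ]


def scatter_via_gather(before, data, mask):
    """Increasing destinations only: source indices come from viota.m."""
    return masked_gather(before, data, viota_m(mask), mask)


def scatter_via_memory(before, data, dest):
    """Memory round trip; works for destinations in any order."""
    buf = list(before)                  # unit-stride store of the old elements
    for value, d in zip(data, dest):    # indexed store of the active elements
        buf[d] = value                  # (real indexed stores use byte offsets)
    return list(buf)                    # unit-stride load of the result


X = "x"
before = [X] * 8
data = ["A", "B", "C", "D", X, X, X, X]
mask = [0, 1, 0, 1, 0, 1, 0, 1]         # bit i set where i is a destination

print(viota_m(mask))                    # [0, 0, 1, 1, 2, 2, 3, 3]
print(scatter_via_gather(before, data, mask))
print(scatter_via_memory(before, data, [1, 3, 5, 7]))
print(scatter_via_memory(before, data, [5, 1, 7, 3]))  # non-increasing order
```

For the increasing destinations [1 3 5 7] both variants produce the same vector. Only the memory variant handles the permuted destinations, because viota.m can only produce non-decreasing source indices.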
Hi @nick-knight, I am curious about the performance. Doesn't a unit-stride store followed by a unit-stride load have higher overhead than using vrgather? The unit-stride load/store pair needs to interact with memory twice, which I thought would be a huge cost compared to processing the data in registers.
“Performance” is a consequence of the implementation, not the interface (ISA). This repo only concerns the interface. Nowhere in this repo does it discuss “overhead”, “runtime”, “cycles”, etc. Your question should be directed at the hardware engineers who are implementing the vector processor you are targeting.
If you were on my engineering team, I’d tell you to implement both variants, benchmark them, and report back to me which one was better.
I hope this makes sense!
Thank you. I was wondering whether there is a general guideline for efficient implementation, but as you said, it shouldn't be a topic for this repo.
Hi, I'm reading the rvv-1.0 spec and found that there is a vrgather instruction but no corresponding vrscatter instruction. Is there some other consideration behind this? I'd like to know why. Thanks for your time!