riscv-non-isa / riscv-c-api-doc

Documentation of the RISC-V C API
https://jira.riscv.org/browse/RVG-4
Creative Commons Attribution 4.0 International

vget for fractional register doesn't exist #54

Closed howjmay closed 10 months ago

howjmay commented 10 months ago

Hi, I am developing an application with RVV, and I have a question.

There seem to be no vget intrinsics that take a whole register and return a fractional register. Namely, intrinsics like vint8mf2_t __riscv_vget_v_i8m1_i8mf2(vint8m1_t src, size_t index) don't exist. The same behavior can be achieved with the trunc and slidedown intrinsics, but it would be good to have them.

topperc commented 10 months ago

Isn't the LMUL trunc enough? Why do you need a slidedown?

howjmay commented 10 months ago

trunc is equivalent to vget with index 0. But if I want index 1, I need to slide down first before I can get the elements I want. That takes an unnecessary instruction.
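For readers following along, the combination described here can be sketched roughly as below. `get_i8m1_i8mf2`, `half_len`, and `copy_half` are hypothetical names chosen for illustration, not part of the intrinsics API. The memory-to-memory wrapper and the portable fallback (which assumes VLEN = 128) exist only so the sketch can be compiled and checked on a host without RVV.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>

/* Hypothetical stand-in for the requested vget: returns the index-th
   mf2-sized part of an m1 register. */
static inline vint8mf2_t get_i8m1_i8mf2(vint8m1_t src, size_t index) {
    size_t half = __riscv_vsetvlmax_e8mf2(); /* int8 elements per mf2 */
    if (index != 0) {
        /* Move the wanted elements down to element 0: one vslidedown. */
        src = __riscv_vslidedown_vx_i8m1(src, index * half, half);
    }
    /* Keep the low part as an mf2 value; the trunc itself is only a
       type change. */
    return __riscv_vlmul_trunc_v_i8m1_i8mf2(src);
}

/* Number of int8 elements held by one mf2 register on this machine. */
size_t half_len(void) { return __riscv_vsetvlmax_e8mf2(); }

/* Copies half `index` (0 or 1) of the 2 * half_len() elements at src. */
void copy_half(const int8_t *src, size_t index, int8_t *dst) {
    assert(index < 2);
    size_t half = half_len();
    vint8m1_t whole = __riscv_vle8_v_i8m1(src, 2 * half);
    __riscv_vse8_v_i8mf2(dst, get_i8m1_i8mf2(whole, index), half);
}
#else
/* Portable model of the same behavior, assuming VLEN = 128 so that an
   mf2 register holds 8 int8 elements. */
size_t half_len(void) { return 8; }

void copy_half(const int8_t *src, size_t index, int8_t *dst) {
    assert(index < 2);
    size_t half = half_len();
    for (size_t i = 0; i < half; i++)
        dst[i] = src[index * half + i];
}
#endif
```

With index 0 the helper reduces to the LMUL trunc alone; any other index adds the slidedown.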

nick-knight commented 10 months ago

The trunc intrinsics do not correspond to RVV instructions. Using them should not incur any performance overhead.

There are many other cases in the API where we could introduce new intrinsics to reduce the amount of (C language) typing. For example, we could implement the whole Cartesian product of vreinterpret intrinsics (not just the (block) diagonal). The intrinsics API is already enormous, and we have tried to avoid introducing new intrinsics when they were not necessary for performance.

In this particular case, I'm mildly opposed to the proposal, because it means the proposed vget intrinsics with fractional LMUL output would have a performance cost (due to the vslidedown), unlike the existing vget intrinsics. This might be surprising to programmers.

howjmay commented 10 months ago

My points for this proposal are:

  1. It is odd that integer LMUL has vget intrinsics while fractional LMUL does not.
  2. The extra overhead caused by the slidedown here is surprising.
  3. It is not intuitive, so RVV beginners may take some time to find the combination that achieves this behavior.

Another question: should I move this issue to https://github.com/riscv-non-isa/rvv-intrinsic-doc for further discussion? I am not sure whether I opened it in the wrong repo.

topperc commented 10 months ago

> trunc is equivalent to vget with index 0. But if I want index 1, I need to slide down first before I can get the elements I want. That takes an unnecessary instruction.

A vget for fractional LMUL would have to do the same slidedown in order to move the elements into the lower elements of a register, where later intrinsics expect them.

The vget for whole LMUL just has to access a different register where the elements are already in the right place.
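As a rough illustration of the whole-LMUL case, the sketch below extracts the second m1 register from an m2 group with the existing `__riscv_vget_v_i8m2_i8m1` intrinsic. `reg_len` and `copy_upper_reg` are hypothetical wrapper names, and the non-RVV fallback (assuming VLEN = 128) is there only so the sketch can be checked on any host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>

/* Number of int8 elements held by one m1 register on this machine. */
size_t reg_len(void) { return __riscv_vsetvlmax_e8m1(); }

/* Copies the second m1 register of an m2 group loaded from src. */
void copy_upper_reg(const int8_t *src, int8_t *dst) {
    assert(src != NULL && dst != NULL);
    size_t n = reg_len();
    vint8m2_t group = __riscv_vle8_v_i8m2(src, 2 * n);
    /* The index is a compile-time constant that selects one register of
       the group; the elements already start at element 0 of that
       register, so no slide is needed. */
    vint8m1_t upper = __riscv_vget_v_i8m2_i8m1(group, 1);
    __riscv_vse8_v_i8m1(dst, upper, n);
}
#else
/* Portable model, assuming VLEN = 128 (16 int8 elements per register). */
size_t reg_len(void) { return 16; }

void copy_upper_reg(const int8_t *src, int8_t *dst) {
    assert(src != NULL && dst != NULL);
    size_t n = reg_len();
    for (size_t i = 0; i < n; i++)
        dst[i] = src[n + i];
}
#endif
```

The contrast with the fractional case is that here the result is a different register of the group, whereas half of an m1 register sits in the upper elements of the same register.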

cmuellner commented 10 months ago

The RVV intrinsics repo can be found here: https://github.com/riscv-non-isa/rvv-intrinsic-doc. Feature requests for the RVV intrinsics should be reported there.

Mentioning @eopXD so he is aware of this.

eopXD commented 10 months ago

@howjmay Yes, https://github.com/riscv-non-isa/rvv-intrinsic-doc would be a better place to raise the discussion. I agree with Craig that even if such an intrinsic were introduced, code generation would still need to use vslidedown to fulfill the semantics. I don't think this is a functional incompleteness. On the other hand, I am curious what leads you to go that deep into fractional LMUL.

howjmay commented 10 months ago

I have opened an issue there

howjmay commented 10 months ago

> @howjmay Yes, https://github.com/riscv-non-isa/rvv-intrinsic-doc would be a better place to raise the discussion. I agree with Craig that even if such an intrinsic were introduced, code generation would still need to use vslidedown to fulfill the semantics. I don't think this is a functional incompleteness. On the other hand, I am curious what leads you to go that deep into fractional LMUL.

Hi @eopXD, I am implementing a translator from NEON to RVV (https://github.com/howjmay/neon2rvv/). I am mapping the int8x8_t type in NEON to vint8mf2_t in RVV, but I am still checking the potential efficiency impact of using a fractional register instead of a whole LMUL register. May I ask whether you know of any place that discusses the potential impact?

If I switch to using a whole register, I am curious whether changing the value of vl so often would cause some overhead. I saw a description in the spec saying that changing the value of vl is left to compiler optimization.
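To make the whole-register alternative concrete, a minimal sketch of an 8-lane int8 add (the shape of NEON's vadd_s8) using m1 types with an explicit vl of 8 might look like this. `add_s8x8` is a hypothetical name and is not taken from neon2rvv. The sketch assumes m1 can hold at least 8 int8 elements, and the fallback branch is only there so it can be checked without RVV. It does not measure the overhead in question: the intrinsics take vl as an argument, and the placement of the vsetvli instructions is left to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_vector)
#include <riscv_vector.h>
#endif

/* Hypothetical 8-lane int8 add on data in memory. */
void add_s8x8(const int8_t *a, const int8_t *b, int8_t *out) {
    const size_t vl = 8; /* int8x8_t has 8 lanes */
#if defined(__riscv_vector)
    /* Whole-register (m1) types; only the first vl elements are used. */
    vint8m1_t va = __riscv_vle8_v_i8m1(a, vl);
    vint8m1_t vb = __riscv_vle8_v_i8m1(b, vl);
    vint8m1_t sum = __riscv_vadd_vv_i8m1(va, vb, vl);
    __riscv_vse8_v_i8m1(out, sum, vl);
#else
    /* Portable model: wrapping 8-bit addition, lane by lane. */
    for (size_t i = 0; i < vl; i++)
        out[i] = (int8_t)(uint8_t)((uint8_t)a[i] + (uint8_t)b[i]);
#endif
}
```

With VLEN = 128, vint8mf2_t holds exactly 8 int8 elements, so the fractional mapping could use VLMAX where this version passes a fixed vl of 8.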

cmuellner commented 10 months ago

Since the discussion moved to the right place, I'm closing this ticket.