Fractional LMUL - Githubissues

ebahapo commented 4 years ago

We have made good progress, but I'm afraid that the release 0.9 of the V spec is coming down fast and methinks that the most radical change that it introduces is the new values of LMUL.

Please, share your thoughts about it here.

kito-cheng commented 4 years ago

Proposal for type system and API:

Vector Types for Fractional LMUL:

v{TYPE}{SEW}m{LMUL}_t

Type = vint | vuint | vfloat
SEW = 8 | 16 | 32 | 64
LMUL = f8 | f4 | f2 | 1 | 2 | 4 | 8
e.g.
- vint32mf2_t for LMUL=1/2 SEW=32

Changes:

Add f2, f4 and f8 to LMUL to reflect type system change.
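
As a rough illustration of the naming scheme above, the following C sketch formats a type name from its parts. The helper `rvv_type_name` and its fraction encoding of LMUL are hypothetical and are not part of the proposal or of any toolchain.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a name following v{TYPE}{SEW}m{LMUL}_t.
 * base is "vint", "vuint" or "vfloat"; LMUL is lmul_num/lmul_den,
 * so LMUL=1/2 is (1, 2) and LMUL=4 is (4, 1). */
static int rvv_type_name(char *buf, size_t n, const char *base,
                         unsigned sew, unsigned lmul_num, unsigned lmul_den)
{
    assert(strcmp(base, "vint") == 0 || strcmp(base, "vuint") == 0 ||
           strcmp(base, "vfloat") == 0);
    /* At most one of numerator/denominator may differ from 1. */
    assert(lmul_num == 1 || lmul_den == 1);

    if (lmul_den > 1) /* fractional LMUL: mf2, mf4, mf8 */
        return snprintf(buf, n, "%s%umf%u_t", base, sew, lmul_den);
    return snprintf(buf, n, "%s%um%u_t", base, sew, lmul_num);
}
```

Calling `rvv_type_name(buf, sizeof buf, "vint", 32, 1, 2)` yields the `vint32mf2_t` example given above.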

Vector Tuple Types for Fractional LMUL:

v{TYPE}{SEW}m{LMUL}x{NF}_t

Type = vint | vuint | vfloat
SEW = 8 | 16 | 32 | 64
LMUL = f8 | f4 | f2 | 1 | 2 | 4 | 8
NF = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
LMUL x NF ≤ 8
- constrained by HW.
e.g.
- Add f2, f4 and f8 to LMUL to reflect type system change.
- Update constraint, because a type with LMUL < 1 still occupies 1 vector register.

Changes:

Add mf[2|4|8] to LMUL to reflect type system change.
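
The register-budget constraint on tuple types can be sketched as a small check. This is only a sketch: `rvv_tuple_ok` is a hypothetical name, and it assumes the bound is inclusive (the v-spec states the segment rule as EMUL * NFIELDS ≤ 8) and that a fractional LMUL counts as one whole register per field, as the note above says.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check for v{TYPE}{SEW}m{LMUL}x{NF}_t.
 * LMUL is lmul_num/lmul_den, e.g. (1, 2) for mf2 and (4, 1) for m4. */
static bool rvv_tuple_ok(unsigned lmul_num, unsigned lmul_den, unsigned nf)
{
    assert(lmul_num >= 1 && lmul_den >= 1);
    assert(lmul_num == 1 || lmul_den == 1);

    if (nf < 1 || nf > 8)
        return false;

    /* Assumption: LMUL < 1 still occupies one vector register per field. */
    unsigned regs_per_field = (lmul_den > 1) ? 1 : lmul_num;
    return regs_per_field * nf <= 8;
}
```

Under these assumptions every fractional LMUL admits NF up to 8, while m4 admits at most NF=2.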

Changes to Intrinsic API Naming Rules:

INTRINSIC ::= MNEMONIC '_' RET_TYPE
MNEMONIC ::= Instruction name in v-ext specification. Replace '.' with '_'.
RET_TYPE ::= SEW LMUL
SEW ::= ( i8 | i16 | i32 | i64 | u8 | u16 | u32 | u64 | f16 | f32 | f64 )
LMUL ::= ( mf8 | mf4 | mf2 | m1 | m2 | m4 | m8 )

Changes:

Add mf[2|4|8] to LMUL to reflect type system change.
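
The grammar above can be exercised with a short C sketch that builds an intrinsic name from an instruction mnemonic and a return-type suffix. The helper `rvv_intrinsic_name` is hypothetical and exists only to illustrate the rule; it is not an API of any compiler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: INTRINSIC ::= MNEMONIC '_' SEW LMUL,
 * with every '.' in the mnemonic replaced by '_'.
 * sew is e.g. "i32"; lmul is e.g. "mf2". */
static void rvv_intrinsic_name(char *buf, size_t n, const char *mnemonic,
                               const char *sew, const char *lmul)
{
    size_t len = strlen(mnemonic);
    /* mnemonic + '_' + sew + lmul + terminating NUL must fit. */
    assert(len + strlen(sew) + strlen(lmul) + 2 <= n);

    for (size_t i = 0; i < len; i++)
        buf[i] = (mnemonic[i] == '.') ? '_' : mnemonic[i];
    snprintf(buf + len, n - len, "_%s%s", sew, lmul);
}
```

For the mnemonic `vadd.vv` with suffix parts `i32` and `mf2`, the rule produces `vadd_vv_i32mf2`.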

Issue for Fractional LMUL

Unlike integer (non-fractional) LMUL, some fractional LMUL configurations will raise an illegal instruction exception on certain HW configurations.
- 3.3.2. Vector Register Grouping (vlmul[2:0]) from v-spec "Implementations must support fractional LMUL settings for LMUL ≥ SEW/ELEN, for the ELEN value at LMUL=1, which ensures there is space to store at least one element. An attempt to set an unsupported SEW and LMUL configuration sets the vill bit in vtype."
- e.g. vint64mf4(SEW=64, LMUL=1/4) not supported on HW with ELEN=64, VLEN=128
- According to the spec, HW with ELEN=64, VLEN=256 might not support vint64mf4 (SEW=64, LMUL=1/4): one element would fit, but LMUL=1/4 is below the guaranteed minimum of SEW/ELEN=1.
- Possible solutions:
  - Add a compiler option to assume the minimum VLEN on the target machine.
    - e.g. -mmin-vlen=256
  - Add a compiler option to assume the minimum SEW on the target machine.
    - e.g. -mmin-sew=128
  - Add a compiler option to enable certain fractional LMUL types.
    - e.g. -mflmul=all, -mflmul=no, -mflmul=64mf8, -mflmul=64mf4
  - Add the minimum VLEN requirement, or the list of fractional LMUL types used, to the ELF attributes.
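
The two conditions discussed in this section can be written down as a pair of predicates. This is a minimal sketch with hypothetical function names: `flmul_guaranteed` encodes the quoted rule LMUL ≥ SEW/ELEN, and `flmul_fits` only checks that one SEW-bit element fits in VLEN * LMUL bits, which is necessary for support but does not oblige an implementation to provide it.

```c
#include <assert.h>
#include <stdbool.h>

/* Fractional LMUL is 1/lmul_den with lmul_den in {2, 4, 8}. */

/* True when the quoted v-spec rule guarantees support:
 * LMUL >= SEW/ELEN, i.e. ELEN >= SEW * lmul_den. */
static bool flmul_guaranteed(unsigned sew, unsigned elen, unsigned lmul_den)
{
    assert(lmul_den == 2 || lmul_den == 4 || lmul_den == 8);
    return elen >= sew * lmul_den;
}

/* True when at least one SEW-bit element fits in VLEN * LMUL bits. */
static bool flmul_fits(unsigned sew, unsigned vlen, unsigned lmul_den)
{
    assert(lmul_den == 2 || lmul_den == 4 || lmul_den == 8);
    return vlen / lmul_den >= sew;
}
```

For vint64mf4 with ELEN=64, `flmul_fits` is false at VLEN=128 and true at VLEN=256, while `flmul_guaranteed` is false in both cases, which matches the two examples above.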

rdolbeau commented 4 years ago

Seems OK to me; I like the idea of extending ELF for such requirements. Might be generally useful for extensions in general (i.e. have ELF attribute for V, some properties of V, but also B, ...).

Hsiangkai commented 4 years ago

It looks good to me.

David-Horner commented 4 years ago

@kito-cheng What is NF? it is not immediately apparent from

Fractional LMUL is not the only disruptive change.

LMUL no longer stripes vertically; SLEN determines a horizontal interleave instead. As a result:

- the poor man's 128 shuffle when SLEN=64 no longer works.
- if SLEN=1/2 * VLEN then all even elements are clustered (stored consecutively) in the low bytes of each physical register and all odd elements are clustered in the upper half of the register.
  - thus if vl < VLMAX * LMUL there is a gap (tail) in the middle as well as at the end of the (last) physical register.
  - if SLEN=1/4 * VLEN the interleave is by 4, with clustering of modulo-4 elements, and if vl < VLMAX * LMUL there are 4 gaps: one at the end and 3 in the middle.

What gets shuffled is no longer just the element order, and no longer only when LMUL>1. Even at LMUL=1, a different element length changes the element content.

The element length and VLEN/SLEN determine the alignment structure. Thus if VLEN/SLEN > 1, the component bytes of elements are no longer in in-memory order. Load VLMAX bytes into a register, and the half-words read from the register will have every other byte from memory in their upper and lower halves. The same story holds for words: none of their bytes will come from consecutive locations in memory.

The good and the bad of this is that most initial implementations are expected to have VLEN=SLEN. Those that do have SLEN<VLEN may well jump to SLEN=1/4 VLEN or 1/8 VLEN, as it is expected that only the higher-performance, larger-VLEN designs will need to limit SLEN due to wiring issues. So SLEN=1/2 VLEN, which is a nice match for register pair processing (e.g. complex numbers), is going to be rare.

But all the code needs to accommodate the in-register format not matching in-memory. There are suggestions on how to mitigate this in hardware. These intrinsics should be prepared to differentiate between in-memory-order agnostic and reliant structures/operations. The good news is that most operations are in-memory-order agnostic. e.g. all single width arithmetic. Even most mixed width operations are not going to care. But any sub-element component manipulation will need to be aware and careful.

Finally, I have noticed discussions about matching of masks under a given element length and LMUL with another element length or LMUL. Given that it is a definite concern and apparently at least moderately frequent in real code situations, you should know of a proposal for mask support that is ordinal based. Regardless of Element Length or LMUL the nth mask bit applies to the nth vector element. In all cases, a single bit is used to store the mask value. The issue is #448 in the riscv/riscv-v-spec github.
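
To make the ordinal idea concrete, here is a minimal C model in which mask bit n always belongs to element n, with no dependence on element length or LMUL. The function names and the packing of bits into bytes (bit n lives in byte n/8, bit position n%8) are assumptions made for illustration, not something taken from the proposal text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Ordinal mask model: one bit per element, indexed by element number only.
 * Assumed packing: element n -> byte n/8, bit n%8. */
static bool mask_get(const uint8_t *mask, size_t n)
{
    assert(mask != NULL);
    return ((mask[n / 8] >> (n % 8)) & 1u) != 0;
}

static void mask_set(uint8_t *mask, size_t n, bool value)
{
    assert(mask != NULL);
    if (value)
        mask[n / 8] |= (uint8_t)(1u << (n % 8));
    else
        mask[n / 8] &= (uint8_t)~(1u << (n % 8));
}
```

Because the index is the element number alone, a mask produced under one SEW/LMUL setting addresses the same elements when consumed under another, which is the matching property the discussions mentioned above are after.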

David-Horner commented 4 years ago

Well, not so finally apparently.

Another thing to mention:

Because LMUL no longer does vertical striping, but horizontal interleave, each physical register has the same characteristics. Physical registers are filled consecutively. This means register grouping by powers of 2 is no longer a constraint, so LMUL can take on all values between 1 and 8. This is good for intrinsics that can use a value of, say, 6, freeing up a register pair for two mask registers or a further m2 variable. A second vsetvl[i] instruction with a limiting AVL is (currently) necessary, but as mentioned elsewhere in these comments it can be a low-cost operation, and the tradeoff is definitely worth it in some scenarios. (I will also be proposing an option to set LMUL to 3, 5, 6 or 7, based on ideas in riscv/riscv-v-spec GitHub issue #418, although that targeted the 0.8 structure.)

kito-cheng commented 4 years ago

@kito-cheng What is NF? it is not immediately apparent from

NF means NFIELDS, the term used by the segment load/store instructions; vector tuple types are used by the segment load/store intrinsic API.

You can see this issue for more detail: https://github.com/sifive/rvv-intrinsic-doc/issues/11

eopXD commented 2 years ago

Fractional LMUL is now defined and implemented in RVV intrinsic. Closing this issue.

riscv-non-isa / rvv-intrinsic-doc

Fractional LMUL #15
