riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Further future proofing: Novel encoding of extended LMUL values 3,5,6 and 7 #460

Open David-Horner opened 4 years ago

David-Horner commented 4 years ago

A1.

Although 3, 5, 6 and 7 physical registers can make up a logical register group (on 8-register boundaries or, in the case of 3, on a 4-register boundary), there is no direct mechanism to establish the vl value to effect these multiplier values.

A2.

Currently to support these logical register groups, software needs to emulate them by each loop

Proposal:

Provide an additional field, valt, in vtype (allocated from bits 30 to 12). Use the top 3 bits of rd to encode the same bits in valt. This provides 3 more state bits for "altmul" and future functionality. If the least-significant 2 bits of rs1 do not match the least-significant 2 bits of rd, then valt is set to zero. If rs1 = rd, then valt is set to zero. If the top 3 bits of rs1 provide a reserved value, vill is set. Otherwise valt is set to the top 3 bits of rd. If the code is zero, the LMUL behaviour is as described in #458.

(table) valt code given rs1 and rd values

| upper 3 bits of rd | potential rd register designates | two low bits of rs1 compared to two low bits of rd | valt code |
|---|---|---|---|
| xxx | any | different | 000 |
| xxx | rs1 = rd | same (by definition) | 000 |
| 000 | 0,1,2,3 | either same or different | 000 |
| 001 | 4,5,6,7 | same | 001 ("altmul") |
| 010 | 8,9,10,11 | same | reserved |
| 011 | 12,13,14,15 | same | reserved |
| 100 | 16,17,18,19 | same | reserved |
| 101 | 20,21,22,23 | same | reserved |
| 110 | 24,25,26,27 | same | reserved |
| 111 | 28,29,30,31 | same | reserved |

Reserved values set vill (illegal vtype configuration attempted).
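
The decode rules in the table can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name `valt_code` is hypothetical, `rd` and `rs1` are the 5-bit x-register numbers taken from the vsetvli encoding, and the reserved check follows the table (upper 3 bits of rd). Since the proposal does not say what valt holds when vill is set, the sketch returns `None` in that case.

```python
def valt_code(rd, rs1):
    """Return (valt, vill) for 5-bit register numbers rd and rs1.

    Hypothetical helper that follows the proposal's table; it is not
    part of any ratified specification.
    """
    if not (0 <= rd <= 31 and 0 <= rs1 <= 31):
        raise ValueError("register numbers must be in 0..31")
    # Low two bits differ: valt is zero regardless of the upper bits of rd.
    if (rd & 0b11) != (rs1 & 0b11):
        return 0b000, False
    # Same register for rd and rs1: valt is zero.
    if rd == rs1:
        return 0b000, False
    upper = rd >> 2  # bits 4..2 of the register number
    if upper == 0b000:
        return 0b000, False
    if upper == 0b001:
        return 0b001, False  # "altmul"
    # Every other upper-bit pattern is reserved and sets vill.
    return None, True
```

For example, `valt_code(4, 8)` selects "altmul" because both registers end in `00` and rd lies in x4..x7, while `valt_code(5, 8)` yields zero because the low bits differ.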

If the code in valt is 001, "altmul", then vlmul has these modified meanings:

(table) new LMUL value when valt is 001 ("altmul")

| 3 upper rs1 bits | original LMUL | new LMUL value for vl calculation | value written to vlmul | notes |
|---|---|---|---|---|
| 000 | reserved | reserved | | |
| 001 | 1/8 | 1/8 | 001 | (rs1=rd rule) |
| 010 | 1/4 | 1/4 | 010 | |
| 011 | 1/2 | 1/2 | 011 | |
| 100 | 1 | 3 | 110 (4) | |
| 101 | 2 | 5 | 111 (8) | |
| 110 | 4 | 6 | 111 (8) | |
| 111 | 8 | 7 | 111 (8) | |

As before (and by definition) attempted use of reserved values sets vill. (Note: the 1/8 LMUL result is a consequence of the rs1=rd rule. 1/2 and 1/4 are the only codes that could be used for future expansion, so simplifying the decode appeared to be a reasonable trade-off. This would be the same result if the choice for "altmul" valt code was instead 010 or 011.)
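
As a rough illustration, the "altmul" mapping can be written as a lookup table. The names `ALTMUL_TABLE` and `altmul_decode` are hypothetical, and the vlmul bit patterns are copied directly from the table in this proposal rather than from any ratified encoding.

```python
from fractions import Fraction

# upper 3 bits of rs1 -> (LMUL used for the vl calculation, value written to vlmul)
ALTMUL_TABLE = {
    0b001: (Fraction(1, 8), 0b001),
    0b010: (Fraction(1, 4), 0b010),
    0b011: (Fraction(1, 2), 0b011),
    0b100: (Fraction(3), 0b110),  # container group of 4
    0b101: (Fraction(5), 0b111),  # container group of 8
    0b110: (Fraction(6), 0b111),
    0b111: (Fraction(7), 0b111),
}

def altmul_decode(rs1_upper):
    """Return (lmul, vlmul, vill) for the upper 3 bits of rs1 when valt is 001."""
    if not 0 <= rs1_upper <= 7:
        raise ValueError("rs1_upper must be a 3-bit value")
    entry = ALTMUL_TABLE.get(rs1_upper)
    if entry is None:
        # 000 is reserved: vill is set and no LMUL is defined.
        return None, None, True
    lmul, vlmul = entry
    return lmul, vlmul, False
```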

The value of vl is calculated according to the new LMUL value. That is, vl = VLEN * LMUL / SEW, for all, including new non-powers of 2 values, of LMUL.
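
A small sketch of the formula, treating its result as the largest vl the configuration permits. The function name `vl_limit` and the example values VLEN = 128 and VLEN = 256 are assumptions chosen for illustration only.

```python
import math
from fractions import Fraction

def vl_limit(vlen, sew, lmul):
    """VLEN * LMUL / SEW, rounded down to a whole number of elements."""
    return math.floor(Fraction(vlen) * Fraction(lmul) / sew)

# Non-power-of-2 LMUL values use the same formula as the existing ones.
assert vl_limit(128, 32, 3) == 12   # three 128-bit registers of 32-bit elements
assert vl_limit(128, 8, 7) == 112   # seven registers of 8-bit elements
assert vl_limit(256, 64, 5) == 20
```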

LMUL of 3 must be aligned on a register group of 4. LMULs 5, 6 and 7 must be aligned on a register group of 8. The new value of 4 or 8 written to vlmul assures this. Once vl is calculated, the "actual" value of LMUL is no longer retained. As in the fully software case, it is sufficient that vl is constrained to a value that does not exceed the desired number of physical registers.
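
The alignment rule and the claim that vl alone bounds the registers touched can be checked with a short sketch. All three helper names are hypothetical, and VLEN = 128 is again just an example value.

```python
def group_size(lmul):
    """Smallest power-of-two register group (1, 2, 4 or 8) holding lmul registers."""
    if lmul not in range(1, 9):
        raise ValueError("integer LMUL must be in 1..8")
    return 1 << (lmul - 1).bit_length()

def is_aligned(vreg, lmul):
    """True if vector register number vreg may start a group for this LMUL."""
    return vreg % group_size(lmul) == 0

def registers_touched(vl, sew, vlen):
    """Number of physical registers covered by vl elements of width sew."""
    return -(-(vl * sew) // vlen)  # ceiling division

assert group_size(3) == 4 and group_size(5) == 8
assert is_aligned(4, 3) and not is_aligned(4, 6)
# With VLEN = 128 and SEW = 32, vl = 12 never reaches a fourth register.
assert registers_touched(12, 32, 128) == 3
```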

A4.

Whether to use rd for vtype encoding must be decided before the v1.0 release, so this proposal is timely. Zero is intended to be the default value for valt and the (statically) most frequently used code. Therefore the encoding is heavily weighted towards zeroing valt. For a given value of rs1, there is only one register selection that provides a non-zero value. Encodings for non-zero codes are intentionally restricted in this way to allow the most flexibility.

Setting valt to zero when rs1 = rd allows the default, low-functionality setting to be generated when RVI register pressure is so great that rd shares the same register as rs1. Setting valt to zero when register x0 is the destination for rd allows special-casing the x0 encoding. This is a common scenario and provisioning the option is valuable, even if not immediately implemented.

There is a substantial amount of reserved state in valt, so future expansion may be able to use these, rather than bits from the immediate field. If encoding in rs1 is acceptable, then there is very little reason to reject this similar encoding within rd.
This encoding is a reasonable tradeoff of RVI complexity for RVV benefit. See #458 for further reasoning.

More bits might be difficult to justify for vlmul for potentially low use LMUL values of 3,5,6 and 7, but this is mitigated by the similar use of rd as used for rs1.

A5.

If you didn't like encoding within rs1 then even more so encoding in rd is problematic, given the reasons rd was allocated a fixed location in RVI. However, S and B formats use the field for immediate values and RVC does not honour rd either.

Is there a legitimate need to define such a complex encoding?

Aside from the encoding, some might think there is no need for LMUL of 3,5,6 and 7. Instead, software deriving them by using restricted (fractional, 3/8, 5/8 etc.) AVL values is sufficient.
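
The software-only alternative can be sketched as clamping AVL before it is handed to vsetvli. This is a minimal illustration, assuming the requested AVL is capped at the capacity of the wanted number of registers while a power-of-two container LMUL is configured; `emulated_vl` and its parameters are hypothetical names. It relies on vsetvli returning vl = AVL whenever AVL does not exceed VLMAX.

```python
def emulated_vl(avl, vlen, sew, wanted_regs, container_lmul):
    """vl obtained by clamping AVL to wanted_regs registers' worth of elements.

    container_lmul is the power-of-two LMUL actually configured (4 or 8).
    """
    if wanted_regs > container_lmul:
        raise ValueError("wanted_regs must fit inside the container group")
    cap = vlen * wanted_regs // sew  # e.g. 3/8 of VLMAX when the container is 8
    return min(avl, cap)

# VLEN = 128, SEW = 32: emulate LMUL = 3 inside an LMUL = 4 group.
assert emulated_vl(100, 128, 32, 3, 4) == 12
assert emulated_vl(5, 128, 32, 3, 4) == 5
```

The clamp has to be repeated wherever AVL is recomputed, which is the per-loop overhead the proposal aims to remove.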

As always, any extension requires extra work in standardization and validation/verification.

The field name valt is vague and too close to vault to be taken seriously. Double for the sub-code "altmul".


Krste asked for my proposals to be in this (implicit) answer format: Q1. What is the problem? Q2. What's wrong with what's there now? Q3. What is the solution you're proposing? Q4. How is it better? Q5. How is it worse? So if these posts appear in a stilted, awkward, contrived format for the particular subject material, please forgive my deficiencies in working within this framework.

Further, as is often the case, problems can be associated with related issues in a way that the consideration of both (all) is optimal for the resolution of either (any). I believe this is such a situation, and how well it lends itself (themselves) to this format is also debatable. However I made an attempt to present both main issues in a coherent manner.

David-Horner commented 4 years ago

oops.

The input to the algorithm that produces the effective LMUL and actual vlmul values is the 3 upper rs1 bits, not the vlmul bits. I'm going back to correct the text. However, the general idea remains the same. (The vlmul values are the default: what would have been there if valt were zero.)

David-Horner commented 4 years ago

see #458 where Krste states:

I don't like the idea of limiting rs1 register names now to leave space for more bits later. As long as one or more immediate fields in vsetvli have a reserved encoding, we can always reclaim rs1 bits later by declaring the reserved encoding to modify the instruction format

The same arguments (and likely aversion) are applicable to the "rd encoding" that extends the immediate values.

The difference in this case is that the bit enabling the effective LMUL values of 3, 5, 6 and 7 is a transient bit. It does not need to be retained in vtype, and should not be retained any more than the original AVL value in rs1 needs to be retained.

However, if we are willing to use an additional bit from the vsetvli immediate field, this proposal reduces to the valt encoding being just that bit.

Note: The bit allocated from the immediate field will displace other uses for the bit. The proposal, as modified, does not provide a reserved code (even in conjunction with vlmul) and thus cannot be used as an escape to an "rs1/rd encoding" format.

This revised proposal is consistent with the note in the original posting:

Note: This issue is a proposal that includes two sub-proposals:

1) Expanding LMUL values, and
2) Encoding additional vtype values in the rd field of vsetvli.

I believed the choices for each affect the weightings chosen for the other, and so I conflate them here. There is less value in rd encoding if there is no immediate use. They could be separated if there is significant challenge to either proposal.

I am certainly willing to open a new issue to reflect just the "Expanding LMUL values" if the consensus is that this would be helpful for consideration of the proposal.

David-Horner commented 4 years ago

Although technically not a part of this proposal, the decision to defer until after V1.0 would greatly impact proposal #460.

As we are down to one vsetvli immediate bit, and all used bits consume all combinations as valid, we are at the cusp where we should consider the impacts of expansion.

Expansion requires providing at least one of the following:

1) a vsetvli with a larger vsetvli encoding (one bit can be immediately recovered if we allocate vsetvl elsewhere), allowing full configuration within a single instruction. Opcode space is constrained, so the expansion would be at best a few bits.
2) variants of the vsetvli instruction that window successively smaller immediate values into vtype, necessitating splitting full configuration over two (or more) instructions.
3) variants of vsetvli that repurpose rd and/or rs1 as immediate fields, with or without calculating vl, necessitating splitting full configuration over two (or more) instructions.
4) multiple vsetvli-like instructions (this is the mechanism required for custom RVV to change custom vtype bits), necessitating splitting full configuration over two (or more) instructions, unless it introduces a larger-footprint vsetvli-like instruction as in option 1.
5) a revision of vsetvli that encodes vtype bits into the rd and/or rs1 fields, allowing full configuration within a single instruction. See #460.

Splitting configuration over two immediate format instructions is problematic. It will in some cases cause transient vill state necessitating a mechanism to allow such transition, maintaining at least partial state from prior vsetvli variants (likely relaxing the all bits cleared but vill constraint), or a software fall back to vsetvl method.

Option 1 must be decided before V1.0 (otherwise it becomes an option 4 with two overlapping instructions).

Options 2 through 5 can be deferred until after V1.0 but each have impacts to

Options 2 through 4 have the additional concerns of:

Deferred options 2,3 and 5 need not trap on V1.0 machines as vill would be set if the remaining bit is used to differentiate.

Deferred option 4 will always trap on V1.0 machines.

Option 5 is a novel approach to the problem. If it is deferred until post V1.0