Further future proofing: Novel encoding of extended LMUL values 3,5,6 and 7

David-Horner commented 4 years ago

A1.

Although 3, 5, 6 and 7 physical registers can make up a logical register group (on 8-register boundaries, and in the case of 3 on a 4-register boundary), there is no direct mechanism to establish the vl value to effect these multiplier values.

A2.

Currently, to support these logical register groups, software needs to emulate them in each loop by either

- feeding a reduced vl from one vsetvli instruction as the AVL into a second vsetvl[i] instruction per loop, or alternatively,
- hoisting the first vsetvli out of the loop, and using a single vsetvli plus additional registers and conditional logic to simulate the vl calculation for it.
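The effect of this emulation can be sketched in a few lines of Python. This is only a rough model, not an instruction sequence: it assumes the simplest vsetvli rule, vl = min(AVL, VLMAX), and folds the "reduced vl" step into a scalar `min`. The names `vsetvli`, `emulated_vl`, `strip_mine` and the VLEN value are hypothetical and chosen for illustration.

```python
VLEN = 128  # assumed vector register width in bits (illustrative only)


def vsetvli(avl, sew, lmul):
    """Simplified model of vsetvli: vl = min(AVL, VLMAX)."""
    vlmax = VLEN * lmul // sew
    return min(avl, vlmax)


def emulated_vl(avl, sew, regs, group):
    """Emulate a `regs`-register group inside a power-of-2 `group`."""
    cap = VLEN * regs // sew      # elements that fit in `regs` registers
    reduced_avl = min(avl, cap)   # the extra per-iteration scalar work
    return vsetvli(reduced_avl, sew, group)


def strip_mine(n, sew, regs, group):
    """Return the vl chosen on each loop iteration for n elements."""
    chunks = []
    while n > 0:
        vl = emulated_vl(n, sew, regs, group)
        chunks.append(vl)
        n -= vl
    return chunks
```

With these assumed parameters, `strip_mine(30, 32, 3, 4)` processes 30 elements of SEW=32 using at most 3 of the 4 registers in an LMUL=4 group on every iteration.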
A3.

Note: This issue is a proposal that includes two sub-proposals:

- expanding LMUL values, and
- encoding additional vtype values in the rd field of vsetvli.

I believe the choices for each affect the weightings chosen for the other, and so I conflate them here. There is less value in rd encoding if there is no immediate use. They could be separated if there is significant challenge to either proposal.

Proposal:

Provide an additional field, valt, in vtype (allocated from bits 30 to 12). Use the top 3 bits of rd to encode the same bits in valt. This provides 3 more state bits for "altmul" and future functionality. If the low 2 bits of rs1 do not match the low 2 bits of rd, then valt is set to zero. If rs1 = rd, then valt is set to zero. If the top 3 bits of rs1 provide a reserved value, vill is set. Otherwise valt is set to the top 3 bits of rd. If the code is zero, the LMUL behaviour is as described in #458.

(table) valt code given rs1 and rd values

| upper 3 bits of rd | potential rd register designations | two low bits of rs1 compared to two low bits of rd | valt code |
|---|---|---|---|
| xxx | any | different | 000 |
| xxx | rs1 = rd | same (by definition) | 000 |
| 000 | 0,1,2,3 | either same or different | 000 |
| 001 | 4,5,6,7 | same | 001 ("altmul") |
| 010 | 8,9,10,11 | same | reserved |
| 011 | 12,13,14,15 | same | reserved |
| 100 | 16,17,18,19 | same | reserved |
| 101 | 20,21,22,23 | same | reserved |
| 110 | 24,25,26,27 | same | reserved |
| 111 | 28,29,30,31 | same | reserved |

Reserved values set vill (illegal vtype configuration attempted).
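The table above can be read as a small decode function. The following Python sketch is one interpretation of those rows, not a definitive implementation. The function name `decode_valt` is hypothetical, and returning valt as 0 whenever vill is set is an assumption based on the existing convention that the other vtype bits are cleared when vill is set.

```python
def decode_valt(rs1, rd):
    """Return (valt, vill) for 5-bit register numbers rs1 and rd.

    Follows the table rows in order: rs1 == rd, mismatching low bits,
    then the upper 3 bits of rd.
    """
    if not (0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("register numbers must be in 0..31")
    if rs1 == rd:
        return (0, False)             # rs1 = rd rule
    if (rs1 & 0b11) != (rd & 0b11):
        return (0, False)             # low 2 bits differ
    code = rd >> 2                    # upper 3 bits of rd
    if code == 0b000:
        return (0, False)             # x0..x3 always give valt = 0
    if code == 0b001:
        return (0b001, False)         # "altmul"
    return (0, True)                  # reserved code: vill is set
```

One consequence of the rs1 = rd rule shows up directly in this model: when rs1 is in x4..x7, the only rd in x4..x7 with matching low bits is rs1 itself, so "altmul" can never be selected for those rs1 values.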

If the code in valt is 001, "altmul", then vlmul has these modified meanings:

(table) new LMUL value when valt is 001("altmul")

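| 3 upper rs1 bits | original LMUL | new LMUL value for vl calculation | value written to vlmul | notes |
|---|---|---|---|---|
| 000 | reserved | reserved | | |
| 001 | 1/8 | 1/8 | 001 | (rs1=rd rule) |
| 010 | 1/4 | 1/4 | 010 | |
| 011 | 1/2 | 1/2 | 011 | |
| 100 | 1 | 3 | 110 (4) | |
| 101 | 2 | 5 | 111 (8) | |
| 110 | 4 | 6 | 111 (8) | |
| 111 | 8 | 7 | 111 (8) | |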

As before (and by definition) attempted use of reserved values sets vill. (Note: the 1/8 LMUL result is a consequence of the rs1=rd rule. 1/2 and 1/4 are the only codes that could be used for future expansion, so simplifying the decode appeared to be a reasonable trade-off. This would be the same result if the choice for "altmul" valt code was instead 010 or 011.)
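For readers who prefer code to tables, the "altmul" mapping can be written as a lookup keyed on the upper 3 bits of rs1. This is a sketch that simply mirrors the table rows. The names `ALTMUL` and `altmul_lookup` are hypothetical, and returning `None` for the reserved code stands in for "vill is set".

```python
from fractions import Fraction

# upper 3 bits of rs1 -> (LMUL used for the vl calculation, value written to vlmul)
ALTMUL = {
    0b001: (Fraction(1, 8), 0b001),
    0b010: (Fraction(1, 4), 0b010),
    0b011: (Fraction(1, 2), 0b011),
    0b100: (Fraction(3), 0b110),   # group of 4 registers
    0b101: (Fraction(5), 0b111),   # group of 8 registers
    0b110: (Fraction(6), 0b111),   # group of 8 registers
    0b111: (Fraction(7), 0b111),   # group of 8 registers
}


def altmul_lookup(rs1):
    """Return (effective LMUL, vlmul) for register number rs1 when valt is
    001, or None for the reserved code (which would set vill)."""
    return ALTMUL.get(rs1 >> 2)
```

For example, under this reading `altmul_lookup(16)` selects an effective LMUL of 3 while the vlmul field records the 4-register group.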

The value of vl is calculated according to the new LMUL value. That is, vl is bounded by VLMAX = VLEN * LMUL / SEW for all values of LMUL, including the new non-power-of-2 values.

An LMUL of 3 must be aligned on a register group of 4. LMULs of 5, 6 and 7 must be aligned on a register group of 8. The new value of 4 or 8 written to vlmul assures this. Once vl is calculated, the "actual" value of LMUL is no longer retained. As in the fully software case, it is sufficient that vl is constrained to a value that does not exceed the desired number of physical registers.
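A short arithmetic check may help here. The sketch below assumes VLMAX = VLEN * LMUL / SEW and counts how many physical registers a given vl touches. The helper names and the example VLEN/SEW values are illustrative assumptions, not part of the proposal.

```python
# Register group size recorded in vlmul for each new effective LMUL.
GROUP_FOR = {3: 4, 5: 8, 6: 8, 7: 8}


def vlmax(vlen, sew, lmul):
    """Largest vl for an integer effective LMUL."""
    return vlen * lmul // sew


def registers_touched(vl, vlen, sew):
    """Number of physical registers needed to hold vl elements (ceiling)."""
    return -(-(vl * sew) // vlen)


def fits_in_group(lmul, vlen, sew):
    """True if capping vl at VLMAX keeps the access within the group."""
    used = registers_touched(vlmax(vlen, sew, lmul), vlen, sew)
    return used == lmul and used <= GROUP_FOR[lmul]
```

With VLEN=128 and SEW=32, an effective LMUL of 3 gives a VLMAX of 12 elements, which occupy exactly 3 of the 4 registers in the group. Nothing beyond vl needs to remember that the multiplier was 3.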

A4.

Using rd for vtype encoding must be decided before the v1.0 release, so this proposal is timely. Zero for valt is intended to be the default value and the (statically) most used code. Therefore the encoding is heavily weighted towards zeroing valt. For a given value of rs1, there is only one register selection that provides a non-zero value. Encodings for non-zero codes are intentionally restricted in this way to allow the most flexibility

- in allowing any RVI register such encoding in rd, and
- in leveraging the flexibility already granted rs1 in #458 to relax the constraints on selecting rd.

Setting valt to zero when rs1 = rd allows the default low-functionality setting to be generated when RVI register pressure is so great that rd shares the same register as rs1. Setting valt to zero when register x0 is the destination for rd allows special-casing the x0 encoding. This is a common scenario and provisioning the option is valuable, even if not immediately implemented.

There is a substantial amount of reserved state in valt, so future expansion may be able to use these, rather than bits from the immediate field. If encoding in rs1 is acceptable, then there is very little reason to reject this similar encoding within rd.
This encoding is a reasonable tradeoff of RVI complexity for RVV benefit. See #458 for further reasoning.

More bits might be difficult to justify for vlmul for potentially low use LMUL values of 3,5,6 and 7, but this is mitigated by the similar use of rd as used for rs1.

A5.

If you didn't like encoding within rs1 then even more so encoding in rd is problematic, given the reasons rd was allocated a fixed location in RVI. However, S and B formats use the field for immediate values and RVC does not honour rd either.

Is there a legitimate need to define such a complex encoding?

Aside from the encoding, some might think there is no need for LMUL of 3,5,6 and 7. Instead, software deriving them by using restricted (fractional, 3/8, 5/8 etc.) AVL values is sufficient.

As always, any extension requires extra work in standardization and validation/verification.

The field name valt is vague and too close to vault to be taken seriously. Double for the sub-code "altmul".

Krste asked for my proposals to be in this (implicit) answer format: Q1. What is the problem? Q2. What's wrong with what's there now? Q3. What is the solution you're proposing? Q4. How is it better? Q5. How is it worse? So if these posts appear in a stilted, awkward, contrived format for the particular subject material, please forgive my deficiencies in working within this framework.

Further, as is often the case, problems can be associated with related issues in a way that the consideration of both (all) is optimal for the resolution of either (any). I believe this is such a situation, and how well it lends itself (themselves) to this format is also debatable. However I made an attempt to present both main issues in a coherent manner.

David-Horner commented 4 years ago

oops.

The input to the algorithm that produces the effective LMUL and actual vlmul values is the 3 upper rs1 bits, not the vlmul bits. I'm going back to correct the text. However, the general idea remains the same. (The vlmul values are the default: what would have been there if valt were zero.)

David-Horner commented 4 years ago

see #458 where Krste states:

I don't like the idea of limiting rs1 register names now to leave space for more bits later. As long as one or more immediate fields in vsetvli have a reserved encoding, we can always reclaim rs1 bits later by declaring the reserved encoding to modify the instruction format

The same arguments (and likely aversion) are applicable to the "rd encoding" that extends the immediate values.

The difference in this case is that the bit enabling the effective LMUL values of 3, 5, 6 and 7 is a transient bit. It does not need to be retained in vtype, and should not be retained any more than the original AVL value in rs1 needs to be retained.

However, if we are willing to use an additional bit from the vsetvli immediate field, this proposal reduces to the valt encoding being just that bit.

Note: The bit allocated from the immediate field will displace other uses for the bit. The proposal, as modified, does not provide a reserved code (even in conjunction with vlmul) and thus cannot be used as an escape to an "rs1/rd encoding" format.

This revised proposal is consistent with the note in the original posting:

Note: This issue is a proposal that includes two sub-proposals:

- expanding LMUL values, and
- encoding additional vtype values in the rd field of vsetvli.

I believe the choices for each affect the weightings chosen for the other, and so I conflate them here. There is less value in rd encoding if there is no immediate use. They could be separated if there is significant challenge to either proposal.

I am certainly willing to open a new issue to reflect just the "Expanding LMUL values" if the consensus is that this would be helpful for consideration of the proposal.

David-Horner commented 4 years ago

Although technically not a part of this proposal, the decision to defer until after V1.0 would greatly impact proposal #460.

As we are down to one vsetvli immediate bit, and all used bits consume all combinations as valid, we are at the cusp where we should consider the impacts of expansion.

Expansion requires providing at least one of the following:

1) a vsetvli with a larger encoding (one bit can be immediately recovered if we allocate vsetvl elsewhere), allowing full configuration within a single instruction. Opcode space is constrained, so the expansion would be at best a few bits.
2) variants of the vsetvli instruction that window successively smaller immediate values into vtype, necessitating splitting full configuration over two (or more) instructions.
3) variants of vsetvli that repurpose rd and/or rs1 as immediate fields, with or without calculating vl, necessitating splitting full configuration over two (or more) instructions.
4) multiple vsetvli-like instructions (this is the mechanism required for custom RVV to change custom vtype bits), necessitating splitting full configuration over two (or more) instructions, unless it introduces a larger-footprint vsetvli-like instruction as in option 1.
5) a revision of vsetvli that encodes vtype bits into the rd and/or rs1 fields, allowing full configuration within a single instruction. See #460.

Splitting configuration over two immediate format instructions is problematic. It will in some cases cause transient vill state necessitating a mechanism to allow such transition, maintaining at least partial state from prior vsetvli variants (likely relaxing the all bits cleared but vill constraint), or a software fall back to vsetvl method.

Option 1 must be decided before V1.0 (otherwise it becomes an option 4 with two overlapping instructions).

Options 2 through 5 can be deferred until after V1.0, but each has impacts on

- hardware
  - fragmentation
  - pressure/necessity to support variants even if the functionality is not present
  - trapping on variants to emulate new variants that are introduced for unused functionality
- software ecosystem
  - complexity to address the hardware fragmentation (if they care), e.g. conditional vsetvli variant execution dependent upon "model"
  - optimizing the appropriate alternative means to configure a given setting, e.g. hoisting one of the variants out of the loop to allow a single immediate vsetvli variant
  - there could be multiple competing two (or more) step split update approaches

Options 2 through 4 have the additional concerns of:

- appropriate design of variants to minimize the hardware/software eco/split configuration effects
- extension of the vtype model to accommodate split configuration updates

Deferred options 2,3 and 5 need not trap on V1.0 machines as vill would be set if the remaining bit is used to differentiate.

Deferred option 4 will always trap on V1.0 machines.

Option 5 is a novel approach to the problem. If it is deferred until post V1.0:

- we can expect confusion over two disparate approaches
- the encoding into rd/rs1 will subsume the original to avoid split updates
- as a result, confusion will be compounded as the old format falls out of use but old code persists
- hardware will need to support both, increasing what has been considered a critical path on this list even though the old format is obsoleted
