Towards a simple fractional LMUL design.

riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension

https://jira.riscv.org/browse/RVG-122

Creative Commons Attribution 4.0 International

960 stars 272 forks source link

Towards a simple fractional LMUL design. #393

Closed David-Horner closed 4 years ago

David-Horner commented 4 years ago

Background:

Prior to LMUL, an elaborate mapping of registers numbers to various width element under different configuration settings that allowed for polymorphic operations was proposed.

LMUL was introduced in a pre-v0.5 Nov 2018 in conjunction with widening operations and SEW widths. The LMUL>1 mapping of a register group is one to a power of 2 of consecutive non-overlapping base-arch-registers. The naming uses the lowest base-arch-register participating in the register group. The number of LMUL register is diminished by the same power of 2. This design was substantially less complex than the predecessor, with simple constructs like

LMUL in powers of 2 aligning with the widening by 2 operations. Abandoning previous ideas of sequences like 1,2,3,4,5,6,8,10,16,32
consecutive registers in register groups, aligned and addressed on multiples of LMUL

This issue will look at simplest implementations of fraction LMUL.

Glossary:

base-arch registers – the 32 registers addressable when LMUL=1 register group – consecutive registers determined by LMUL>1 register sub-group – portion of physical register used by LMUL<1 SLEN - The striping distance in bits, VLEN - The number of bits in a vector register, VLMAX – LMUL VLEN / SEW . . no name is given to effective VLEN at different values of LMUL vstart - read-write CSR specifies the index of the first element to be executed by a vector instruction. ( whereas other terms are from the spec these * terms are added for this discussion)

Guidance. Fractional LMUL follows the same rules as for LMUL>=1. VLMAX applies the same.

The simplest extensions to the base retain the fundamental characteristics. Specifically then, for this proposal, ELEN, SEW (and its encoding in vtype), VLEN and, mask register zero and mask operation behaviour are not changed.

The simplest extension of LMUL to “fractional” is that the observe affects continue predictably. Specifically,

for changes in LMUL there is a corresponding change in VLMAX and
fractional LMUL changes by a factor of 2 from adjacent settings.

For LMUL >=1, VLMAX = LMUL VLEN/SEW Note: if SEW is unchanged, with variation of LMUL there is a proportional change in VLMAX. We can multiply both sides by SEW to get LMUL VLEN = VLMAX * SEW.

This table exhaustively represents this simplest extension effect when SEW is unchanged throughout:


LMUL       VLMAX * SEW        

8       8*VLEN
4       4*VLEN
2       2*VLEN
1         VLEN
1/2     VLEN/2
1/4     VLEN/4
1/8     VLEN/8

Fractional registers then have diminished capacity, 1/2 to 1/8th of a base-arch register.

The simplest mapping of fractional LMUL registers is one to one (and only one) of the base-arch registers. All 32 base-arch-registers can participate and register numbering can be the same.

The simplest overlay (analogous to the register group overlay of consecutive base-arch registers) is with zero elements overlaying. That is, the fractional register sub-group occupies the lowest consecutive bytes in the base-arch register. The bytes are in the same ascending order.

I call this iteration zero of the simplest fractional LMUL designs.

Note: Mask behaviour does not change. Mask operations read and write to a base-arch register. Base-arch register zero remains the default mask register. With this "iteration zero" design, as with LMUL>=1, fractional LMUL “register zero”s are substantially limited in their use.

There are some undesirable characteristic of this design.

Use of any fractional sub-group is destructive to the underlying base-arch register. As sub-groups have less capacity than the underlying base-arch register overall usable capacity is also diminished, up to 7/8ths of VLEN for each active sub-group.
Such sub-groups are not optimized for widening operations. There is no equivalent to SLEN to align single with widened operands.

Compare with #382 that presumed a fractional LMUL already existed and that packing of fractional registers would provide a substantial benefit.

David-Horner commented 4 years ago

A slightly less simple design to partially address the destructive nature of register overlay.

There are some undesirable characteristic of iteration zero of the simplest fractional LMUL design.

Use of any fractional sub-group is destructive to the underlying base-arch register.

As sub-groups have less capacity than the underlying base-arch register overall usable capacity is also diminished, up to 7/8ths of VLEN for each active sub-group.

Because the low (zero) elements aligning in the overlay the sub-group is in the active portion of the base-arch register the destructive impact is unavoidable. Similarly, an operation that writes to the base-arch register overwrites at least some of the register sub-group.

I use the term “active” loosely. Technically the active portion is only defined when an operation is acting on the register. Regardless, most of the time VSTART will be zero, and so the active portion would start from zero on the next operation.

However, if instead the VLMAX elements of the base-arch register and the register sub-group are aligned then judicious use of vl can avoid mutual assured destruction. Register names would remain in the one to one correlation. However, the register sub-groups would start at 1/2 VLEN, 3/4 VLEN and 7/8 vlen depending upon fractional LMUL.

VLEN 1/1     7/8     3/4             1/2                            0                                                         
---------------------------------------------------------------------
LMUL
1    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/2  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/4  xxxxxxxxxxxxxxxx
1/8  xxxxxxxx

Consider when LMUL=1 and tail-undisturbed is active and VLEN a power of 2. If vl is less than or equal 1/2 VLMAX then a LMUL=1/2 or 1/4 or 1/8 register sub-group is fully in the tail of the base-arch register. Similarly, with vl of 3/4 VLMAX or less then the tail fully encompasses a LMUL=1/8 register sub-group. The same applies for vl <= 7/8 VLMAX for LMUL=1/16 register sub-group.

VLEN    1/1     7/8     3/4             1/2                            0                                                         
------------------------------------------------------------------------
LMUL
1/2reg  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx---------vl <= 1/2 VLMAX--------
1/4reg  xxxxxxxxxxxxxxxx------------------vl <= 3/4 VLMAX---------------
1/8reg  xxxxxxxx-------------------vl <= 7/8 VLMAX----------------------

In the perfect scenario registers will all be used to their maximum with fractional LMUL support.

Note: Up to 32 base-arch registers and 96 register sub-groups can be "alive" at a given time. Only 32 can be active at a time, with a single vlset[i] instruction enabling each set of 32.

With appropriate values of SLEN, LMUL>1 can also use the reduced vl to allow consecutive fractional register sub-groups to co-exist.

Nor is the technique restricted for LMUL >=1. LMUL=1/2 can tail protect 1/4 and 1/8th; and 1/4 LMUL tail protect 1/8th.

Note: **Masks can be better managed with this design.** As with non-mask registers, appropriate vl allows the tail of a mask register, including mask register zero, to be used for fractional register sub-groups. ( better than “iteration zero” design.) Further, the fractional register sub-group can store the current LMUL significant mask bits with a single instruction:
vsbc.vvm vn, v0, v0, vmx 
# where vn is destination fractional register, vmx is any mask register.   
# vn[i] =  if (vmx's mask bit set) -1 else 0  
# v0,v0 could be any registers designation as long as they are the same
#   (or the registers have the same contents). 
and a single instruction to enable it in mask v0.
vmsne.vi v0,vn,0  # 
Note: Not only algorithms that widen/marrow may benefit. Algorithms with non-power of two data usage (consider Fibonacci based structures) may especially. The fractional sub-groups allow residence of additional data (of any SEW) and operations on them to proceed in the unused tail sections of base-arch register .

Note: *Implementations with VLEN that are not a power of 2 (say 3 2 n ) could provide the best of both worlds. Algorithms working optimally with vl at a power of 2 and fractional operations in the remaining tail. (And of course one can mix and match on a base-arch or even sub-group basis).

However, this is still not fully ideal.

Whereas many algorithms may tolerate a halving of vl, some will require substantial moderation to support a 3/4 or 7/8 VLMAX.
Hardware support is complicated by register sub-groups starting at non-zero. Those with comprehensive vstart support can “just” adjust it appropriately.
It does not address the SLEN optimization deficiency already noted. Indeed it could in some implementation aggravate the situation as propagation for widening and narrowing operations are in the opposite directions for each vs LMUL > 1.
Minimal implementations benefit least from this; consider, for example, that 7/8ths VLMAX is only meaningful when VLMAX is greater than 7.

rofirrim commented 4 years ago

Hi David,

thanks for the detailed proposals (If i read your two comments above correctly, this issue contains two proposals).

In the simple (first) proposal you mention

Such sub-groups are not optimized for widening operations.

Mind to elaborate what you meant by that?

I see LMUL<1 useful in a situation like the following:

We (a human or the the compiler) decide to vectorize a mixed-type loop that uses double (SEW=64) and float (SEW=64) data using LMUL=1 (because LMUL>1 reduces the number of registers and this loop needs all the available architectural registers we can have). We use the same number of elements for both SEW=64 and SEW=32. This means in practice for SEW=32 we use half of the register.

At some point we need to widen (half a) register in SEW=32 to a (whole) register in SEW=64 or the opposite. Let's focus on the widening but I understand the narrowing is the dual case.

As of today, where fractional LMUL does not exist, and because of the existence of SLEN, widening from SEW=32 to SEW=64 will (when SLEN != VLEN) leave the widened elements scattered (in an interleaved fashion) among two registers. Before I can continue to operate those elements as SEW=64 under LMUL=1 I need to reorganise the elements back into a single register. The best approach we've found so far (perhaps there are better ways to do this) involves a number of vrgather and vid.

[ Note: One could argue that the reorganisation is not actually needed but seems wasteful to use two registers to represent half the number of elements. ]

Just to be clear what I meant in the paragraph above, I want to do this

// v2 will contain SEW=64 values
// half of v4 (this is, up to vlmax_64-1) contains SEW=32 values 
v2[0:vlmax_64-1] = widen_32_to_64(v4[0:vlmax_64-1])

With the current widening approach, the result of widen_32_to_64 is actually scattered in two registers v2 and v3. So the above operation actually looks like this

v2[ ... ], v3 [ ... ] = widen_32_to_64(v4)
v2 = reorganize_elements(v2, v3)
continue operating with v2 under LMUL=1

The reorganize_elements step is needed if SLEN != VLEN. If SLEN = VLEN then all the elements of v4 that I cared about are already widened in v2 (at the expense of having to clobber v3 for a moment, but that should be bearable).

With LMUL<1 the operation above can be done much easier as we can set vtype to be LMUL=1/2 and do a widening (which should not be impacted by SLEN, shouldn't it?).

LMUL ← 1/2
v2 = widen_32_to_64(v4)
LMUL ← 1
continue operating with v2 under LMUL=1

This looks like a win to me because I understand the reorganize_elements operation above is much more complex than just switching for a moment to LMUL=1/2 to avoid the interleaving effect of SLEN.

A similar argument can be done with quad conversions and LMUL=1/4 (imagine working with SEW=16 and SEW=64).

I understand that LMUL<1 may explicitly mean to waive away to part of the register (as we can't really name the other parts) but this seems a fair price to me when we can't really upscale LMUL (i.e. LMUL>1) in a mixed types situation.

Does this make sense or did you envision a different use case here?

Kind regards,

jnk0le commented 4 years ago

Maybe we should consider fractional LMUL as something that, in future, can map base-arch registers to appropriate register groups in V++ state.

Lets say, V++ has 128 addresable registers (according to the leaks), then:

equivalent LMUL will be 4 times larger than the one assumed by 32bit encoding

RVV LMUL	#groups	RVV++ LMUL	#groups
1/8	32	1/2	128
1/4	32	1	128
1/2	32	2	64
1	32	4	32
2	16	8	16
4	8	16	8
8	4	32	4

base-arch source/dest register specifiers can be treated as shifted into position [6:2] of equivalent register

base-arch reg	RVV LMUL	used RVV++ registers
v1	1/4	v4
v1	1/2	v4:v5
v1	1	v4:v7
v2	2	v8:v15
v16	8	v64:v95

the whole "mapping to underlying V++ state" need to be done only when implementing V++, otherwise it will be just already proposed simplest design

In this way, there is less legacy penalty and fractionally wasted space is still reachable by 64bit opcodes.

David-Horner commented 4 years ago

Proposal to introduce register groups, SLEN and fractional SLEN into the simple register fractional LMUL model.

What has not changed: Fractional LMUL will

still be denoted as 1/2, 1/4, 1/8.
will continue to halve the VLEN to match its denotation
conceptually LMUL=1 is still adjacent to 1/2

New, but fundamentally the same as for LMUL>1

Fractional groups (Striped groups of fraction registers). A striped group of fractional registers (a fractional group) parallels LMUL>1 registers, in that:

the number of fractional registers in the group is a power of 2
the group is aligned on a multiple of the group size
all fractional registers in the group are of the same bit length.
elements are filled from 0 to vl in a round robin beginning at lowest register number.
. . filling proceeds to the next register after a striping number of bits are met.

The rest of this proposal talks about what has changed (even if some subtly).

Some convenient definitions:

Define “SEW-instructions” as those that vs1, vs2 and vd match SEW from vtype. To clarify, they are not: widening or narrowing whole register moves mask register only

Introduce register group characterization: This proposal allows fractional groups to originate a multiple levels with their width determined by that level. For example fractional groups with a physical width of VLEN/8 originated at LMUL=1/8. A short hand to identify such groups will make the narrative much more readable.

Consider LMUL>=1 register groups. They all start in LMUL=1 via a widening operation. So 1 should be in their designation even though it is superfluous without fractional LMUL.

Consider n:m format where VLEN/n is the vector length and m is number of base-arch registers in the group. Then we designate

LMUL=2 addressable registers as 1:2
LMUL=4 addressable registers as 1:4
LMUL=8 addressable registers as 1:8 and
LMUL=1 addressable registers are 1:1 (for completeness)

In the previously presented simple mappings of fractional LMUL, *there was a presumptive understanding that widening operation sourced LMUL=1/n registers widen to LMUL=2 (1/n) registers.**

This would be represented by a table such as this:

LMUL	1/8	1/4	1/2	1	2	4	8
------------
group type
1:8						x	a=0,8,16,24
1:4					x	a=0,4,8,12 ...
1:2				x	a=0,2,4,6, ...
1:1			x	a=all
2:1		x	a=all
4:1	x	a=all
8:1	a=all

a = Accessible at LMUL level by SEW instructions x = Created by widening instructions at LMUL level (Narrowing instructions also source from this LMUL)

Note: 16:1 is intentionally omitted from the diagram although it works the same.

This proposal acknowledges that such a simplistic approach can be inefficient for many reasonable implementations. It also acknowledges that some mandatory RVV instructions are comparably inefficient. vgather , slideup/down, and others similarly have to operate across lanes. And further that striped register support is already present in the base design.

So this proposal introduces fractional groups beginning with table:

LMUL	1/16	1/8	1/4	1/2	1	2	4	8
------
group type
1:8						x	a= 0,8, 16,24
1:4						x	a= 0,4, 8,12 ...
1:2				x	a=0,2, 4,6, ...
1:1				a=all
16:8			x	a= 0,8, 16,24
16:4		x	a= 0,4, 8,12 ...
16:2	x	a=0, 2,4,6, ...
16:1	a= all
8:1		a= odd
4:1		a= odd
2:1			a= odd
LMUL	**1/16	1/8	1/4	1/2	1	2	4	8

This is the same legend as above and will be assumed for all further diagrams: a = Accessible at LMUL level by SEW instructions x = Created by widening instructions at LMUL level (Narrowing instructions also source from this LMUL)

Note: 8:1 , 4:1 and 2:1 were added to the table though technically not required to illustrate fraction groups. More below.

This has two undesirable features. Both of which present trade-offs

LMUL now determines both the levels fractional size and the fractional group's size

Registers used for fractional groups are not available for fractional registers (halved at the first level)
there was no need to provide addressing to other registers in LMUL>=1 as all registers were the same physical length.

The smallest fractional register size is used as the base for LMUL grouping

this is necessary to achieve 8 levels of grouping
- however, the usefulness of the smallest vector register is generally limited to small element size.

Although it is possible to provide an even wider LMUL or additional fields in vtype to facilitate more states to address these concerns, the approach here will be to enlist the register numbers to provided context information.

Fist note that at any level the register numbers used by register groups are specific. In LMUL>=2 the only operands available to any operation (including widening and narrowing) were register groups. Widening to 1:8 can only be performed with 1:4 inputs. Converse for narrowing. Widening to 16:8 must use 16:4 inputs to parallel that behaviour. Taking both these observation together the comparable behaviour constraint can be incorporated into the instruction decoding using register addresses.

This allows widening to originate at other levels concurrently, as diagramed here:

LMUL	1/16	1/8	1/4	1/2	1	2 ...
------
group type
1:2				x	a=0,2, 4,6, ...
1:1				a=all
16:8			x	a= 0,8, 16,24
16:4		x	a= 0,4, 8,12 ...
16:2	x	a=0, 2,4,6, ...
16:1	a=all
8:8				x
8:4		x	a= 4,12, 18,20, ...
8:2		x	a= 2,6, 10,14, ...
8:1		a= odd
4:4				x
4:2			x	a= 2,6, 10,14, ...
4:1		a= odd
2:2				x
2:1			a= odd
LMUL	**1/16	1/8	1/4	1/2	1	2

Note: I dropped LMUL=4 and 8 only from the illustration. Note: 16:8 is addressable (from LMUL=1/2), but 8;8, 4:4 and 2:2 are not addressable from LMUL=1. They are however addressable from widening and narrowing instructions from LMUL=1/2.

To be continued ......

David-Horner commented 4 years ago

Lets consider in detail some rows from the last table: LMUL	1/16	1/8	1/4	1/2	1	2 ...

.... group type | 16:2 | x | a=0, 2,4,6, ... 8:1 | | a= odd | | |
.... 16:4 | | x | a= 0,4, 8,12 ... 8:2 | |x | a= 2,6, 10,14, ... .... 16:8 | | | x |a= 0,8, 16,24 8:4 || | x | a= 4,12, 18,20, ... 4:2 | | | x | a= 2,6, 10,14, ... .... 8:8 | | | | x | | | |
4:4 | | | | x | | | |
2:2 | | | | x | | | |
LMUL | 1/16 |1/8 | 1/4 | 1/2 | 1 | 2**

The most prominent feature is the register numbers, especially for LMUL=1/4 and 1/2 that have to be extensively shared.

Detailing the register group addressibility in given fractional LMUL

Looking first at column LMUL=1/8 we see a nice division of 16:2group and 8:1group addresses. Exactly what we might expect, the same as LMUL=2 where half the registers are in register groups. Unlike LMUL=2, the unused register addresses are used, specifically to address the LMUL=1/8 fractional registers (8:1group).

Looking next at column LMUL=1/4 there is a three way division of addresses. The allocation to 16:4 is the expected multiples of 4, just as LMUL=4. The allocation to 4:1 is also expected as expected, the odd addresses are for this level fractional registers. However, 8:2 does not have all the usual addresses available, because they are already used by 16:4. So 4:1 has 16 register addresses, 8:2 has 8 register addresses as does 16:4.

And finally, looking at column LMUL=1/2 there are four addressing groups. 2:1 has all the odd addresses as expected (so not on the chart). 16:8 has the expected four multiples of eight addresses. 8:4 has the multiples of 4 addresses minus the multiples of 8 (reserved for 16:8) 4:2 therefore has the remaining multiples of 2 addresses So 2:1 has 16 register addresses, 4:2 has 8, 8:4 has 4 and 16:8 has 4

I question the value of providing the 16:mgroups. Eliminating this would enable more registers for other groups. However, for now we will continue to consider it.

Use of groups that are not “SEW instruction” addressable from any LMUL level

I included 2:2, 4:4 and 8:8 groups even though they conceptually exist in the LMUL=1 level and are not addressable by LMUL=1 (for reasons explained before).

This would be like allowing LMUL=8 widening and narrowing instructions. These instructions would write to one of 2 register groups at 0 and 16, each using 16 base-arch registers. I propose such a change in #397

This is especially useful in fractional LMUL as more address space is available to the LMUL=1/2 instructions. Further, otherwise there is no widening narrowing for LMUL=1/2 through fractional groups.

Mixed group type usage

SEW instructions have access to 16 fractional registers and 16 fractional groups at all levels except LMUL=1/16. This raises the question of interoperability of these mix matched structured operands. I propose:

For op.vv SEW instructions

the register numbers identify the category of each argument.
when both input operands are of the same category the arguments are processed accordingly
when operands category mismatch the lowest grouping is used for both arguements.

For op.vs and op.vx instructions

the vector operand category is determined by its register number

For widening instructions the same rules hold. The vd must be a multiple of twice the input category depth. But it is not constrained to be one of those addressable from the next higher LMUL.

For narrowing operations,

the source register number determines the category using the next higher LMUL
the target register determines the category using the current LMUL.

SLEN and fracSLEN will be in the next installation:

to be continued .....

David-Horner commented 4 years ago

I apologize for the delay in responding.

Your insight is important. My response is below.

On 2020-03-16 12:08 a.m., jnk0le wrote:

Maybe we should consider fractional |LMUL| as something that, in future, can map base-arch registers to appropriate register groups in V++ state.

I agree that it is beneficial to consider what 64bit mapping will provide.

This is how I introduced the register remapping proposal in #382

It is almost certain that 64bit extension (RV++) will provide interoperable backward compatibility to 32bit encoding. Hence the 32bit vtype register and its lmul & sew encodings as defined today will have an effect on RVV evolution.

RVV has already imposed micro-architectural considerations in LMUL. These may be seen with the same derision as some show the delayed branch of earlier RISC designs. But the appropriate tradeoffs for the current environment is the needful balancing act.

Having said all that, the current intent is to have extended LMUL encoding explicitly present in the RVV++ opcodes. RVV++ LMUL encoding would then not be a separate entity.

However, to your point, extending the number of registers does simplify a consideration regarding internal structure of registers that are widened and allows a deepening consideration into the fractional LMUL realm.

Let us consider

RVV LMUL | #groups | RVV++ LMUL | #groups

-- | -- | -- | --

1/8 | 32 | 1/2 | 128

-- | -- | -- | --

...

1/2 | 32 | 2 | 64

-- | -- | -- | --

1 | 32 | 4 | 32

...

8 | 4 | 32 | 4

-- | -- | -- | --

The last two columns of RVV LMUL=8 imply there are 4 groups of 32 registers.

Assuming the same addressing convention used by 32bit encoding (that the lowest numbered register in a group is the name for the register set), register v0 has a depth of 32 accessing 32 striped registers.

Ouch, all of the 32bit encoding arch-base registers are usurped by a single v0 widening operation.

This can of course be avoided by 64bit not using v0 through v31 concurrently with 32bit encoding.

For RVV++ LMUL=32 this is a necessity.

However, for RVV++ LMUL<=16 this is not the case.

Heterogeneous register layouts, specifically when LMUL=8, RVV++ LMUL=32 .

Other things become significantly interesting,

There now exists heterogeneous register layouts. Those in v0-v31 and those in v32 through v128.

Are these register sets allowed to interact?

Can 64bit opcode add, say, v1 and v3 to v32 to fill in the lower 1/32 elements?

What about allowing v1 to write to any of the 46 odd registers above v32?

Would that be a VLEN move, as if v1 was using LMUL=1?

What of add v2 and v4 as if in LMUL=2, to any of the 46 even register above v32 not also a multiple of 4 or 8?

Of course 64bit can avoid the waste of register names and allow these combinations (and more) by having source and target specific “lmul” settings encoded in the instruction. With a 7bit register field, adding 4 more bits per register, 33 bits in total, is not the likely approach. A compressed format would be used as many of the combinations are of extremely low usefulness.

If this is indeed the future for 64bit encoding, then there is no RVV++ LMUL as such, no associations between lmul in vtype and RVV++ behaviour, nor an extended vtype for RV++ use. Thus the chart provides a false association for 64bit encoding. However, it can be constructive for future 32bit compressed access to consider such an association.

So what of 32bit “compressed” encoding? The reserved registers in LMUL>2 can be used for exactly this cross heterogeneous groups purpose, with likely examples suggested above.

Alternatively, or additionally as a subset, if 64bit can directly address fractional registers, then 32bit ops could also address them with the “reserved” registers.

This problem is not just present for future 64bit systems with 128 registers.

These same considerations present themselves with fractional LMUL which has a substantially partitioned and segregated encoding of fractional VLEN registers overlaying the base-arch registers.

A step back to look at benefits and limitations of lmul (and other vtype settings).

It is important to visualize what all the potential mappings are.

RVV has no provision like RISCV float registers that provide a measure of resilience against unintended use.

But more like a region of memory, at any point in time any of the base-arch registers can be in a state consistent with any LMUL with no indicator of which register grouping it “belongs”.

Similarly for fractional LMUL.

As currently formulated, the LMUL>=1 and SEW default characteristics of uniform sized striped register groups and uniform element width is applicable to most instructions. There are exceptions: the total disregard by whole register moves, the group size and specified element width ignored and by mask registers. And notably the involvement of SEW / LMUL combinations for widening and narrowing operations, but only for adjacent levels. As notable, the register groups only progress from one to many. There is no definition of a smaller striped register groups than 1.

A substantial benefit of group striping is that it scales wrt VLEN.

Even a minimal implementation can benefit from register groups.

I consider this important to maintain in a fractional LMUL implementation.

The striped grouping was designed and only works across fixed sized portions of bits.

The term sub-group has been applied to the fractional registers that are envisioned to support the same kind of scaling provided by striping LMUL. However, there is an inherent discontinuity between the sub-groups and LMUL= 1.

There is a similar discontinuity between each of the fractional sub-group levels. The striped register group approach cannot proceed over in in one direction, towards

The register groups are aligned on a multiple of a power of 2 boundary and named in like fashion by the lowest register in a group. This in itself limits the addressibility of a register group as an identifiable collection of grouped register. For example v0 can be referenced as a register group of 1,2,4 or 8 base-arch registers. Even v2 is anbigous between a group of 1 or 2. There is no provision for direct accessing on therefore a coupling of register group size and addressibility of limited reach for operations to access disjoint LMUL structures.

Note: Movements between register with differing LMUL characteristics entail striping and de-striping.

Heterogeneous register layouts, specifically when LMUL=1, RVV++ LMUL=4

Continuing to considering the hypothetical “RVV++ LMUL” setting that we dismissed for other than illustrative purposes.

We see idiosyncrasies with the LMUL=1 as well.

LMUL=1, RVV++ LMUL=4

Because RVV++ registers groups are only 4 deep, 32bit and 64bit operations can reasonably co-exist.

Further, heterogeneous register layouts are directly addressable by 32bit ops, the same as register groups generated from LMUL=4.

These register groups of 4 are the result of widening operations from register groups of 2 from 64bit ops (or 32bit ops when in LMUL=2).

Note the progression of LMUL>=1 is always works away from a base of 1. Fractional bases are not possible.

Once again, 64bit ops can mix operands but 32bit ops would need an extension to existing approach.

Under 32bit instructions, widening operations can target some register

= v32 using odd register designations.

Heterogeneous register layouts, specifically when LMUL=1/2, RVV++ LMUL=2

I haven’t reconciled the latest #393 mapping and fractional group proposal.

I believe it is useful to do so as you directed.

Thank you again for your input to this topic.

I am happy to hear your further comments.

David-Horner commented 4 years ago

I apologize for the delay in responding.

I believe the next two parts to the iteration of the "simple fractional LMUL design" provide a context to enhance the reply.

On 2020-03-12 4:06 a.m., Roger Ferrer Ibáñez wrote:

Hi David,

thanks for the detailed proposals (If i read your two comments above correctly, this issue contains two proposals).

In the simple (first) proposal you mention
Such sub-groups are not optimized for widening operations.
Mind to elaborate what you meant by that?

The second proposal is an evolution of the first.

A two part (so far) iterative enhancement that now includes striped "fractional groups" (that behave like LMUL>1 register groups) is intended to address the optimization issue.

An alternative approach to fractional groups of "spacing out" (and possibly interleaving) source of widening operations has been discussed (and proposed in the original #382 issue "the differing nature of LMUL > 1 and fractional LMUL" https://github.com/riscv/riscv-v-spec/issues/382#). It is still possible to apply in this evolving fractional LMUL design.

I see |LMUL<1| useful in a situation like the following:

We (a human or the the compiler) decide to vectorize a mixed-type loop that uses double (|SEW=64|) and float (|SEW=64|) data using |LMUL=1| (because |LMUL>1| reduces the number of registers and this loop needs all the available architectural registers we can have). We use the same number of elements for both |SEW=64| and |SEW=32|. This means in practice for |SEW=32| we use half of the register.

At some point we need to widen (half a) register) in |SEW=32| to a (whole) register in |SEW=64| or the opposite. Let's focus on the widening but I understand the narrowing is the dual case.

As of today, where fractional |LMUL| does not exist, and because of the existence of |SLEN|, widening from |SEW=32| to |SEW=64| will (when |SLEN != VLEN|) leave the widened elements scattered (in an interleaved fashion) among two registers. Before I can continue to operate those elements as |SEW=64| under |LMUL=1| I need to reorganise the elements back into a single register. The best approach we've found so far (perhaps there are better ways to do this) involves a number of |vrgather| and |vid|.

[ Note: One could argue that the reorganisation is not actually needed but seems wasteful to use two registers to represent half the number of elements. ]

Just to be clear what I meant in the paragraph above, I want to do this

|// v1 will contain SEW=64 values // half of v2 (this is, up to vlmax_64-1) contains SEW=32 values v2[0:vlmax_64-1] = widen_32_to_64(v4[0:vlmax_64-1]) |

With the current widening approach, the result of |widen_32_to_64| is actually scattered in two registers v2 and v3. So the above operation actually looks like this

|v2[ ... ], v3 [ ... ] = widen_32_to_64(v4) v2 = reorganize_elements(v2, v3) continue operating with v2 under LMUL=1 |

The |reorganize_elements| step is needed if |SLEN != VLEN|. If |SLEN = VLEN| then all the elements of |v4| that I cared about are already widened in |v2| (at the expense of having to clobber |v3| for a moment, but that should be bearable).

OK. I believe I follow, and I agree there is a concern.

With |LMUL<1| the operation above can be done much easier as we can set vtype to be |LMUL=1/2| and do a widening (which should not be impacted by |SLEN|, shouldn't it?).

|LMUL ← 1/2 v2 = widen_32_to_64(v4) LMUL ← 1 continue operating with v2 under LMUL=1 |

This looks like a win to me because I understand the |reorganize_elements| operation above is much more complex than just switching for a moment to |LMUL=1/2| to avoid the interleaving effect of |SLEN|.

I agree that this is the apparent implicit default behaviour.

Some have argued that this is widening is dead on arrival. It is seen as inherently inefficient and unacceptable to be the (only) way widening occurs.

The last two "installment" talk directly to these points.

A similar argument can be done with quad conversions and |LMUL=1/4| (imagine working with |SEW=16| and |SEW=64|).

Yes.

I understand that LMUL<1 may explicitly mean to waive away to part of the register (as we can't really name the other parts) but this seems a fair price to me when we can't really upscale LMUL (i.e. LMUL>1) in a mixed types situation.

OK.

Does this make sense or did you envision a different use case here?

I don't consider this the preferred use case, although it is a part of the overall approach.

Rather, widening operations should also, as much as possible, perform the algorithms (binary) functional operations.

Thank you very much for your comments which reinforce the presumptive widening nature of fraction LMUL.

Kind regards,

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-v-spec/issues/393#issuecomment-598057400, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAWIKIB5S6L4X7XK2YITLTRHCJZLANCNFSM4LER5CUA.

respasa commented 4 years ago

I would appreciate if someone could post a bit of the "why? / rationale" for at least one example code for fractional LMUL. For example, just take LMUL=1/2. How would SW take advantage of it?

jnk0le commented 4 years ago

I apologize for the delay in responding. Your insight is important. My response is below. On 2020-03-16 12:08 a.m., jnk0le wrote: Maybe we should consider fractional |LMUL| as something that, in future, can map base-arch registers to appropriate register groups in V++ state. I agree that it is beneficial to consider what 64bit mapping will provide. This is how I introduced the register remapping proposal in #382 It is almost certain that 64bit extension (RV++) will provide interoperable backward compatibility to 32bit encoding. Hence the 32bit vtype register and its lmul & sew encodings as defined today will have an effect on RVV evolution. RVV has already imposed micro-architectural considerations in LMUL. These may be seen with the same derision as some show the delayed branch of earlier RISC designs. But the appropriate tradeoffs for the current environment is the needful balancing act. Having said all that, the current intent is to have extended LMUL encoding explicitly present in the RVV++ opcodes. RVV++ LMUL encoding would then not be a separate entity. However, to your point, extending the number of registers does simplify a consideration regarding internal structure of registers that are widened and allows a deepening consideration into the fractional LMUL realm. Let us consider RVV LMUL | #groups | RVV++ LMUL | #groups … -- | -- | -- | -- 1/8 | 32 | 1/2 | 128 -- | -- | -- | -- ... 1/2 | 32 | 2 | 64 -- | -- | -- | -- 1 | 32 | 4 | 32 ... 8 | 4 | 32 | 4 -- | -- | -- | -- The last two columns of RVV LMUL=8 imply there are 4 groups of 32 registers. Assuming the same addressing convention used by 32bit encoding (that the lowest numbered register in a group is the name for the register set), register v0 has a depth of 32 accessing 32 striped registers. Ouch, all of the 32bit encoding arch-base registers are usurped by a single v0 widening operation. This can of course be avoided by 64bit not using v0 through v31 concurrently with 32bit encoding. For RVV++ LMUL=32 this is a necessity. However, for RVV++ LMUL<=16 this is not the case. Heterogeneous register layouts, specifically when LMUL=8, RVV++ LMUL=32 . Other things become significantly interesting, There now exists heterogeneous register layouts. Those in v0-v31 and those in v32 through v128. Are these register sets allowed to interact? Can 64bit opcode add, say, v1 and v3 to v32 to fill in the lower 1/32 elements? What about allowing v1 to write to any of the 46 odd registers above v32? Would that be a VLEN move, as if v1 was using LMUL=1? What of add v2 and v4 as if in LMUL=2, to any of the 46 even register above v32 not also a multiple of 4 or 8? Of course 64bit can avoid the waste of register names and allow these combinations (and more) by having source and target specific “lmul” settings encoded in the instruction. With a 7bit register field, adding 4 more bits per register, 33 bits in total, is not the likely approach. A compressed format would be used as many of the combinations are of extremely low usefulness. If this is indeed the future for 64bit encoding, then there is no RVV++ LMUL as such, no associations between lmul in vtype and RVV++ behaviour, nor an extended vtype for RV++ use. Thus the chart provides a false association for 64bit encoding. However, it can be constructive for future 32bit compressed access to consider such an association. So what of 32bit “compressed” encoding? The reserved registers in LMUL>2 can be used for exactly this cross heterogeneous groups purpose, with likely examples suggested above. Alternatively, or additionally as a subset, if 64bit can directly address fractional registers, then 32bit ops could also address them with the “reserved” registers. This problem is not just present for future 64bit systems with 128 registers. These same considerations present themselves with fractional LMUL which has a substantially partitioned and segregated encoding of fractional VLEN registers overlaying the base-arch registers. A step back to look at benefits and limitations of lmul (and other vtype settings). It is important to visualize what all the potential mappings are. RVV has no provision like RISCV float registers that provide a measure of resilience against unintended use. But more like a region of memory, at any point in time any of the base-arch registers can be in a state consistent with any LMUL with no indicator of which register grouping it “belongs”. Similarly for fractional LMUL. As currently formulated, the LMUL>=1 and SEW default characteristics of uniform sized striped register groups and uniform element width is applicable to most instructions. There are exceptions: the total disregard by whole register moves, the group size and specified element width ignored and by mask registers. And notably the involvement of SEW / LMUL combinations for widening and narrowing operations, but only for adjacent levels. As notable, the register groups only progress from one to many. There is no definition of a smaller striped register groups than 1. A substantial benefit of group striping is that it scales wrt VLEN. Even a minimal implementation can benefit from register groups. I consider this important to maintain in a fractional LMUL implementation. The striped grouping was designed and only works across fixed sized portions of bits. The term sub-group has been applied to the fractional registers that are envisioned to support the same kind of scaling provided by striping LMUL. However, there is an inherent discontinuity between the sub-groups and LMUL= 1. There is a similar discontinuity between each of the fractional sub-group levels. The striped register group approach cannot proceed over in in one direction, towards The register groups are aligned on a multiple of a power of 2 boundary and named in like fashion by the lowest register in a group. This in itself limits the addressibility of a register group as an identifiable collection of grouped register. For example v0 can be referenced as a register group of 1,2,4 or 8 base-arch registers. Even v2 is anbigous between a group of 1 or 2. There is no provision for direct accessing on therefore a coupling of register group size and addressibility of limited reach for operations to access disjoint LMUL structures. Note: Movements between register with differing LMUL characteristics entail striping and de-striping. Heterogeneous register layouts, specifically when LMUL=1, RVV++ LMUL=4 Continuing to considering the hypothetical “RVV++ LMUL” setting that we dismissed for other than illustrative purposes. We see idiosyncrasies with the LMUL=1 as well. LMUL=1, RVV++ LMUL=4 Because RVV++ registers groups are only 4 deep, 32bit and 64bit operations can reasonably co-exist. Further, heterogeneous register layouts are directly addressable by 32bit ops, the same as register groups generated from LMUL=4. These register groups of 4 are the result of widening operations from register groups of 2 from 64bit ops (or 32bit ops when in LMUL=2). Note the progression of LMUL>=1 is always works away from a base of 1. Fractional bases are not possible. Once again, 64bit ops can mix operands but 32bit ops would need an extension to existing approach. Under 32bit instructions, widening operations can target some register = v32 using odd register designations. Heterogeneous register layouts, specifically when LMUL=1/2, RVV++ LMUL=2 I haven’t reconciled the latest #393 mapping and fractional group proposal. I believe it is useful to do so as you directed. Thank you again for your input to this topic. I am happy to hear your further comments.

Thanks for expanding it into various cases.

What I actually thought about that mapping, was the possibility to use the whole regfile capacity by "legacy" software, which is about to be punished for "not being aware of extra registers". So the V++ won't come with a mandatory compromise of expanding PRF from like 4kiB to 16KiB, reducing VLEN to accomodate extra architectural registers or even extra weird state transitions for non-legacy software.

Completely separating V++ registers from base-arch , for me, is a no-go.

David-Horner commented 4 years ago

On 2020-03-21 10:51 p.m., jnk0le wrote:

What I actually thought about that mapping, was the possibility to use the whole regfile capacity by "legacy" software, which is about to be punished for "not being aware of extra registers".

There are a number of approaches that could be used, but I agree with you that 32bit constrains 64bit.

An approach is to expand the instruction to one that sets the mode to concatenate the 4 consecutive registers, shifts the 32bit register 5 bits left by two zero fill, use them as the base register and beginning at that location as one logical register that has 32bit aware VLEN = 4 * 64bit VLEN. This would be transparent 32bit, they would only see a register 4 times as long, but 64bit could individually use the physical registers in that concatenation. Note: such physical register concatenations would also be a available in the 64bit instruction encoding.

So the V++ won't come with a mandatory compromise of expanding PRF from like 4kiB to 16KiB, reducing VLEN to accomodate extra architectural registers or even extra weird state transitions for non-legacy software. Yes. I believe such a scenario can be avoided. (And my next proposal removes the SLEN visibility, so if adopted, even less likely to have quirks. ;)

Completely separating V++ registers from base-arch , for me, is a no-go.

Agreed. And many would agree with you.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/riscv/riscv-v-spec/issues/393#issuecomment-602139902, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAWIKJC34QLB2NEPWTWMODRIV4LTANCNFSM4LER5CUA.