A1.
Even though v0.9-draft-20200424 establishes a new meaning for the SLEN parameter, which interleaves SEW elements, there is pressure to retain SLEN=VLEN so that the byte/half-word/word/double/etc. in-register structure does not diverge from in-memory order.
The new SLEN, and the SLEN before it, were introduced to support widening operations.
This is the fundamental challenge: how to widen from a given element width to an effective width efficiently across multiple micro-architectures, while avoiding element positioning skew and long wiring lengths.
A2.
So far, vertical and horizontal approaches to accommodating double-width results from fully packed source registers have proved challenging. Each creates anomalies in the register structure.
The original vertical striping rigidly allocated power-of-2 register groups, as it compounded at each level of widening. The applied striping length partially mitigated this, but still allowed the in-register structure to mismatch the in-memory order on a per-register-group and implementation-specific SLEN basis.
The current proposal provides horizontal striping at LMUL=1 and above to mitigate this for machines with large VLEN. However, as “smaller” machines will not need this functionality, it has the potential to fragment the ecosystem, especially as (H)SLEN < VLEN introduces in-register vs in-memory anomalies. Although there are various proposals to mitigate this disparity, all retain some risk of fragmentation.
A3.
I propose an alternative approach to resolve the widening operation dilemma.
It is inherent in the existing #421 fractional fill proposal.
Proposal:
Overview:
The fundamental concept is this: rather than attempting to accommodate widened results from fully packed sources, size and format the sources so that the results can be accommodated in a fully packed target register set.
Specifically:
The two source widening operations only source from fractional register structures.
Fractional structures are defined that are 1/2 and 1/4 populated (per operand) depending upon whether quad or double widening operators are defined. See
The fractional structures allow two components of the structure to be filled and accessed independently.
(It would be an extension to access each component for 1/4, 1/8, etc.)
The structure has modes in vtype that select from these independent segments for widening and load/store.
Register multipliers (LMUL) work on all register structures, providing an effective VLEN of LMUL * VLEN. see
So, initially fill ratios are 1, 1/2 and 1/4 to support dual source widening operations and 1/8 to support load/store related enhanced single source widening.
Relevant details from #421:
The new field, vfill, fulfills two distinct purposes: fractional cluster order (fill) and selection (element location).
Corresponding mask segments are active for each selected cluster. see****
These purposes are specific to three cases for fractional data:
1) For one vector operand instructions it provides the fill degree and order.
Examples:
load/store
vclstr/vdclstr
mask ordinal
narrowing
2) For two operand single-SEW instructions it determines the participating clusters.
Examples:
vadd.vv vadd.vi vfadd.vv
vmseq.vv vmseq.vx
3) For two operand widening instructions it determines the participating clusters.
Examples:
vwadd.vv vwadd.vx vwadd.wv
The structure and values are chosen to minimize the vfill state changes in typical code sequences.
The encoding is independent of LMUL>=1 to allow register groups for all values of LMUL, from 1 to 8. vlmul is reduced to 2 bits, with LMUL values 1,2,4 and 8. Only the power of 2 limits are needed to validate the used register groups. see***
CLSTR and vclstr are defined to determine the clustering size. CLSTR is the minimum cluster size in bytes: successive elements are allocated to a cluster until CLSTR is met or exceeded, before moving on to the next cluster chunk. The CLSTR value is defined by the field vclstr in vcsr.
Although #461 defines cluster size in the context of the horizontal interleaving introduced with v0.9-draft-20200424, cluster size retains the same definition here. For further details see #461.
Define vfill, a new 3 bit field stored within the lower 11 bits of vsetvli. see*****
The following table specifies the register layout (structure) and sources established with different codes in vfill:
| vfill | One vector operand | Two vector operands | Widening | Equivalent to original LMUL |
|---|---|---|---|---|
| 000 | X0 | X0 | ~ | LMUL=1 |
| | odd : even | odd : even | odd : even | |
| 001 | - X1 | - X1 | - X1 | LMUL=1/2 |
| 010 | X1 - | X1 - | X1 - | LMUL=1/2 |
| 011 | ~ | ~ | W1 X1 | LMUL=1/2 |
| 101 | - X2 | - X2 | - X2 | LMUL=1/4 |
| 110 | X2 - | X2 - | X2 - | LMUL=1/4 |
| 111 | ~ | Y2 X2 | W2 X2 | LMUL=1/4 : Note 1 |
| 100 | - X4 | - X4 | - X4 | LMUL=1/8 |
Notes:
1 – For vfill=111 two operand, vl counts the pairs of operations.
Legend:
~ not a valid combination (reserved)
"-" gap of size equal to CLSTR size
X0 consecutively numbered elements (clusters with no gaps)
[i+n-1] .... [i+2] [i+1] [i+0] where n is the number of elements in 2 * CLSTR bytes
and i is determined by the two-cluster boundary.
X1 consecutively numbered elements (clusters with equal size gap)
[i+n-1] .... [i+2] [i+1] [i+0] where n is the number of elements in a cluster
and i is determined by the cluster boundary.
X2 same as X1 except effective cluster size is CLSTR / 2
X1 and X2 can occupy even or odd sides of gap/cluster pair.
X4 same as X1 except effective cluster size is CLSTR / 4
and is only allocated on even side.
Y2 equivalent to X2 but can occupy odd cluster location only.
These odd clusters are processed in tandem with the X even clusters, such that vl * 2 operations are performed.
W1 and W2 are equivalent to X1 and X2 but occupy the odd cluster location only.
For widening ops vs1 is sourced from this odd cluster location,
while vs2 is sourced from the even cluster location.
When vs1 = vs2 a single physical register sources both operands.
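The table and legend above can be restated as a small lookup, which may help when checking examples by hand. This is only a sketch in Python, not part of the proposal: the names `VFILL_LAYOUT` and `elements_per_operand` are hypothetical, and the capacity formula (LMUL * VLEN * fill fraction / SEW) is an assumption drawn from the "equivalent to original LMUL" column.

```python
from fractions import Fraction

# vfill code -> (populated fraction of each 2*CLSTR chunk per operand, side used).
# Transcribed from the table above; "even+odd" rows source one operand per side.
VFILL_LAYOUT = {
    0b000: (Fraction(1), "both"),         # X0: fully packed
    0b001: (Fraction(1, 2), "even"),      # - X1
    0b010: (Fraction(1, 2), "odd"),       # X1 -
    0b011: (Fraction(1, 2), "even+odd"),  # W1 X1 (widening only)
    0b101: (Fraction(1, 4), "even"),      # - X2
    0b110: (Fraction(1, 4), "odd"),       # X2 -
    0b111: (Fraction(1, 4), "even+odd"),  # Y2 X2 / W2 X2
    0b100: (Fraction(1, 8), "even"),      # - X4
}


def elements_per_operand(vfill, vlen_bits, sew_bits, lmul=1):
    """Assumed element capacity of one operand for a given vfill code."""
    fraction, _side = VFILL_LAYOUT[vfill]
    return int(lmul * vlen_bits * fraction // sew_bits)


if __name__ == "__main__":
    for code, (fraction, side) in VFILL_LAYOUT.items():
        print(f"vfill={code:03b} fill={fraction} side={side} "
              f"elements(VLEN=128,SEW=8)={elements_per_operand(code, 128, 8)}")
```

For vfill=111 two operand instructions the count above is per side, since vl counts pairs of operations (Note 1).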
One vector operand instructions:
Load exemplifies the processing. Either the odd or even cluster in the ‘gap/cluster’ or ‘cluster/gap’ pair is chosen by vfill.
For even clusters, elements are filled from the lower bits until the cluster is filled, the gap is skipped and the next cluster filled, etc. until vl is exhausted.
For odd clusters, the initial gap of CLSTR bytes is skipped, the cluster is filled, the rest (if any) of the CLSTR bytes is skipped up to the next CLSTR gap/cluster pair, and the process is repeated until vl is exhausted.
Note: the corresponding bits in v0 are used to mask elements for instructions with vm=0.
The same element numbering derived for load applies to store and all other one vector register instructions.
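The fill order described above can be sketched as a byte-offset calculation. This is a rough illustration under stated assumptions, not a definitive model: the function name and parameters are hypothetical, each gap/cluster pair is taken to span 2 * CLSTR bytes, the element size is assumed to divide the effective cluster size evenly, and no check is made against the register length.

```python
def element_byte_offsets(vl, sew_bytes, clstr, cluster_bytes, odd):
    """Byte offset of each element index 0..vl-1 within the register group.

    clstr         -- CLSTR in bytes (size of each half of a gap/cluster pair)
    cluster_bytes -- effective cluster size: 2*clstr for X0, clstr for X1,
                     clstr//2 for X2, clstr//4 for X4
    odd           -- True to fill the odd (upper) side of each pair
    """
    if cluster_bytes % sew_bytes:
        raise ValueError("sketch assumes SEW divides the effective cluster size")
    per_cluster = cluster_bytes // sew_bytes
    pair_bytes = 2 * clstr              # one gap/cluster (or cluster/gap) pair
    side_base = clstr if odd else 0     # odd side skips the initial CLSTR gap
    offsets = []
    for i in range(vl):
        chunk, slot = divmod(i, per_cluster)
        offsets.append(chunk * pair_bytes + side_base + slot * sew_bytes)
    return offsets


if __name__ == "__main__":
    # CLSTR = 8 bytes, SEW = 16 bits
    print(element_byte_offsets(6, 2, 8, 8, odd=False))  # X1 on the even side
    print(element_byte_offsets(6, 2, 8, 8, odd=True))   # X1 on the odd side
    print(element_byte_offsets(4, 2, 8, 4, odd=True))   # X2 on the odd side
```

With CLSTR = 8 and 2-byte elements, the even X1 case places elements 0..3 at bytes 0, 2, 4, 6 and then resumes at byte 16, which is the "fill the cluster, skip the gap" behaviour described for load.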
Single operand widening instructions
The selection is the same as for “One vector operand instructions”.
For sources at vfill=100, 101 and 110, the vd results are aligned right or left at effective vfill levels 101, 001 and 010, respectively. If the resultant level is 000, the result fills the full 2 cluster chunk.
Similarly, if the source is at 001 or 010, then the vd effective vfill is 000, and the result fills the full 2 cluster chunk.
For two operand single-SEW instructions:
The same element numbering derived for load applies to each vector and to the corresponding mask bits, whether selected from the even or odd clusters.
For vfill=100 or x01, both operands for the instruction are selected from the even clusters; for vfill=x10, both are selected from the odd clusters. In each case one group of elements comes from each of the two registers vs1 and vs2, and the result is stored in the corresponding elements of the even or odd cluster of vd, respectively.
For vfill=111, two operations occur for each value of vl. The even (X) elements are processed as described for vfill=101, with the result written to the element of the even vd cluster. The odd (Y) elements are processed as described for vfill=110, with the result written to the element of the odd vd cluster.
Note: the setting vfill=011 is currently reserved for two operand single-SEW operations because it would be the same as performing the operation with vfill=000.
In all cases the corresponding mask bits in v0 for each used cluster element are in effect.
For two operand widening instructions
For vfill=100, x01 or x10, double widening instructions select cluster source elements the same way as two operand single-SEW instructions do. However, the corresponding vd is at the next higher vfill level. The even/odd placement carries over: sources at vfill=101 and 110 yield results at the correspondingly larger effective vfill=001 and 010, respectively.
When vfill=001 or 010 the vd result is always in the 2 * CLSTR sized group of elements at vfill=000.
For quad widening the result is 2 vfill levels up. For a vfill=100 source the result is at vfill=001.
Note: vfill=0xx cannot be quad widened.
For vfill=011 or 111 and double widening operations, the operands for the instruction are selected from both an even and an odd cluster. For vs1 the elements are selected from the even clusters within the register. For vs2 the elements are selected from the odd clusters within the register. The vd result has an effective vfill one level higher than the sources. The result is thus either vfill=000 for a vfill=011 source, or vfill=001 for a vfill=111 source.
For quad widening vfill=011 is not allowed.
However, vfill=111 is allowed for quad widening and yields a vfill=000 vd result.
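The source-to-destination vfill rules for widening can be collected into one table. The sketch below is illustrative only: `DOUBLE_WIDEN_VD` and `widened_vfill` are hypothetical names, and quad widening is modelled as two double-widening steps. That reproduces every quad case stated above (100 gives 001, 111 gives 000, 0xx disallowed), but the results for 101 and 110 (both 000) are an inference from "2 vfill levels up" rather than something the text states directly.

```python
# Effective vfill of vd for a double-widening op, keyed by source vfill.
# vfill=000 has no entry: the widening column marks it reserved (~).
DOUBLE_WIDEN_VD = {
    0b100: 0b101,  # 1/8 even          -> 1/4 even
    0b101: 0b001,  # 1/4 even          -> 1/2 even
    0b110: 0b010,  # 1/4 odd           -> 1/2 odd
    0b001: 0b000,  # 1/2 even          -> full
    0b010: 0b000,  # 1/2 odd           -> full
    0b011: 0b000,  # 1/2 even+odd pair -> full
    0b111: 0b001,  # 1/4 even+odd pair -> 1/2 even
}


def widened_vfill(src_vfill, factor):
    """Return the effective vfill of vd, or None if the combination is not allowed."""
    if factor == 2:
        return DOUBLE_WIDEN_VD.get(src_vfill)
    if factor == 4:
        # Assumption: quad widening behaves like two double-widening levels.
        halfway = DOUBLE_WIDEN_VD.get(src_vfill)
        return None if halfway is None else DOUBLE_WIDEN_VD.get(halfway)
    raise ValueError("only double (2) and quad (4) widening are sketched")


if __name__ == "__main__":
    for src in (0b100, 0b101, 0b110, 0b001, 0b010, 0b011, 0b111, 0b000):
        print(f"src={src:03b} double={widened_vfill(src, 2)} quad={widened_vfill(src, 4)}")
```

The sketch tracks only the vfill level of vd; it does not model which side vs1 and vs2 are read from.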
A4.
Applications that know beforehand when they are going to perform widening operations can readily tailor the input to match those operations.
This design has many favourable characteristics:
The design is not ILEN32 specific. The same concerns occur regardless of instruction encoding, and thus the ILEN64 model is not hampered by it.
Logical register groups from 1 to 8 are orthogonally supported.
Widening using vfill=011 with vs1 = vs2 uses a fully packed physical register group to create another fully packed register group.
Two of the same widening instruction, executed using vfill=001 and then vfill=010, will widen a vfill=000 register group into two like-sized register groups.
Two of the same widening instruction using vfill=011, but exchanging vs1 and vs2, will for many aggregating operations perform a valid step without changing vtype.
Non-widening operations are also fully supported at all levels by this design.
A single-SEW operation at vfill=000 executes on all elements, equivalently to performing an even (vfill=001) then an odd (vfill=010) pair of the same operation.
This is a minimal vfill design. More functionality, such as full component selection at lower vfill levels (higher interleave levels), notably 1/2 CLSTR and 1/8 CLSTR, is possible. see****
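The claim that a vfill=000 operation covers the same register bytes as an even plus an odd half-fill operation can be checked with a few lines. This is a sketch under the assumption that each gap/cluster pair spans 2 * CLSTR bytes; `cluster_bytes_used` is a hypothetical helper, and the sizes chosen are arbitrary.

```python
def cluster_bytes_used(reg_bytes, clstr, odd):
    """Byte positions covered by the even or odd CLSTR-sized clusters."""
    base = clstr if odd else 0
    return {
        chunk + base + b
        for chunk in range(0, reg_bytes, 2 * clstr)
        for b in range(clstr)
    }


if __name__ == "__main__":
    reg_bytes, clstr = 32, 4           # illustrative sizes only
    even = cluster_bytes_used(reg_bytes, clstr, odd=False)  # vfill=001
    odd = cluster_bytes_used(reg_bytes, clstr, odd=True)    # vfill=010
    print(even.isdisjoint(odd))                  # the halves never overlap
    print(even | odd == set(range(reg_bytes)))   # together they cover vfill=000
```

Only byte coverage is compared here; element numbering differs between the packed and half-filled layouts.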
A5.
This proposal is as radical a departure from vertical striping as horizontal interleave via SLEN<VLEN.
Should we make such a change so near a v0.9 release? (various perception and retooling concerns)
Fractional register dependency implies less register usage.
The design appears to require more vsetvli changes. It may require augmenting widening and load/store operations to reduce that effect.
This proposal should be improved by making CLSTR/vclstr programmable.
E.g. when CLSTR is set to EW, then vfill=001 will select the even of an element pair, and vfill=010 will select the odd.
see**
Octal and higher will be defined if such higher order operations are defined.
see*** LMUL should allow all values from 1 to 8 as explained in #460
see****
For simplicity this assumes the #448 ordinal-based mask, but other mask encodings can be compatible.
see*****
But other than odd side support for 1/4 CLSTR size (1/8th EW for the “Normal” physical register format), which would enable widening from both sides and increase register use, I don’t know if it is worth it.
Same 5Q&A boilerplate goes here. -Ed.