Allow zeroing tail as an implementation option

riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension

https://jira.riscv.org/browse/RVG-122

Creative Commons Attribution 4.0 International


Allow zeroing tail as an implementation option #367

Closed rofirrim closed 4 years ago

rofirrim commented 4 years ago

We (at the BSC) are aware of the trade-offs described for implementations when it comes to choosing between an undisturbed tail and a zeroing tail. However, we believe the option of implementing zeroing of the tail in the V-extension has to exist. In particular, for implementations tailored for the High-Performance Computing (HPC) market, undisturbed tail poses a problem for implementations using renaming "as additional cycles are required to read out old tail elements to copy to the tail of the new destination physical register"[1].

However, we are also sensitive to the fact that zeroing complicates smaller implementations of the V-extension not directed at the HPC market and as such it is regarded as undesirable.

Thus we suggest that the possibility of zeroing the tail exist in a way that adds little burden if there is no intent to support zeroing.

This proposal adds the following architectural changes:

Add a new 2-bit field in bits 9:8 of the vtype CSR, called vtail, with the following meaning:

00 preferred behaviour of the implementation (either undisturbed or zeroing)
01 undisturbed tail
10 zeroing tail
11 (reserved)

Add a new (unprivileged) RO CSR called vtaildefault. Bits 1:0 of this CSR state the preferred behaviour of the implementation, and their value can only be 01 or 10.

An implementation of V-ext, under this proposal, must always implement undisturbed tail, so the minimal implementation of this proposal simply hardcodes vtaildefault to 01. The reset state of an implementation always sets vtail to 00.

Changing the tail behaviour can be done using vsetvli:

If no tail behaviour is specified the preferred behaviour is used (i.e. vtail is left as 00)
```
vsetvli x1, x2, e64
vsetvli x1, x2, e64,m1
```
If the operand u appears after the length multiplier, undisturbed tail is chosen and vtail is set to 01.
```
vsetvli x1, x2, e64,m1,u
```
If the operand z appears after the length multiplier, zeroing tail is chosen and vtail is set to 10
```
vsetvli x1, x2, e64,m1,z
```
If the implementation does not support zeroing, the vill bit of vtype is set.

Execution of an instruction then honours the tail behaviour in vtail:

The tail elements during a vector instruction’s execution are the elements past the current vector length setting.

When vtail = 01 the tail elements do not raise exceptions, and do not update any destination vector register group.
When vtail = 10 the tail elements do not raise exceptions, but do zero the results in any destination vector register group.
When vtail = 00 the implementation behaves either as vtail = 01 or vtail = 10.

Note: under this proposal, when vtail = 00 software may not rely on the actual contents of the tail of the destination vector register group unless it knows, beforehand, the preferred behaviour of the implementation, for instance having read vtaildefault.
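The semantics above can be sketched as a small software model. This is only an illustration of the proposal as worded in this post, not part of any ratified spec: the `VTail` enum, `vtail_of`, and `write_result` are hypothetical names, and a vector register is modelled as a plain Python list of elements.

```python
from enum import IntEnum

class VTail(IntEnum):
    """Proposed vtail encodings (0b11 is reserved and deliberately absent)."""
    PREFERRED = 0b00    # use the implementation's preferred behaviour
    UNDISTURBED = 0b01
    ZEROING = 0b10

VTAIL_SHIFT = 8  # the proposal places vtail in vtype bits 9:8

def vtail_of(vtype: int) -> VTail:
    """Extract the proposed vtail field; raises ValueError for reserved 0b11."""
    return VTail((vtype >> VTAIL_SHIFT) & 0b11)

def write_result(old_vd, result, vl, vtail, vtaildefault=VTail.UNDISTURBED):
    """Model a vector write: elements 0..vl-1 come from `result`,
    elements past vl (the tail) follow the selected tail behaviour."""
    if vtail == VTail.PREFERRED:
        # vtaildefault may only be UNDISTURBED or ZEROING under the proposal.
        vtail = vtaildefault
    body = list(result[:vl])
    if vtail == VTail.ZEROING:
        tail = [0] * (len(old_vd) - vl)
    else:
        tail = list(old_vd[vl:])
    return body + tail
```

With `vl = 2` on a four-element register, the undisturbed mode keeps the last two old elements while the zeroing mode replaces them with zeros; with `vtail = 00` the result depends on `vtaildefault`, which is why software cannot rely on the tail contents in that mode.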

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc

vbx-glemieux commented 4 years ago

This seems quite clever, and I do like it. I do have some further thoughts which I'll add below... some of them involve replacing or changing this proposal in more radical ways.

(1) Consider adding "tail undefined" as one of the mode options. This is easier to implement than tail zeroing, but gives the same performance benefits. (I know of the information/security risks, but in some applications, particularly lightweight embedded use, that is not an issue.) Within the "tail undefined" mode, there can be a recommendation that "implementations which care about information leakage should write zeros to the tail under this mode" (note: I would still consider keeping the "tail zeroed" mode, though if you read below I can't see a direct programmer use for it so it could be considered for removal)

(2) This proposal assumes there is only one way to preserve tail elements in OoO, where reading the tail of the destination is necessary so it can be rewritten to the physical register being renamed to vd. This extra readout imposes overhead, which is undesired because it impacts performance -- in fact, that's the entire motivation for this proposal. However, there are alternative ways to implement this -- I have one alternative in mind that has a lower hardware cost to implement than tail zeroing, still allows OoO and retains its performance benefits, and yet leaves the original tail elements unmodified. That is, I don't think this proposal is absolutely necessary to achieve performance, unless you specifically want the feature of tail zeroing and you are fixated on that being the only way to achieve performance.

(3) Since this proposal introduces additional state, and requires portable software to provide multiple implementations depending upon the underlying microarchitecture, I believe it is prudent to look at other ways of doing this.

(4) The programmer shouldn't have to write two sequences, one for performance W on architecture family X and another for performance Y on architecture family Z, yet that's what will happen if this is adopted.

(5) The proposal is designed and written from the perspective of computer architects, who are the minority of users/readers of the spec. I believe it should be designed and written for programmers, who will be using the spec constantly to produce new programs. The ISA is a contract to those users, not to the microarchitectural designers.

From this new point of view, the programmer is thinking "which mode is the most natural for the code sequence that I am going to write?". Eg, the programmer wants to say "the code below depends upon the tail being zeroed", or "the code below depends on the tail remaining undisturbed" or "the code below does not depend upon the tail". The expected use:

(a) Most code sequences will use the last option, where it does not depend upon the tail so it doesn't care which option.

(b) In only some (presumably short) code sequences will it want the tail to remain undisturbed; in these cases, the code is shorter and likely faster (less data copying, less register spilling, etc) if the tail is left undisturbed.

(c) Although I can't think of any cases where the programmer wants to say "the code below depends on the tail being zeroed", I suppose it might be possible.

Why does this perspective matter? Because I don't think the ISA should expose too much about "runtime modes" for performance. These quickly become stale when there are new ways to do things. Once the programmer's intent is known, the underlying microarchitecture can do what it wants to implement that intent with maximum performance.

I don't think it's a clean design to ask the programmer to change the code sequence depending upon the underlying microarchitecture (though that will happen regardless, we should try to limit how often such divergences may occur by more carefully designing the spec).

I have been thinking of another way to achieve this without having modes. For example, do we need to define the behaviour of certain instructions differently (eg, most instructions zero the tail, but these small number of instructions -- such as vslide -- preserve it?). Or, do we need to add an instruction or two, perhaps a "tail copy" instruction that gets macro-op fused with its successor?

Guy

(Sent by email on Wed, Feb 5, 2020, in reply to Roger Ferrer Ibáñez's original post above.)

David-Horner commented 4 years ago

I am very wary of complicating the software visible tail behaviour.

Specifically, I am opposed to a change in default behaviour to support a facility that is admittedly intended for a sub-market (HPC).

First the good characteristics.

1) Program control of tail behaviour on a (potentially) vector operation (instruction) basis.

a) Facility can enable explicit tail behaviour (unchanged or zeroed).

b) Facility uses an existing instruction needed by the code and no additional control is needed.

c) Facility uses existing vtype register for state information (tail zero or unchanged) and no other state is needed.

d) If only (a), the explicit tail behaviour, is utilized, (b) and (c) are sufficient.

There is no need for the vtaildefault CSR, neither to set it nor to interrogate it.

And the bad.

1) As proposed this could not reasonably be added post ratification.

It is too invasive, demanding EE support.

The advisement highlights the issue that pre-extension code (in which vtail=00 is always true) must be written agnostic to tail behaviour. This would most certainly not be the case, as Section 5 of the git document highlights potential software benefits of tail undisturbed.

> Note: under this proposal, *when vtail = 00 software may not rely on the actual contents of the tail of the destination vector register group* unless it knows, beforehand, the preferred behaviour of the implementation, for instance having read vtaildefault.

2) The advisement also implicitly acknowledges code that is agnostic to tail behaviour, but provides no support to identify or utilize it.

3) It conflates two performance benefits: OoO register renaming and software reliance on tail zeroing.

4) It is unnecessarily complex, among other things adding additional state that must be saved on context switch.

5) It re-opens the tail-zero/unchanged debate/ambiguity by providing an overriding default behaviour.

6) Because of the above, it complicates the software ecosystem unnecessarily without providing specific benefits.

This counter proposal addresses these concerns, retaining the good points and avoiding the bad.

Counter Proposal:

00 undisturbed tail is required
01 code tolerates either undisturbed or zero tail (or any other behaviour in tail portion)
10 tail zeroing is required
11 (reserved)

This has the same pros as the original.

Addresses bad aspects:

2) It appears Guy is thinking along similar lines and this addresses the agnostic concern Guy raises.

For statically managed vsetvli, the linkage editor will change the agnostic setting to "requires undisturbed" for processors that only support the base.

3) It separates the two optimizations into settings 01 and 10.

4) No other CSRs are required.

5) Default behaviour is established unambiguously: undisturbed tail is required.

6) Only the agnostic setting and the linkage editor's change to 00 are necessarily visible to the software ecosystem.

1) as a result of addressing these issues, this can be defined as an extension.
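The counter proposal's encoding can be illustrated with a short sketch. The function names, the `supports_zeroing` and `prefers_zeroing` hardware traits, and the choice to flag reserved or unsupported requests with vill are assumptions made for this example; the counter proposal itself allows "any other behaviour" in the agnostic case, whereas this sketch only picks between undisturbed and zeroing.

```python
# Encodings taken from the counter proposal above.
UNDISTURBED, AGNOSTIC, ZEROING, RESERVED = 0b00, 0b01, 0b10, 0b11

def resolve_vtail(requested, supports_zeroing, prefers_zeroing=False):
    """Return (behaviour, vill) for a requested vtail value.

    `supports_zeroing` and `prefers_zeroing` are hypothetical hardware traits.
    """
    if requested == UNDISTURBED:
        # Always available: undisturbed is the mandatory baseline.
        return "undisturbed", False
    if requested == AGNOSTIC:
        # Code does not care, so the hardware picks what is cheapest for it.
        if supports_zeroing and prefers_zeroing:
            return "zeroing", False
        return "undisturbed", False
    if requested == ZEROING and supports_zeroing:
        return "zeroing", False
    # Unsupported or reserved request: mark the configuration as illegal.
    return None, True

def relink_for_base(requested):
    """Hypothetical link-time rewrite for processors that only support the base:
    agnostic requests become "requires undisturbed" (00)."""
    return UNDISTURBED if requested == AGNOSTIC else requested
```

The point of the sketch is that agnostic code runs unchanged everywhere, while only code that explicitly requires zeroing can hit the illegal case on base-only hardware.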

David-Horner commented 4 years ago

I left some details unspecified.

An additional operand is needed for vsetvli to specify the "tail agnostic" setting. I would recommend that u and z not be used for tail undisturbed and zero respectively, but rather a longer acronym that more clearly states the setting. I agree that absence of all these operand values should mean the tail undisturbed default. As a strawman I propose tailag, tailun, and tailzero as hopefully self-explanatory acronyms.

I'm OK with vill being set if vtail is unsupported. However, I believe we should provide a note that emphasizes that implementations are free to trap on vsetvl{i}:

- on any specific combination of vtype values;
- conditionally (typically with a CSR configuration setting).

The note should also recommend not trapping on implementation-supported vtype settings, and highlight the benefit of immediate trapping (either explicitly itemizing and explaining, or referencing another document). Some of those benefits: definitive identification of the instruction, allowing precise determination of arguments, which in turn allows setting an equivalent vtype (in a typical case, substituting tailun for tailag).

Some open questions and observations:

Although an implementation could conceivably only support tailag and tailzero, emulating tailag via traps on the non-vsetvl{i}, are we wanting to mandate native tailag support?

tailag only implies that tail behaviours other than undisturbed or zeroed are possible. Do we want to permit them explicitly?

Guy appears to recommend that tailag be the default. I like this approach; it aligns with programmer thinking/awareness. And it can be done in software (as explained in the prior note, with link editor support). It does make the software model more complex: we provide a facility that in reality does not exist on some (perhaps most) implementations.

I see no way to reasonably make tailag the default hardware standard. We anticipated that programmers will want explicit behaviour (tailun or tailzero), as there are software benefits to them. And as we believe software benefits most from tailun, we chose that as the (hardware) default.

We can formulate this proposal as an optional extension to the base.

aswaterman commented 4 years ago

At risk of stating the obvious, the granularity at which registers are renamed is an implementation choice, and extra rename state can eliminate this overhead. For a really big machine with, say, a 1024-bit datapath and VLEN=4096, renaming at the 1024-bit granularity avoids the need to spend extra cycles copying old values. The rename table snapshots are not inconsiderable in this case (~1 Kbit apiece), but compare that to the ~256 Kbit PRF.
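The figures in this comment can be reproduced with a back-of-the-envelope calculation. The sketch below is an assumption-laden illustration: it supposes 32 architectural vector registers, 64 physical registers (a count chosen here because it matches the quoted ~256 Kbit PRF, not one stated in the comment), and one rename-table entry per granule holding a tag wide enough to name any physical granule.

```python
def rename_cost(vlen_bits, granule_bits, arch_regs=32, phys_regs=64):
    """Return (rename_table_bits, prf_bits) for granule-level renaming.

    arch_regs and phys_regs are illustrative assumptions, not spec values.
    """
    granules_per_reg = vlen_bits // granule_bits
    phys_granules = phys_regs * granules_per_reg
    # Bits needed to name one physical granule.
    tag_bits = (phys_granules - 1).bit_length()
    # One tag per architectural granule forms a rename-table snapshot.
    table_bits = arch_regs * granules_per_reg * tag_bits
    prf_bits = phys_regs * vlen_bits
    return table_bits, prf_bits

table_bits, prf_bits = rename_cost(vlen_bits=4096, granule_bits=1024)
print(table_bits, prf_bits)  # 1024 262144
```

Under these assumptions a snapshot is 1024 bits (about 1 Kbit) against a 262144-bit (256 Kbit) physical register file, which is the ratio the comment appeals to.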

Have you got a specific machine configuration in mind where this is truly problematic?

vbx-glemieux commented 4 years ago

On Wed, Feb 5, 2020 at 7:10 PM David-Horner notifications@github.com wrote:

> Guy appears to recommend that tailag be the default. I like this approach it aligns with programmer thinking/awareness. And it can be done in software (as explained in the prior note with linkeditor support). It does make the software model more complex. We provide a facility that in reality does not exist on some (perhaps most) implementations.
>
> I see no way to reasonably make tailag the default hardware standard. We anticipated that programmers will want explicit behaviour (tailun or tailzero) as there are software benefits to them. And as we believe software most benefits tailun we chose that as the (hardware) default.

I'm not recommending tailag be the default. I support tailun as the default, for the reasons already given.

However, I'm saying that most of the time the programmer would be fine with tailag. By adopting tailun, that fits with the "most of the time" use as well, and offers the performance benefits, which is why it should be default.

For the implementations that think tailzero is needed, I am pretty sure there are other implementation workarounds that do not require zeroing but still allow OoO execution. So, I do not advocate including tailzero as an option for implementation reasons. I do support it if we can justify it at the software/programming level as an important use case. I would like to see this use case.

If such a use case exists, then I suggest we think about a more restrictive form of tailzero. Eg, instead of writing all tails with zeros, perhaps we keep tailun behaviour but add a single unary instruction called "vtailzero" that zeros all elements at positions from vl to VLMAX-1 in the specified register or register group. This achieves the desired software functionality, which can be implemented efficiently through fusion with a preceding or successor instruction, and does not require implementing any special way of writing 0s for practically all executed instructions (just those tagged with vtailzero).
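As a rough illustration of the suggested instruction, the sketch below models a register as a Python list whose length plays the role of VLMAX. The function name mirrors the hypothetical "vtailzero" instruction from the paragraph above; nothing here reflects an existing instruction.

```python
def vtailzero(reg, vl):
    """Zero the elements at positions vl .. VLMAX-1, leaving 0 .. vl-1 intact.

    `reg` is a list of elements; len(reg) stands in for VLMAX.
    """
    return list(reg[:vl]) + [0] * (len(reg) - vl)

# An undisturbed-tail write followed by vtailzero gives the same register
# contents that a tail-zeroing write would have produced.
old_vd = [7, 7, 7, 7]
result = [1, 2, 3, 4]
vl = 2
undisturbed = result[:vl] + old_vd[vl:]   # tailun behaviour
print(vtailzero(undisturbed, vl))         # [1, 2, 0, 0]
```

Only the instructions explicitly followed (or preceded) by such an operation would pay for zeroing, which is the restriction the comment argues for.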

And, like Andrew has just posted, tailzero option is unnecessary for performance if you use an adequate renaming strategy. (again, this is why it shouldn't be in the ISA based solely on implementation performance)

Guy

rofirrim commented 4 years ago

Thanks everyone for the comments. Much appreciated. I can't answer all of them (I'm a compiler guy actually, my colleagues will address them) but I'd like to clarify the stance that led to this proposal.

@David-Horner (counter-)proposal of having an agnostic tail is really good. Even better than my initial proposal because undisturbed is still a baseline that is always available. Now I realise that the "preferred tail behaviour" is not desirable and being able to state "I don't care what you do in the tail" is much better.

We most of the time don't care about what ends in the tail. So this is why it felt unnecessary to us to have to honour the undisturbed tail semantics. I'm aware of the cases where undisturbed is better, and those do justify always having undisturbed available. In short: undisturbed must stay.

However, we need a way to realize the agnostic case. Undisturbed is a way. Another one is zeroing. Before posting this proposal we did internally discuss having "undefined" values in the tail. But, as @vbx-glemieux already pointed out, it comes with security risks and, in a post-Spectre world, I would be extremely cautious to open new avenues in which we leak implementation details. Hence zeroing seemed reasonable.

In line with what @David-Horner suggested, we might simplify the proposal even further by making vtail a single bit that means either "requires undisturbed in the tail" or "doesn't care what goes in the tail". For the latter case, we can make what goes in the tail implementation-defined. Zeroing is a valid implementation, so are undisturbed and undefined (modulo security issues); even "all ones" (not necessarily meaningful) would be a valid implementation.

Under that simplified angle, a vtailzero instruction doesn't seem necessary to me.

Kind regards.

solomatnikov commented 4 years ago

This was discussed in WG meeting and as @aswaterman pointed out is not actually a problem for OOO processor with renaming. Renaming has to be done at least at the granularity of individual vector register, not vector register group, limiting performance impact. Of course, renaming can be done at finer granularity.

opalomar commented 4 years ago

Hi all,

this is Oscar from the BSC hardware team. Thank you all for the comments and suggestions. This is helpful, there are many valid and interesting points.

We are aware that it could be possible to use the renaming table rather than copying values to implement undisturbed tail. In the example from @aswaterman, that is a perfectly reasonable solution.

However, this does not work well for us. In our case, we have a VLEN = 256 * 64 = 16384, and a data path of 64 * 8 lanes = 512. That means having 32 individual "sub-renamings" per register, so the overheads are much larger than in the example @aswaterman posted, and doing this would complicate the renaming logic of our design a lot. There are multiple ways to implement this, but freeing registers would be complex, and we may need indirection in accessing the physical register. I believe we may lose many advantages of vectors with this approach.
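The "32 sub-renamings" figure follows directly from dividing the register length by the datapath width. The sketch below shows that arithmetic and compares it with the 4096/1024 configuration mentioned earlier in the thread; the count of 32 architectural registers and the helper name are assumptions of this example.

```python
def sub_renamings(vlen_bits, datapath_bits):
    """Number of datapath-wide chunks that make up one vector register."""
    return vlen_bits // datapath_bits

ARCH_REGS = 32  # assumed number of architectural vector registers

bsc = sub_renamings(16384, 512)       # VLEN = 16384, datapath = 512
earlier = sub_renamings(4096, 1024)   # configuration from the earlier comment

print(bsc, bsc * ARCH_REGS)           # 32 1024
print(earlier, earlier * ARCH_REGS)   # 4 128
```

Tracking every chunk separately would mean roughly eight times as many rename entries as in the earlier example, before accounting for the extra logic needed to free and index them.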

I hope this clarifies why we are not comfortable with the current status. We are happy to implement in our design support for undisturbed tail with lower performance, we are simply looking for room in the specification to allow executing instructions at higher performance (when undisturbed tail isn't necessary).

David-Horner commented 4 years ago

The discussions in WG were in the context of the appropriate default tail behaviour. Agnostic was dismissed early in the tail default discussion. The general consensus (as I perceived it) was that deterministic behaviour was preferred, for numerous philosophical and perception/practical reasons. The remaining choices were zero or undisturbed for the default.

With undisturbed, software will opportunistically use such regions to avoid spills, etc. And there are some algorithms that could benefit from processing successively from small vl to larger vl keeping the tail data intact.

However, a large body of code can be written that is tail contents agnostic. This code provides an opportunity to optimize hardware. Specifically, unmasked vector operations do not need to read the target register set if agnostic is enabled.

The vector ISA has catered to simple implementations, with, among other things, restrictions on overlapping source and destination registers. It is difficult to assess how large a systematic bias is present favouring simple and other "mainline" implementations. The WG discussion noted that masked operations do need to read the target set, so support of undisturbed was "almost free". That is not true when multiple work units are enabled to provide parallel operations. Most work units could be optimized for unmasked operation and thus save substantial real estate, complexity and power without "tail undisturbed".

Roger's minimal single bit proposal targets this possibility. It could be deferred as it can be implemented as a standalone extension. However, if the whole ecosystem, and not just HPC sub-market is to benefit from tail agnostic code, support needs to be provisioned early on.

I suggest we allocate currently unused bits in vtype as reserved for custom use. Of the total 11 bits mapped into vsetvli, 7 are in use with lmul/sew/ediv. Of the 4 remaining, I suggest 1 bit (bit 10 in vtype) be reserved for custom use. This is 25% of the remaining bits directly addressable by vsetvli. I further suggest we allocate bit 11 for custom use. This will allow a single load immediate to provide both lower custom use bits. And further, I suggest we allocate further bits for custom use from vtype [30:12]. Doing so now will allow proposals early on in this current design and testing phase that may be incorporated as their merit is empirically assessed.

David-Horner commented 4 years ago

Miscellaneous thoughts:

Appropriateness of bit 11 of vtype as custom use - bit 11 is particularly problematic to use with vsetvl, at least as part of the standard. A load immediate will generate a negative value which would set vill when used with vsetvl. To generate the bit 11 itself requires more than one instruction. However, as a custom bit, setting it can have any meaning, including

ignoring the vill
zeroing bits [31:12]
complementing all the bits (or just those [31:12])

Indeed, this is behaviour that would make the use of load immediate viable but would not be acceptable in the standard vector extension.
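The sign-extension problem with bit 11 can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming XLEN = 32, a single `addi rd, x0, imm` as the "load immediate", and vill sitting in the most significant bit of vtype; the helper names are hypothetical and exist only for this illustration.

```python
XLEN = 32  # assumed register width for this illustration


def sext12(imm12: int) -> int:
    """Sign-extend a 12-bit immediate, as an I-type instruction such as addi does."""
    imm12 &= 0xFFF
    return imm12 - 0x1000 if imm12 & 0x800 else imm12


def addi_from_zero(imm12: int) -> int:
    """Bit pattern left in rd by `addi rd, x0, imm12` (unsigned, XLEN bits)."""
    return sext12(imm12) & ((1 << XLEN) - 1)


def vill_set(value: int) -> bool:
    """True if the most significant bit (assumed position of vill) is set."""
    return ((value >> (XLEN - 1)) & 1) == 1


with_bit10 = addi_from_zero(1 << 10)  # bit 10 only: stays a small positive value
with_bit11 = addi_from_zero(1 << 11)  # bit 11: sign-extends into all upper bits

print(hex(with_bit10), vill_set(with_bit10))
print(hex(with_bit11), vill_set(with_bit11))
```

Under these assumptions a value with only bit 10 set survives a single-instruction load, while bit 11 drags every upper bit (including the assumed vill position) along with it.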

Renaming can help mitigate the tail undisturbed cost but does not eliminate it. Again, in the context of determining the default tail behaviour this mitigation helped tip the scale to tail undisturbed. However, increased granularity causes non-linear increased overhead. If vl is not at the end of a granule, the partial fill of the tail elements requires a read of the target register that might otherwise not be needed (and is definitely not needed for an unmasked instruction).

tailag as default software stance

Tail undisturbed is established as the default hardware behaviour. Software however does not need to assume that. Rather it is better to have software assume that the hardware does not preserve tail data unless explicitly requested. I believe this is the corollary to Guy's statement:

From this new point of view, the programmer is thinking "which mode is the most natural for the code sequence that I am going to write?". Eg, the programmer wants to say "the code below depends upon the tail being zeroed", or "the code below depends on the tail remaining undisturbed" or "the code below does not depend upon the tail". The expected use:

(a) Most code sequences will use the last option, where it does not depend upon the tail so it doesn't care which option.

(b) In only some (presumably short) code sequences will it want the tail to remain undisturbed; in these cases, the code is shorter and likely faster (less data copying, less register spilling, etc) if the tail is left undisturbed.

If we can succeed in encouraging through the software ecosystem this mindset and code annotation, then all the code can benefit from BSC’s proposed extension.

kasanovic commented 4 years ago

My experience is that tail undisturbed is useful behavior in some common idioms, including reductions, while tail zeroing is rarely useful, so I'd agree that if we're going to support options that the default is undisturbed and the option can be "don't care".

I don't actually believe there are security implications to "don't care", the context switch code should zero/save/restore the state explicitly regardless. The don't care state will hence come from the same security domain.

There are software portability concerns however. Bugs won't be portable between systems with the same VLEN, which will irritate programmers who like to see their bugs' behavior preserved reproducibly, including migrating between big/little cores with the same VLEN but different microarchitectures.

The renaming cost is purely a cost/performance tradeoff. There is no need to rename at the granularity of a single beat of vector execution. For the BSC machine, it's OK to only rename at the granularity of every four beats (2048 bits). In rare cases where there's enough vector issue bandwidth and enough vector ILP to otherwise fill functional unit pipelines with different instructions every clock cycle there could be some slowdown from coarser execution granularity, but for example, NVIDIA GPUs always execute four beats atomically (last systems I checked). I don't believe renaming at sub-register granule has any additional complexity over renaming to handle LMUL (not saying it isn't complex, just not additional complexity beyond storage). Finer-grain renaming can also have benefits, with effectively more rename registers at smaller vector lengths.

I understand the desire to avoid this hardware cost, but there is a software ecosystem cost here too.

jnk0le commented 4 years ago

I believe that even in said granularized renaming, the zeroed blocks could be renamed to some kind of virtual zero register, giving a bit less PRF pressure.

As long as the agnostic behaviour is contained within an "if unsure, don't use it" configuration, we should be fine with it. I also agree that only 1 bit for the proposed vtail is sufficient.

billhuffman commented 4 years ago

I wonder if it would help the issue the BSC folks have if we allowed VLEN to be variable instead of fixed. I assume a WARL field in a register that in most designs would be hardwired to one value, but in some designs could hold several values. For codes with shorter vectors, a smaller VLEN could be programmed. This might not be reasonable unless the granularity of short vs. long vector use was fairly large.

kasanovic commented 4 years ago

Variable VLEN could easily be added as a feature at supervisor level without affecting unprivileged spec. Having as unprivileged feature is also possible, though I wonder how really useful given that it would be a global setting across whole program. More local VLEN setting should just be done via "vl".

billhuffman commented 4 years ago

Supervisor level seems too heavy-weight. The compiler knows when there are no live vector registers and can change VLEN then. Shorter vector registers work with the same binary code but reduce tail copying. Changing vl still leaves the tail copying to be done in hardware.

I'm only thinking of this for machines with very long VLEN. Most machines would support only one VLEN value.

hanna-kruppe commented 4 years ago

Compiler support for runtime-variable vector sizes is far from trivial, even if you make concessions about where the changes can occur (e.g., only on function entry/exit). It's not enough to know whether there are live vector registers; any live value anywhere (registers or memory) of any type (e.g. scalar integers computed from the vector size) can be a problem. It is far from trivial to adapt compiler IR(s) to keep track of & control these vector-size-dependent values, and it's dubious if it's worth the complexity and engineering effort. I developed a concrete proposal for how to achieve it in LLVM back when RISC-V V practically required it (spec versions before 0.6, IIRC) and it was still a very invasive proposal despite my best efforts to make it as acceptable as possible for the rest of the LLVM community.

These problems are also why LLVM explicitly does not support using the equivalent SVE feature in this way. Starting different processes with different vector sizes is fine, of course, but changing the vector size of a running process is not supported. (I don't know how much discussion about this happened in GCC but I am rather sure that the problems described before are just as hard there.)

Besides, I am skeptical how much making VLEN variable would help with the problem at hand. While large VLEN is unnecessary for some workloads (in particular, for loops with few iterations), this can't always be predicted at the time the code is written/compiled, and in other cases (e.g. when you need two or three iterations of strip-mining at maximum VLEN) it's not clear how good of a trade-off it is to increase dynamic instruction count just to reduce tail copying.

kasanovic commented 4 years ago

I agree with @hanna-kruppe's analysis, which is why I suggested to only really support at privileged levels where it is more of an emulation support mechanism rather than a performance optimization. The regular vl-setting mechanism should handle dynamic run time lengths.

billhuffman commented 4 years ago

In that case, I withdraw my "I wonder if..." about varying VLEN.

kasanovic commented 4 years ago

I believe the tail-agnostic design cannot actually help a renamed vector register implementation with long temporal vectors, because of security concerns (my comment above regarding security was incorrect).

The major optimization that motivates the tail-agnostic option is to avoid having to write the tail elements of the new physical destination vector register when not supporting sub-vector-register renaming. Another advantage is to avoid needing to read values from the old physical destination vector register.

However, we cannot allow implementations to simply not write to the tail of a new physical destination vector register allocated off the vector physical register free list. The new physical register can potentially contain data from some other context.

The security challenge is that the privileged layer is not able to clear this hidden microarchitectural state on a context swap in a deterministic non-microarchitecture dependent way. Having some way to explicitly clear the free physical register pool state is a possibility, but is architecturally messy. Note that regular whole vector register save/restore is sufficient to avoid this security hole for non-tail-agnostic machines.

I think this means we have to require tail-agnostic must be strictly either tail-undisturbed or tail-zeroed. But to make tail-zeroing efficient on long temporal vector registers requires the sub-vector-register renaming support anyway. Zeroing is actually worse than undisturbed in this case as all tail sub-vector-register units have to be renamed to point to single zero register, versus just left alone.

Zeroing does also avoid reading the old physical destination register values, but this is only an issue for the last active sub-vector as otherwise sub-vector renaming avoids the copy. Zeroing also provides a little more effective rename register capacity.

So, I think tail-agnostic does not actually really save much over tail-undisturbed, even for long temporal vector registers that are renamed, and might be worse if requiring tail-zeroing and so we should only support tail-undisturbed in the standard.

Providing a non-standard extension that allows state to leak between contexts would be an option.

David-Horner commented 4 years ago

TL;DR Suggest alternatives to avoid data/state leakage. Suggest idea of tail-avoidance and tail-disabled approaches for tail-agnostic. Suggest Not a Value (NaV) as also supported internal "value" for tail-agnostic support. Propose individual "valid-length" approach to manage tail-agnostic support using above ideas. Suggest if tail-agnostic is a non-standard extension support be provided. Consider #369.

@kasanovic

The major optimization that motivates the tail-agnostic option is to avoid having to write the tail elements of the new physical destination vector register when not supporting sub-vector-register renaming.

Avoiding the maintenance of sub-vector-register renaming. The win-win is to do both.

Another advantage is to avoid needing to read values from the old physical destination vector register.

Agreed. Also part of a win-win-win solution.

we cannot allow implementations to simply not write to the tail of a new physical destination vector register allocated off the vector physical register free list.

Agreed. Any tail allocated register component must be vetted or avoided.

The security challenge is that the privileged layer is not able to clear this hidden microarchitectural state on a context swap in a deterministic non-microarchitecture dependent way.

Agreed.

However, the microarchitecture could vet the vector physical register free list on each return from interrupt/exception. There may be many ways to do this that are not currently viable but become trivial with support for domain identification/tracking.

An approach: mark the elements in the list as unclean on a return from a more privileged mode, and mark them as clean either when the specific register allocation is certain to be fully overwritten by the vector operation (a common situation) or when explicitly "cleansed" (with zero or otherwise; this can occur during interrupt/exception return, even before the first vector instruction is scheduled).
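The clean/unclean bookkeeping described here can be sketched as a toy model. This is only an illustration under stated assumptions: the `FreeList` class, its method names and the first-in-first-out allocation order are all hypothetical, and a real design would track this state in hardware rather than in a set.

```python
class FreeList:
    """Toy model of a vector physical register free list with 'unclean' marks."""

    def __init__(self, pregs):
        self.free = list(pregs)  # physical registers currently unallocated
        self.unclean = set()     # free registers that may hold foreign data

    def on_return_from_privileged(self):
        # Everything on the free list may contain data from another context.
        self.unclean = set(self.free)

    def cleanse(self, preg):
        # Explicit cleansing (e.g. zero fill) removes the mark.
        self.unclean.discard(preg)

    def allocate(self, fully_overwritten):
        """Pop a register; report whether its tail must be cleansed first.

        A register that the vector operation fully overwrites becomes clean
        for free, so no extra work is reported in that case.
        """
        preg = self.free.pop(0)
        needs_cleanse = preg in self.unclean and not fully_overwritten
        self.unclean.discard(preg)
        return preg, needs_cleanse


fl = FreeList([4, 5, 6])
fl.on_return_from_privileged()
first = fl.allocate(fully_overwritten=True)    # full overwrite: no extra work
second = fl.allocate(fully_overwritten=False)  # partial write: tail needs cleansing
fl.cleanse(6)                                  # cleansed ahead of time
third = fl.allocate(fully_overwritten=False)
```

In this model only the second allocation pays a cleansing cost, which matches the expectation that full overwrites are the common case.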

A preferred method would be one that is inherent in normal operations, with no additional internal state for problem avoidance but primarily optimization use.

Note that regular whole vector register save/restore is sufficient to avoid this security hole for non-tail-agnostic machines.

Agreed.

I think this means we have to require tail-agnostic must be strictly either tail-undisturbed or tail-zeroed.

By "tail-zeroed" I believe you mean tail-fill with vetted data. I.e. Data from architectural registers. e.g. The tail fill value could be the last written element value, or derived from one or either of the source registers. Basically whatever is convenient/optimal for the micro-architecture.

Here's where I disagree. There are more options than those two. Specifically, tail-avoidance and tail-disabled are possible. And some of these can yield the win-win-win benefit sought above.

One idea is closely related to the variable VLEN suggestion. Architecturally visible VLEN changes are problematic. But within the micro-architecture, specifying a VLEN per register could establish a tail-management avoidance or disabled approach. This is especially valuable for the use case described above.

I here propose a possible micro-architecture implementation:

A "valid-length" (count of valid segments) defines the segments that are fully defined, segments beyond this are the agnostic tail. This is internal state information within the vector processor.

This is especially so if such tail elements are not considered elements at all. That is, not only could the value be anything, but it is also acceptable to consider the values “invalid”, i.e. Not a Value (NaV).

Access to tail segments could do any or all of the following:

1. increase a performance counter
2. trap and with extra exception information complete the operation, perhaps re-executing with modified vl. (consider vmax/min and agnostic data)
3. provide constant data (like zeros)
4. provide data from the last valid segment (reasonable if processing is always performed sequentially) This could be provided to the ALU, or directly provided to the
5. provide data from the first segment (which in some designs will always be valid) and is consistent with the scalar processing. This is appropriate when parallel operations such as reductions are in play.
6. for the duration of the instruction, use an effective vl of the min of vl and the 2 sources’ valid-lengths. For instructions running under tail-agnostic the valid-length is set based on this effective vl. When running under tail-undisturbed everything from the effective vl could be undisturbed which would also benefit non-rename implementations. This approach is appropriate to many operations (especially non-masked versions) in which the destination value is fully determined by the sources.
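The effective-vl idea in the last item can be sketched as follows. This is a minimal model, assuming all lengths are counted in the same unit (segments), and assuming that under tail-undisturbed the destination keeps whichever valid-length is larger, its old one or the effective vl; that second rule is an inference made for this sketch, not something stated above. The function names are hypothetical.

```python
def effective_vl(vl, src_valid_lengths):
    """Effective vl: the min of vl and the sources' valid-lengths."""
    return min([vl, *src_valid_lengths])


def execute(vl, src_valid_lengths, dest_valid_length, tail_agnostic):
    """Return (effective vl, new valid-length of the destination).

    tail_agnostic=True:  the destination's valid-length is set from the
                         effective vl; everything beyond it is agnostic tail.
    tail_agnostic=False: segments from the effective vl onward are left
                         undisturbed, so previously valid segments stay valid
                         (assumption made for this sketch).
    """
    evl = effective_vl(vl, src_valid_lengths)
    if tail_agnostic:
        new_valid = evl
    else:
        new_valid = max(evl, dest_valid_length)
    return evl, new_valid


# Two sources with valid-lengths 6 and 8, requested vl of 8.
agnostic = execute(8, [6, 8], dest_valid_length=8, tail_agnostic=True)
undisturbed = execute(8, [6, 8], dest_valid_length=8, tail_agnostic=False)
```

In the agnostic case the shorter source shrinks the destination's valid-length, so no tail segments need to be read or written at all.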

I realize NaV can introduce ambiguity and inconsistency, but if it can be tamed it could provide for meaningful optimizations. Special cases of xor/and/or register with itself are often identified and optimized so anomalous behaviour is avoided.
Many of the possibilities when consistently applied (especially 5) do not cause anomalies to arise. And a judicious definition of tail-agnostic is sufficient to allow option 6. Something along the lines of: Successive agnostic operations combine their field of agnostic behaviour. Further, NaV to some extent is a superposition that collapses with specific operations. E.g. an OS write back of a Whole Register Read will reset the valid-length of the register. As a result, an interrupt can cause specific values different from those without the interrupt, but they could still be consistent with the range of values allowed by tail-agnostic.

Providing a non-standard extension that allows state to leak between contexts would be an option.

To me it appears that viable, reasonable, performant and practical implementations of tail-agnostic implementation that do not leak state are possible.

As a result I believe we should continue to consider tail-agnostic in the standard.

However, providing a non-standard extension that DOES NOT allow state to leak between contexts could also be an option. And if so, #369 was presented for this purpose.

rofirrim commented 4 years ago

An effect of tail-undisturbed is that now vector operations have logically an extra operand that represent the values of the tail element. A compiler will have to assign this extra operand to the destination register of the vector operation (even to an "undefined" value when the code doesn't actually care about the tail).

Masked intrinsics in the compiler often have a "merge" (or "dest") operand that states what are the values of the inactive elements. It looks very similar to the tail-undisturbed situation.

The way I see it, however, is that the vector length represents the logical extent of the vector being processed. The mask does not represent such extent but a subset of it. As the inactive elements are still inside that logical extent, it makes sense to give them a value, hence the "merge" operand.

There is value in tail-undisturbed for algorithms that accumulate partial results on a register. I'm worried however, that this is just the only case where tail-undisturbed is actually needed and there are many other instances where the tail behaviour is not relevant. Being able to communicate this fact to the architecture seems beneficial.

That said I understand that mandating the possibility of zeroing can be a burden for smaller implementations. Maybe we can turn this into a, say, Zvzerotail extension of V-ext, that adds one bit in vtype to express the desired zeroing behaviour.

At the level of assembly it could look like this

# Tail-undisturbed (base, always valid)
vsetvli x1, x2, e64
vsetvli x1, x2, e64,m1

# Tail-zeroing (only valid under Zvzerotail)
vsetvli x1, x2, e64,m1,z
# Tail-undisturbed (only valid under Zvzerotail)
vsetvli x1, x2, e64,m1,u    # alias of `vsetvli x1, x2, e64,m1`

opalomar commented 4 years ago

A few comments from a HW perspective:

"But to make tail-zeroing efficient on long temporal vector registers requires the sub-vector-register renaming support anyway. "

In order to implement tail-zeroing, we consider that there are alternatives to sub-vector-register renaming (e.g. keep internally the vector length for each register, masks, ...) that will work well in architectures with long temporal vectors.

The issue with the current tail-undisturbed scheme in architectures with register renaming is that the overhead is large for long temporal vectors, since it requires copying values. This may be alleviated by the "granularised renaming" proposed earlier. It limits the amount of values to copy, and can help increase the number of rename registers. However, it has significant complexity. It requires support in the rename, issue and commit logic, and in the logic reading from the register file (for example, an instruction may not start reading from the first "granule", if VSTART is larger than the granule size). It also has area overhead in the renaming logic structures.
Tail-undisturbed creates additional dependences, limiting concurrency. For example, in the sequence Load V0, Add v1<-v1,v0, Load V0, the second load artificially depends on the first one. This will happen, for example, in a reduction loop. It prevents the second load from executing in parallel with the first one.
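The artificial dependence in that sequence can be made explicit with a small model. This is a sketch under a stated assumption: in a renamed implementation, a tail-undisturbed write must read the old destination register to preserve its tail, so the destination counts as an extra source. The tuple encoding and function names are hypothetical.

```python
def source_regs(instr, tail_undisturbed):
    """Registers an instruction must read, given as (dest, [sources])."""
    dest, srcs = instr
    reads = set(srcs)
    if tail_undisturbed:
        # The old destination value is needed to carry the tail over
        # into the newly allocated physical register.
        reads.add(dest)
    return reads


def depends_on(later, earlier, tail_undisturbed):
    """True if `later` reads the register that `earlier` writes."""
    return earlier[0] in source_regs(later, tail_undisturbed)


load_a = ("v0", [])            # Load V0
add = ("v1", ["v1", "v0"])     # Add v1 <- v1, v0
load_b = ("v0", [])            # Load V0 (next iteration)

tu_dep = depends_on(load_b, load_a, tail_undisturbed=True)
ta_dep = depends_on(load_b, load_a, tail_undisturbed=False)
```

Under these assumptions the second load only depends on the first when the tail must be preserved; the add depends on the first load in both cases, as it should.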

David-Horner commented 4 years ago

Adding tail-zeroing leads to fragmentation and overburdening hardware implementations. Adding tail-agnostic does not burden hardware implementations. They can continue to use tail-undisturbed for tail-agnostic situations.

@rofirrim I agree that there is value for the compiler to only track tail undisturbed when useful. Specifically, as you said to avoid tracking

even to an "undefined" value when the code doesn't actually care about the tail.

I do expect tail undisturbed to be more generally useful than just for algorithms that accumulate partial results on a register. Programmers and compiler writers continue to be extremely creative in using the target ISA's unique characteristics and fringe cases.

However, I want to emphasize once again that the options need not be tail-zeroing and tail-undisturbed. Instead of tail-zero, tail-agnostic does exactly what you want: allows the compiler to "not care" and not need to track tail contents.

So the vsetvli constructs would be:

# Tail-undisturbed always valid to use, but use when undisturbed is intended
vsetvli x1, x2, e64,m1,tu    # alias of `vsetvli x1, x2, e64,m1`
# Tail-agnostic also always valid to use, but use when tail values are not useful
vsetvli x1, x2, e64,m1,ta

I propose the base includes the extra bit in vtype and it be set with ta/tu even if the hardware is always tu.

jnk0le commented 4 years ago

One more agnostic approach is to zero the tail in last sub-vector and undisturb the rest of the register. It benefits in InO as well as sub renamed OoO but bugs will be even harder to port though.

David-Horner commented 4 years ago

I believe it should be considered the same agnostic approach. That is, each byte in the tail (or masked) is either the undisturbed destination byte or the designated fill byte (currently proposed to be x'FF'). We can expect that fill bytes will be in SEW bit groups on SEW boundaries. This meets a recommended criterion for agnostic: that only two states need to be checked for validation. Either undisturbed or designated value.

Addendum: 1) checking becomes more problematic if the granularity is less than a byte (not horribly, but significantly). 2) byte granularity allows better poisoning of the agnostic values so that accidental dependence upon -1 values or unchanged values is detected.
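The two-state validation rule can be sketched as a checker. This is a minimal illustration, assuming the fill byte is 0xFF as proposed above and that fill is applied in whole SEW-sized groups on SEW boundaries (the stricter of the readings given); `agnostic_result_ok` and its parameters are hypothetical names.

```python
FILL = 0xFF  # designated fill byte, as currently proposed in this thread


def agnostic_result_ok(old, new, vl, sew_bytes):
    """Check the tail of a destination register after an agnostic operation.

    old, new   -- register contents (bytes) before and after the operation
    vl         -- number of active elements; elements from vl onward are tail
    sew_bytes  -- element width in bytes

    Each tail element must be either undisturbed or entirely fill bytes.
    """
    if len(old) != len(new) or len(old) % sew_bytes != 0:
        return False
    fill_elem = bytes([FILL]) * sew_bytes
    for start in range(vl * sew_bytes, len(old), sew_bytes):
        elem_old = old[start:start + sew_bytes]
        elem_new = new[start:start + sew_bytes]
        if elem_new != elem_old and elem_new != fill_elem:
            return False
    return True


old = bytes(range(8))  # 4 elements of 2 bytes each
mixed = bytes([9, 9, 9, 9, 4, 5, 0xFF, 0xFF])    # one tail element kept, one filled
partial = bytes([9, 9, 9, 9, 4, 0xFF, 6, 7])     # fill not on a SEW boundary
zeroed = bytes([9, 9, 9, 9, 0, 0, 6, 7])         # zeroing is not one of the two states
```

A mix of undisturbed and filled tail elements passes, which covers the "zero (or fill) only the last sub-vector" variant, while a partially filled element or a zeroed one is rejected.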