riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Performance implications of zeroing past vl #157

Closed solomatnikov closed 5 years ago

solomatnikov commented 5 years ago

Zeroing past vl implies that a vector instruction takes the same number of cycles as in the case of vl==VLMAX if the vector microarchitecture is limited by the write-port bandwidth of the vector register file.

This can be especially bad if vector code is written with LMUL==8 but used for relatively short vectors. For instance, the saxpy example uses LMUL==8; with VLEN==512, 4 lanes, and 32-bit lanes/elements, every vector instruction would take 4*8 cycles because of the write-port bottleneck, even when vl==16.
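
The arithmetic behind that claim can be sketched with a toy helper (hypothetical names, not from the spec):

```python
# Toy model of the write-port bottleneck: with tail zeroing, every
# instruction must write the full LMUL-wide register group regardless of vl.
def beats_with_tail_zeroing(vlen_bits, lmul, sew_bits, lanes):
    vlmax = (vlen_bits // sew_bits) * lmul  # elements in the register group
    return vlmax // lanes                   # each beat writes `lanes` elements

# VLEN==512, LMUL==8, 32-bit elements, 4 lanes: 128 elements must be
# written, so 4*8 == 32 beats, even when vl==16.
print(beats_with_tail_zeroing(512, 8, 32, 4))  # -> 32
```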

aswaterman commented 5 years ago

There's a note that addresses this issue:

"For zeroing tail updates, implementations with temporally long vector registers, either with or without register renaming, will be motivated to add microarchitectural state to avoid actually writing zeros to all tail elements, but this is a relatively simple microarchitectural optimization. For example, one bit per element group or a quantized VL can be used to track the extent of zeroing. An element group is the set of elements comprising the smallest atomic unit of execution in the microarchitecture (often equivalent to the width of the physical datapath in the machine). The microarchitectural state for an element group indicates that zero should be returned for the element group on a read, and that zero should be substituted in for any masked-off elements in the group on the first write to that element group (after which the element group zero bit can be cleared)."

It's rather annoying, but it's not expensive.

ccelio commented 5 years ago

I would expect nearly all microarchitectures to use some sort of is_zero register file that is not limited to the same bandwidth requirements as the vector register file.

solomatnikov commented 5 years ago

Why not leave the elements past vl in an undefined state, just like the whole vector register after reset? If SW erroneously uses such elements, it would get a wrong result anyway.

This should work for implementations with and without register renaming.

aswaterman commented 5 years ago

I think that’s a defensible position. Presumably the reason the TG came to this design was to avoid the additional implementation-defined behavior that would be subtly exposed by buggy software.

solomatnikov commented 5 years ago

I think the spec should be changed to allow undefined state for elements past vl. Of course, zero or previous state would be allowed too.

Of course, one bit per element group to track the extent of zeroing can be implemented, but I think the overhead could be non-trivial because of the fan-out of such flops.

For example, in a simplistic implementation of a 512-bit-wide datapath with a 128-entry vector register file, there will be a 128-to-1 mux with a fanout of 512 across the whole datapath. The same logic has to be replicated for each read port of the vector register file (four ports minimum for a reasonable design). This can be challenging for physical design.

Of course, one can do a different implementation with the flops replicated per lane, e.g. per 64-bit lane. Then the number of flops would be 8*128 with a fanout of 64, and there is still quite a bit of wiring across each lane.

I don't think it's worth doing without clear compelling reason.

jnk0le commented 5 years ago

Implementation-defined behaviour usually means security holes.

solomatnikov commented 5 years ago

Abstract generalizations like this do not make an argument. Modern processors have many parts with implementation-defined behavior and a lot more state, e.g. data caches and branch predictors, that can be security holes. Yet no sane processor designer would get rid of data caches and branch predictors because the resulting design would not be competitive.

In this case, prevention of data leaks/security holes is simple: on a context switch, SW has to clear all registers anyway to prevent information leaks; zeroing past vl does not help with or eliminate this. How would not zeroing past vl be a security hole?

aswaterman commented 5 years ago

I agree with @solomatnikov that this can be defined in a way that does not open a security hole: e.g., the unpredictable state must be a deterministic function of the architectural state that's visible to the executing privilege mode. This would permit both preserve-past-VL and zero-past-VL without permitting architecturally visible leakage from a different security domain.

I talked to @kasanovic about this today, and he said the TG's principal concern was software inadvertently relying on the implementation's behavior. In particular, zero-past-VL and preserve-past-VL are both useful behaviors in some situations, and it's easy to imagine a software developer accidentally relying on whichever one the development machine provides. So we could end up in a situation where software runs only under one discipline or the other, risking the adoption of a de facto standard.

kasanovic commented 5 years ago

The data path fanout is not that bad. In a design with static-logic read ports from flops, the read port is just not enabled from any row, so the OR-tree produces zero. The gating can be done on the read-port address, not on the data.

kasanovic commented 5 years ago

We have discussed previously, but not specified in this version of spec, a way to "disable" the vector unit when not in use to save context-switch overhead (or even to enable power gating). This would reuse zeroing logic to clear state.

jnk0le commented 5 years ago

> In this case, prevention of data leaks/security holes is simple: on a context switch, SW has to clear all registers anyway to prevent information leaks; zeroing past vl does not help with or eliminate this. How would not zeroing past vl be a security hole?

There is a possibility of leaks within the thread context (some kind of use-after-free) that can be elevated by software written and "debugged" on renamed architectures, as pointed out by Andrew. If that's not enough, we can ultimately exploit vector-capable JavaScript JIT compilers. I think that this approach is valid, but we need to be careful before such software is written/compiled.

vbx-glemieux commented 5 years ago

Both "write tail 0s" and "leave tail undefined" can have negative implications on storage efficiency of the vector register file, and may negatively impact performance.

Sometimes you build up a solution in smaller "chunks" and slide these chunks into a longer vector result. This is especially true if you wish to hold 2D data within a 1D register and compute one new row at a time.

If the rule is "leave tail as-is", then you can do the following:

1. set long vector length (eg, array size)
2. slideup Vx by the amount of the short vector length, to make room near the 0th element (**)
3. set short vector length (eg, row size)
4. write a partial result to Vx (again near the 0th element)

This can all be done within the same register, so there are RAW hazards but it is very storage-efficient.

Without this rule, the slide operation will write 0s or leave the tail undefined. Hence, an additional register is needed to hold and restore that data after setting a longer vector length. The sequence becomes:

1. set short vector length
2. write result to Vx
3. set long vector length
4. slideup Vy by the amount of the short vector length into Vx
5. copy Vx into Vy

The last step, a copy, also makes the sequence slower. It can be avoided if you unroll the loop and ping-pong between Vx and Vy as destinations in (**), but this results in code bloat.

So, the "write tail 0s" and "leave tail undefined" rules both result in use of an additional register (Vy), and may involve extra data copying (Vx into Vy).
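
The storage-efficient tail-as-is sequence can be modeled in plain Python (a toy stand-in for vslideup plus a short-vl write; names and lengths are illustrative):

```python
def slideup_in_place(v, offset, vl):
    """Model of vslideup with tail-as-is semantics: for i in [offset, vl),
    v[i] gets the old v[i - offset]; elements below offset (and past vl)
    are left as they were."""
    old = v[:]
    for i in range(offset, vl):
        v[i] = old[i - offset]

ROW, ARRAY = 2, 6            # short (row) and long (array) vector lengths
vx = [0] * ARRAY             # a single register accumulates the 2D result
for row in ([1, 2], [3, 4], [5, 6]):
    slideup_in_place(vx, ROW, ARRAY)  # long vl: make room near element 0
    vx[0:ROW] = row                   # short vl: write the partial result
print(vx)  # -> [5, 6, 3, 4, 1, 2]
```

Each iteration reuses the same register, so no second register or copy is needed; the earlier rows survive only because the slide leaves untouched elements in place.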

I realize these two rules are nice because they make OOO implementations simpler. I wonder if there's a different way to do register renaming of vector data that alleviates this issue; the assumption has been that the entire named vector register is reassigned, e.g. a blind extension of scalar register renaming. For example, one could rename the individual elements instead, which would allow older data to stay in place and make "leave tail as-is" possible without a data copy. Perhaps the overhead of this can be kept in check, e.g. by doing it on an SLEN basis instead of an element-wise basis? This should allow the best of both worlds.

Guy


solomatnikov commented 5 years ago

> The data path fanout is not that bad. In a design with static-logic read ports from flops, the read port is just not enabled from any row, so the OR-tree produces zero. The gating can be done on the read-port address, not on the data.

This is not true in general, i.e. if the vector register file is generated by a compiler.

Also, even for a flop-based register file it is better to hold the previous value on the output of the read port when the read port is disabled, to minimize switching activity and power. I think the common case is that a read port is used in ~50% of cycles. Forcing the output to zero would double the switching activity.

kasanovic commented 5 years ago

On May 3, 2019, at 10:05 AM, Alex Solomatnikov notifications@github.com wrote:

> The data path fanout is not that bad. In a design with static-logic read ports from flops, the read port is just not enabled from any row, so the OR-tree produces zero. The gating can be done on the read-port address, not on the data.
>
> This is not true in general, i.e. if the vector register file is generated by a compiler.

In this case there is much less fanout to worry about, especially in terms of area.

> Also, even for a flop-based register file it is better to hold the previous value on the output of the read port when the read port is disabled, to minimize switching activity and power. I think the common case is that a read port is used in ~50% of cycles. Forcing the output to zero would double the switching activity.

This is a very pessimistic assumption in terms of switching activity. Most bits are zeros.

Krste

solomatnikov commented 5 years ago

> This is a very pessimistic assumption in terms of switching activity. Most bits are zeros.

Is it true for floating point values?


billhuffman commented 5 years ago

From my point of view, we need a rule for tail elements or we'll have software incompatibilities. "Leave as it was" is very bad for renamed designs, which leads to "zero." Then the question is whether "zero" is bad for any hardware designs.

As an alternative to Krste's OR trees for read ports, I would suggest that reading the "zero file" a cycle earlier than the vector would remove the fanout difficulties that might otherwise exist. That would also allow for easy zeroing in the last mux stage of the read port, which would avoid the switching activity Alex has mentioned.

Seems to me "zero" is the better answer and I think there are reasonable hardware structures to accomplish it.

 Bill

solomatnikov commented 5 years ago

> From my point of view, we need a rule for tail elements or we'll have software incompatibilities. "Leave as it was" is very bad for renamed designs, which leads to "zero." Then the question is whether "zero" is bad for any hardware designs.

Yes, zeroing past vl adds a lot of complexity to simple implementations, which will be the majority, at least initially.

Tracking dependencies for RAW, WAW, and chaining becomes significantly more complicated because a single beat can write a variable number of elements. And these are required for good/competitive performance.

For example, a typical vector implementation can have separate memory and arithmetic pipelines with 4 lanes and VLMAX==16. The arithmetic pipeline executes an FMA with vl==16 while the memory pipeline executes a vector load with vl==7, writing the same vector register. The last beat of the vector load writes 12 elements, so the WAW check/stall logic becomes more complicated.
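
The variable-beat effect in that example can be sketched with a small assumed helper (not from any real implementation):

```python
# With tail zeroing, a load with short vl must still fill the register up
# to VLMAX, so the final beat covers a variable number of elements and
# WAW/chaining logic can no longer assume a fixed `lanes` elements per beat.
def elements_per_beat(vl, vlmax, lanes):
    beats, done = [], 0
    while done + lanes < vl:          # full beats of loaded data
        beats.append(lanes)
        done += lanes
    beats.append(vlmax - done)        # last beat: remaining data plus zeroed tail
    return beats

print(elements_per_beat(7, 16, 4))    # -> [4, 12]: last beat covers 12 elements
print(elements_per_beat(16, 16, 4))   # -> [4, 4, 4, 4]: fixed-size beats
```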

Segment vector loads and stores make it even more complicated, because segment memory ops write or read multiple vector registers (up to 8). And segment vector loads and stores are necessary to achieve good performance for many kernels/applications.

Lots of extra complexity without clear benefit.

> As an alternative to Krste's OR trees for read ports,

What @kasanovic suggested does not help with timing or fanout; it actually makes it worse.

> I would suggest that reading the "zero file" a cycle earlier than the vector would remove the fanout difficulties that might otherwise exist. That would also allow for easy zeroing in the last mux stage of the read port, which would avoid the switching activity Alex has mentioned.

This would complicate the dependency and chaining logic even more, because reading the "zero file" a cycle earlier requires a lot of special cases in the logic. Is the "zero file" also written a cycle earlier? Or must an extra stall cycle be added? Or a special bypass?

> Seems to me "zero" is the better answer and I think there are reasonable hardware structures to accomplish it.

> Bill

billhuffman commented 5 years ago

Hi Alex,

In the suggested situation, I would expect to see 128 "zero bits" and 16,384 bits of register file data (assuming the FMA is single-precision). Given that ratio of sizes, and therefore the latencies from gates, control fanout, and wires, I'm not understanding how timing becomes an issue for the "zero bits." Given that each zero bit simply forces its 128 bits of regfile output to zero on read (or doesn't), and that four bits get set as a function of opcode and vl on a write, I'm not understanding the complexity issue either. There are additional things that happen with LMUL and segment memory operations, but these happen first to the register file data, and the zero bits follow.
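
For reference, the arithmetic behind those figures (assuming 32 architectural vector registers and a VLEN of 512 bits, which Bill's example does not state explicitly):

```python
NUM_VREGS  = 32    # architectural vector registers (assumed)
VLEN_BITS  = 512   # bits per vector register (assumed)
GROUP_BITS = 128   # bits per element group, i.e. per zero bit

regfile_bits  = NUM_VREGS * VLEN_BITS       # 16,384 bits of register file data
zero_bits     = regfile_bits // GROUP_BITS  # 128 zero bits
bits_per_vreg = VLEN_BITS // GROUP_BITS     # 4 zero bits set per full-register write

print(regfile_bits, zero_bits, bits_per_vreg)  # -> 16384 128 4
```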

  Bill


solomatnikov commented 5 years ago

Ping @kasanovic

HanKuanChen commented 5 years ago

I think "leave tail as it was" is more convenient and makes sense for software developers.

Take dot_prod as an example,

float32_t dot_prod(const float32_t *src1, const float32_t *src2, uint32_t len);

If the rule is "leave tail as it was", then using vfmacc.vv in the loop and doing vfredosum.vs at the end is intuitive.

    vsetvli x0, x0, e32, m8
    vmv.v.i v16, 0
    vmv.s.x v24, 0
loop:
    beqz len, end
    vsetvli new_vl, len, e32, m8
    vlw.v v0, (src1)
    vlw.v v8, (src2)
    vfmacc.vv v16, v0, v8
    sub len, len, new_vl
    slli mem_offset, new_vl, 2
    add src1, src1, mem_offset
    add src2, src2, mem_offset
    j loop
end:
    vsetvli x0, x0, e32, m8
    vfredosum.vs v24, v16, v24
    # get result in v24[0]

However, if the rule is "leave tail as 0", then using vfredosum.vs in the loop reduces performance.

    vsetvli x0, x0, e32, m8
    vmv.v.i v16, 0
    vmv.s.x v24, 0
loop:
    beqz len, end
    vsetvli new_vl, len, e32, m8
    vlw.v v0, (src1)
    vlw.v v8, (src2)
    vfmul.vv v16, v0, v8
    vfredosum.vs v24, v16, v24
    sub len, len, new_vl
    slli mem_offset, new_vl, 2
    add src1, src1, mem_offset
    add src2, src2, mem_offset
    j loop
end:
    # get result in v24[0]
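
Both loops compute the same scalar reduction; a plain Python reference (not part of the original thread) is:

```python
def dot_prod(src1, src2):
    """Scalar reference for the vectorized dot product above."""
    acc = 0.0
    for a, b in zip(src1, src2):
        acc += a * b
    return acc

print(dot_prod([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # -> 32.0
```
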
kasanovic commented 5 years ago

I agree tail-undisturbed makes this easier, but there is a better loop for the tail-zeroing case:

For the zeroing version:

    vsetvli t0, a2, e32, m8    # t0 has max strip length encountered in loop
    vmv.v.i v24, 0             # Zero accumulator
    mv t1, t0                  # Copy to t1, current strip length
loop:
    vlw.v v0, (a0)
    slli a5, t1, 2             # Get byte offset
    add a0, a0, a5             # Bump source pointer
    vlw.v v8, (a1)
    add a1, a1, a5             # Bump source pointer
    vfmul.vv v16, v0, v8       # Tail elements zeroed
    vsetvli x0, t0, e32, m8    # Reset to max vl encountered in loop
    sub a2, a2, t1             # Subtract elements done
    vfadd.vv v24, v24, v16     # Accumulate in vector of partial sums
    vsetvli t1, a2, e32, m8    # Set vl for next strip
    bnez a2, loop
end:
    vmv.s.x v0, x0             # Clear scalar accumulator
    vsetvli x0, t0, e32, m8    # Set number of partial accumulators to reduce
    vfredsum.vs v0, v24, v0    # Reduce
    vfmv.f.s fa0, v0           # Return result
    ret

Given that the vectorized partial summation is not ordered, there's no reason to use the slower ordered reduction at the end.

Krste


kasanovic commented 5 years ago

Decided to go with tail elements undisturbed in 0.8