riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension
https://jira.riscv.org/browse/RVG-122
Creative Commons Attribution 4.0 International

Remove byte, half and word vector loads and stores (vlb.v etc.) #362

Closed solomatnikov closed 4 years ago

solomatnikov commented 4 years ago

Vector loads and stores at the element width (SEW) should be sufficient for most use cases and should have higher performance, i.e., if wider register-file ports are not needed for vector loads and stores.

For example, if two byte vectors need to be added to produce a vector of 16-bit elements, it would be faster to use two vle.v instructions and a widening add, as sketched below.
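
A minimal sketch of that pattern, assuming the later ratified-spec mnemonics (vle8.v for a SEW=8 load, vwaddu.vv for an unsigned widening add); register and address choices are illustrative:

vsetvli t0, a0, e8, m1       # a0 = element count, SEW=8 for the loads
vle8.v  v1, (a1)             # first byte vector, base address in a1
vle8.v  v2, (a2)             # second byte vector, base address in a2
vwaddu.vv v4, v1, v2         # widening add: 8-bit + 8-bit -> 16-bit, result in v4-v5
vsetvli t0, a0, e16, m2      # switch to SEW=16 (register group of 2) for the store
vse16.v v4, (a3)             # store the 16-bit sums, base address in a3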

Of course, there are corner cases in which widening loads are useful, but it's not a good idea to complicate the ISA unless there are compelling common-case uses.

It's not clear whether there is a use case for truncating vector stores; it makes more sense to do narrowing or saturating operations instead.

aswaterman commented 4 years ago

This change would simplify the memory-system datapath I'm working on, and it would also free up a bunch of opcode space.

vbx-glemieux commented 4 years ago

I fully support this simplification. It also gets rid of signed/unsigned variants. I made the same recommendation over two years ago at a RISC-V meeting.

There is a point to avoiding redundant capabilities, i.e., both the compute datapath and the load/store datapath having the same ability to change element sizes and perform sign extension or truncation. Unfortunately, the compute datapath has very few instructions that can actually use this additional hardware; with the freed opcode space, perhaps more element-resizing compute-instruction variants could be added (which should not increase datapath area significantly).

New "widening move" instructions (2SEW = SEW, 4SEW=SEW) which can be macro-op-fused with a preceding compute instruction would allow better orthogonality. (A widening add here wouldn't work, since fusing is mch easier when you don't need to read out an additional source operand or need to do an addition.)

Guy

David-Horner commented 4 years ago

I agree with Guy that widening (and narrowing) move instructions are extremely valuable to replace the functionality lost by removing the byte, half, and word vector loads and stores.

David-Horner commented 4 years ago

Discussion so far:

From minutes 20200129:

Proposal: to drop the fixed-size vector load/store instructions, leaving only the SEW-sized load/store instructions. Dropping these instructions would save considerable complexity in memory pipelines. However, dropping support would also require execution of additional instructions in some common cases. A remedy would be to add more widening or quad-widening (quadening) compute instructions to reduce this impact.

The current design uses constant SEW/LMUL ratios to align data types of different element widths. If only SEW-sized load/stores were available, then a computation using a mixture of element widths would have to use larger LMUL for larger SEW values, which effectively reduces the number of available registers and so increases register pressure. The fixed-width load/stores allow, e.g., a byte to be loaded into a vector register with four-byte element width at LMUL=1, which avoids this issue.
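
To make the register-pressure point concrete, a sketch under the then-current rules (integer LMUL only), keeping SEW/LMUL constant at 8; register choices are illustrative:

vsetvli t0, a0, e8, m1       # bytes: SEW/LMUL = 8/1
vle8.v  v0, (a1)             # bytes occupy one register
vsetvli t0, a0, e32, m4      # same ratio: 32/4
# every 32-bit operand now occupies a 4-register group (v4-v7, v8-v11, ...),
# leaving only 8 such groups in the whole register file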

From minutes 20200129:

Discussion covered the new proposal to provide a fractional LMUL to avoid one drawback of dropping these memory operations. The general consensus was in favor, as the scheme also aids mixed-width floating-point arithmetic (for which there is no equivalent to the widening/narrowing load/stores). The proposal requires adding a new bit to LMUL in vtype, so that a new instruction will not be required.
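
For comparison, a sketch of the fractional-LMUL remedy, written with the mf* assembler syntax that later appeared in the ratified spec:

vsetvli t0, a0, e8, mf4      # bytes at LMUL=1/4: SEW/LMUL = 8/(1/4) = 32
vle8.v  v0, (a1)             # bytes occupy a quarter of one register
vsetvli t0, a0, e32, m1      # same ratio: 32/1 = 32
# 32-bit operands now fit in single registers, keeping all 32 registers usable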

vbx-glemieux commented 4 years ago

I think we need to add a related topic to this discussion.

Currently, the only way to change data layout in the register file is by using back-to-back vload/vstore instructions. That is, if you wish to change the SEW/LMUL ratio, the previous data needs to be rearranged to match the new ratio. (At least, I can't figure out a way to do this without going through the memory system.)

I think we need a new vector move instruction that is like a typecast, converting from one vtype to another. When the SEW/LMUL ratio changes, this is a data move. When the ratio is the same, it is a NOP. The destination vtype would be the current CSR vtype; the source vtype could be specified in the same way as vsetvl{i}.

How is this related to the removal of byte/half/word loads/stores? Because back-to-back vload/vstores can also be used to change element sizes, which the new vector move would also support (e.g., zero/sign extending or truncating); a sketch follows.
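
A sketch of the two approaches, with vl handling simplified; the vmv.cast mnemonic and its source-vtype operand are purely hypothetical, invented here for illustration:

# today: relayout via a memory round-trip (a4 = scratch buffer)
vsetvli t0, a0, e8, m2       # old SEW/LMUL ratio
vse8.v  v0, (a4)
vsetvli t0, a0, e8, m1       # new SEW/LMUL ratio
vle8.v  v8, (a4)

# proposed: one in-register typecast; source vtype given like vsetvli operands
# vmv.cast v8, v0, e8, m2    # hypothetical; dest uses the current vtype CSR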

Guy

David-Horner commented 4 years ago

I was struggling with a related (and perhaps the same) issue: splitting one LMUL register group into two groups with the same SEW but half the original LMUL.

This can be done with a vstore of the full register group followed by two vloads from consecutive memory locations at half the LMUL, all with the same SEW, as sketched below.
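
A sketch of that sequence, assuming SEW=16, an original LMUL=2 group in v0-v1, a4 pointing at scratch memory, and an even vl in a0:

vsetvli t0, a0, e16, m2      # original group: v0-v1 at LMUL=2
vse16.v v0, (a4)             # spill the whole group
srli    t1, t0, 1            # half the element count
vsetvli x0, t1, e16, m1
vle16.v v8, (a4)             # first half at LMUL=1
slli    t2, t1, 1            # byte offset of second half (t1 elements * 2 bytes)
add     t2, a4, t2
vle16.v v9, (t2)             # second half at LMUL=1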

However, I was wanting to avoid the memory references.

So what I came up with is a sequence of vshifts and vors. It's ugly. If there is something better that is efficient, great. If not, specific hardware assist sounds useful for the case I was considering.

I think amalgamating two register groups into a single group at twice the LMUL might be simpler, but I haven't worked through that yet.

vbx-glemieux commented 4 years ago

David, not only is code that avoids vload/vstore ugly, I believe it is also very difficult to generalize across all possible SLEN values. My proposal allows creation of portable software that does not depend upon SLEN settings and does not require use of the cache or memory system.

Adding a vmove instruction that does a typecast/data conversion comes mostly for free.

If actually doing a move (source/dest differ), you can just use the provided vtype to control the address generator/control logic for the associated register read port. That hardware already exists; you are merely activating it, and the data gets rewritten to the correct destination in the correct format.

If the data doesn't move (source/dest the same), you may be able to attach the vtype property to the register/register group so it gets activated every time that register/register group is read, up until the point it is redefined. This would require new dedicated logic, but only for high-performance / low-power implementations.

There may be increased cost when width conversion and sign/zero extension or truncation are done at the same time. I think most of this hardware already exists in the data path for widening and narrowing instructions, but it also needs to be activated. I haven’t thought this part through as thoroughly.

Guy

vbx-glemieux commented 4 years ago

Actually, I take back the "mostly free" statement. I don't think it's true: the data needs to move across lanes, and since no other operation changes the SEW/LMUL ratio this way, the hardware would have to be newly added.

I was mistakenly thinking only about data size conversions with constant SEW/LMUL.

Guy

vbx-glemieux commented 4 years ago

This might work using vrgather, but the permutation pattern needs to be defined, and it gets more complicated when register groups are used; see the sketch below.
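
For the simplest case (LMUL=1, and assuming SLEN=VLEN so there is no striping), a sketch of the vrgather approach; the permutation in v8 is left abstract because it depends on the desired relayout:

vsetvli t0, a0, e16, m1
vid.v       v8               # v8[i] = i
# ...rewrite v8 here with the layout-dependent permutation...
vrgather.vv v4, v0, v8       # v4[i] = v0[v8[i]]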

Should this be split out as a separate (but related) issue on GitHub?

Guy

David-Horner commented 4 years ago

I believe a number of ideas are converging toward a much superior combined design, motivated by the removal of fixed-width load/stores. Fractional LMUL and the LMUL calculation extension I proposed in #376 are two examples; enhanced storage/buffering is another.

The thread you started here is relevant to this discussion, and as you mention it is relevant to other tasks/considerations as well. However, I don't know how to encapsulate the issues; there are dimensions I haven't worked through yet. The new issue could just be the proposed SEW/LMUL transform instruction you first mentioned; or it could be a discussion of tradeoffs for multiple-instruction transforms; or of the differing nature of LMUL > 1 and fractional LMUL. There's also the suggestion that more (almost all?) operations become widening or narrowing or both. Does separating issues out help derive the best overall plan, or does it impede the realization of interacting aspects? Premature optimization versus the gestalt?

If you do make a separate issue I will contribute there.

I added #382 to discuss the third item above: the differing nature of LMUL > 1 and fractional LMUL.

bobdreyer commented 4 years ago

I offered to share some real mixed integer and floating-point code that would be impacted by the loss of these instructions. Below are two versions of the inner loop: one where byte loads are used, and another where only SEW-sized loads are available. The additional instructions in the inner and enclosing loops contribute negligible incremental overhead. (The loop has been coded with GNU extended asm.)

With vlb, the relevant code looks like:

asm volatile ("vlb.v v0, (%[addr])\n\t" :: [addr] "r" (vec[index]) : );
asm volatile ("vadd.vx v0, v0, %[offset]\n\t" :: [offset] "r" (u) );
asm volatile ("vfcvt.f.x.v v0, v0\n\t");
asm volatile ("vfmacc.vf v8, %[scale], v0\n\t" :: [scale] "f" (v));

Without vlb, the code looks like this:

asm volatile ("vsetvli x0, %[vec_len], e8, m2\n\t" : : [vec_len] "r" (VEC_LEN) : );
asm volatile ("vle.v v0, (%[addr])\n\t" :: [addr] "r" (vec[index]) : );
asm volatile ("vwcvt.x.x.v v4, v0\n\t");
asm volatile ("vsetvli x0, %[vec_len], e16, m4\n\t" : : [vec_len] "r" (VEC_LEN) : );
asm volatile ("vwcvt.x.x.v v16, v4\n\t");
asm volatile ("vsetvli x0, %[vec_len], e32, m8\n\t" : : [vec_len] "r" (VEC_LEN) : );
asm volatile ("vadd.vx v16, v16, %[offset]\n\t" :: [offset] "r" (u) );
asm volatile ("vfcvt.f.x.v v16, v16\n\t");
asm volatile ("vfmacc.vf v24, %[scale], v16\n\t" :: [scale] "f" (v));

The inner loop goes from about 10 instructions to about 15.

solomatnikov commented 4 years ago

Repeating from the mailing list: the important thing is the number of cycles it takes to execute some code sequence on the vector HW, not the number of instructions.

Even in a simple single-issue vector core it is often possible to execute vsetvli with zero impact on the total number of cycles (I am talking about simulation of synthesizable RTL). For a dual- or multi-issue core it is even more likely.

Let's assume there are 2 vector pipelines: memory and arithmetic. They can operate independently and can be chained with some cycle overhead.

Let's assume vle8.v occupies the memory pipeline for N cycles (where N >= 8 for large VL/LMUL), chaining overhead is 4 cycles, and vlb.v with extension occupies the memory pipeline for 4N cycles.

Original code with vlb.v with extension:

vlb.v     4N
vadd      4
vfcvt     4N
vfmacc    4N   

Total 12N + 4

With vle8.v (@kasanovic proposal) and quad widening vector add:

vle8.v    N
vsetvli   0  (b/c the next instruction cannot be started immediately)
vwadd     4
vsetvli   0  (b/c arithmetic pipeline is still occupied)
vfcvt     4N
vfmacc    4N   

Total 9N + 4

solomatnikov commented 4 years ago

Of course, the above cycle calculations are for a single iteration of the loop.

Loop(s) with software pipelining can have lower cycle cost, e.g., reducing the cost of chaining to zero.