A less wasteful approach to separate mask regfile

riscv / riscv-v-spec

Working draft of the proposed RISC-V V vector extension

https://jira.riscv.org/browse/RVG-122

Creative Commons Attribution 4.0 International


A less wasteful approach to separate mask regfile #617

Open jnk0le opened 3 years ago

jnk0le commented 3 years ago

There is a recently raised discussion about separating out the mask regfile, together with mask load/store changes. The main argument against it is wasted resources: each mask register has to support SEW=8 at LMUL=8, and that cost is multiplied by 8 or 16 mask registers.

e.g.

krste: @sols,lidawei: Adding more dedicated mask register state increase cost/complexity for all machines. Long LMUL needs a lot of bits to hold mask. Dropping longer LMUL would reduce efficiency of simple machines.

So my proposal is to use LMUL-like grouping for mask registers, so that the mask regfile bits are used more effectively.

There will be 32 vm registers of VLEN/8 bits each
a new EMLMUL (effective mask LMUL), which will be one of the two:

a) fixed to LMUL, so as to go easy on hardware or something

SEW	LMUL	EMLMUL
8	1	1
16	2	2
32	4	4
64	8	8
8	8	8
64	1	1

b) EMLMUL = LMUL/(SEW/8)

SEW	LMUL	EMLMUL
8	1	1
16	2	1
32	4	1
64	8	1
8	8	8
64	1	1/8

base instructions will "fetch" the mask from vm0 to vm{EMLMUL-1} (EDIT: to clarify this point, this means implicit mask consumption)
other LMUL rules apply too
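
The two EMLMUL options can be written down as a small Python sketch. It only restates the two tables above; the function names are hypothetical, and `Fraction` is used so that the fractional case (SEW=64, LMUL=1) stays exact.

```python
from fractions import Fraction


def emlmul_fixed(sew: int, lmul) -> Fraction:
    """Option (a): EMLMUL is simply equal to LMUL (SEW is ignored)."""
    return Fraction(lmul)


def emlmul_scaled(sew: int, lmul) -> Fraction:
    """Option (b): EMLMUL = LMUL / (SEW / 8)."""
    return Fraction(lmul) / Fraction(sew, 8)


# Same (SEW, LMUL) rows as in the tables above.
ROWS = [(8, 1), (16, 2), (32, 4), (64, 8), (8, 8), (64, 1)]

if __name__ == "__main__":
    print("SEW\tLMUL\t(a)\t(b)")
    for sew, lmul in ROWS:
        print(f"{sew}\t{lmul}\t{emlmul_fixed(sew, lmul)}\t{emlmul_scaled(sew, lmul)}")
```

With option (b), only the SEW=8, LMUL=8 row needs a full group of 8 vm registers.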

The total capacity of the mask regfile is equivalent to 4 "old fashioned" mask registers (and according to the list, there is a need for 8-16 of them for anything but SEW=8 at LMUL=8). In most usual cases there will be more usable registers, and if required, LMUL can be lowered.
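
As a rough check of the capacity claim, here is a sketch that assumes one "old fashioned" mask register holds VLEN bits and each vm register holds VLEN/8 bits. The names are illustrative, and treating a fractional EMLMUL as still occupying one whole vm register is an assumption, not something the proposal states.

```python
import math
from fractions import Fraction


def capacity_in_old_mask_regs(n_vm: int) -> Fraction:
    """Storage of n_vm registers of VLEN/8 bits each, in units of VLEN bits."""
    return Fraction(n_vm, 8)


def usable_mask_groups(n_vm: int, emlmul) -> int:
    """How many distinct masks fit in the vm regfile at a given EMLMUL.

    Assumption: a fractional EMLMUL still takes one whole vm register.
    """
    regs_per_mask = max(1, math.ceil(Fraction(emlmul)))
    return n_vm // regs_per_mask


if __name__ == "__main__":
    print(capacity_in_old_mask_regs(32))   # capacity of the 32-register proposal
    print(usable_mask_groups(32, 8))       # worst case: SEW=8 at LMUL=8
    print(usable_mask_groups(32, 1))       # e.g. SEW=32 at LMUL=4 with option (b)
```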

jnk0le commented 3 years ago

Of course it can be lowered to just 16 vm registers, aka 2 "old fashioned" mask registers.

kasanovic commented 3 years ago

Adding separate mask registers is too drastic a change at this point in spec for unclear advantage.

indeets-vasily commented 3 years ago

I'm speaking from a programmer's perspective, so I want to apologize in advance if I am blatantly unaware of some engineering common knowledge.

Reading the specs I can't see where "8 or 16 mask registers" came from. My understanding is that even with the maximum LMUL_MAX=8, the mask still uses 1 physical register (only VLEN bits), because the minimum supported SEW is SEW_MIN=8 bits, so the required number of mask bits is LMUL_MAX * VLEN/SEW_MIN = 8 * VLEN/8 = VLEN. The EDIV extension currently states specifically that sub-elements can't be individually masked, so the presence of EDIV does not increase the number of required mask bits. When fractional LMUL is used, only one physical register holds the vector group, so by definition it can't hold more than VLEN elements.
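
The arithmetic above can be checked with a short sketch: one mask bit per element, with the element count given by LMUL * VLEN / SEW. The VLEN value and the function name are only illustrative.

```python
def mask_bits_needed(vlen: int, sew: int, lmul: int) -> int:
    """One mask bit per element: LMUL * VLEN / SEW bits."""
    return lmul * vlen // sew


VLEN = 128  # illustrative value; any power of two >= 64 behaves the same way

# Largest mask over all integer LMUL and standard SEW settings.
worst_case = max(
    mask_bits_needed(VLEN, sew, lmul)
    for sew in (8, 16, 32, 64)
    for lmul in (1, 2, 4, 8)
)

if __name__ == "__main__":
    print(worst_case, worst_case == VLEN)
```

The worst case is reached only at SEW=8 with LMUL=8, where the mask fills exactly one VLEN-bit register.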

I am concerned that, while a separate mask would require only 1 additional VLEN-sized register, holding it in v0 means that when (some) operations are masked:

1) LMUL - 1 physical registers are effectively "wasted" (they don't hold any useful values and can't be read or written). This is true regardless of whether the mask has to be kept or not.

2) Applications are left with a fraction LMUL/32 fewer registers to operate with if the mask has to be kept for subsequent operations.

With LMUL=8 this leaves only 3 out of 4 register groups that can be used to store data and results (25% fewer); with LMUL=4 one would have 7 out of 8 register groups (12.5% fewer).
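
A small sketch of that count, under the simplifying assumption made in this comment that the whole aligned group containing v0 is unavailable for data while a mask is live (function names are hypothetical):

```python
def data_groups(lmul: int, mask_in_v0: bool) -> int:
    """Number of aligned LMUL-sized register groups left for data."""
    total = 32 // lmul
    return total - 1 if mask_in_v0 else total


def lost_percent(lmul: int) -> float:
    """Share of LMUL-sized groups given up to the group holding v0."""
    return 100.0 / (32 // lmul)


if __name__ == "__main__":
    for lmul in (1, 2, 4, 8):
        print(lmul, data_groups(lmul, True), data_groups(lmul, False), lost_percent(lmul))
```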

For image processing this is a very hard restriction. For example, converting between color spaces means that each of the 3 new channels is a function of the 3 original channels, so results can't be written into any of the sources and a minimum of 4 register groups is needed. While color space conversion is usually done on the whole image and masking is rarely needed, some image enhancing/processing routines have the same restriction (each output is a function of up to 3 inputs) while being performed on portions of an image (i.e. they require a mask).

"Fancy" blending of a background with a transparent foreground requires 7 inputs (3 background RGB channels + 4 foreground RGBA channels) and needs a dedicated output register plus a mask/stencil.

I would assume that there are quite a number of other algorithms besides image processing for which the difference between 3 and 4, or 7 and 8, register groups is the difference between "we can implement it" and "it can't be implemented with this number of register groups without excessive loads/stores". Also, it just occurred to me that masked FMA with LMUL=8 will need to overwrite either the mask or one of its inputs, which kind of erodes the point of it being 4-operand.

While I understand that integrating a separate mask register into the spec would be hard from both the engineering side (adding cost to implementations) and the administrative side (the spec is being finalized and this would be a big disturbance), I think that before the spec hits v1.0 it is important to know the answers to the following questions:

Will separate mask register(s) appear in the following spec revisions/extensions?
Will decision to keep mask in v0 preclude or significantly hamper ability for implementations to include separate mask register(s)?
If later specs or some implementations decide to include separate mask register(s), what about code fragmentation? Code that can use 4 register groups + mask will be significantly different from code that can use only 3 register groups + mask. How well would potential opcodes for implementations with separate mask register(s) interact with the current spec?

jnk0le commented 3 years ago

Reading the specs I can't see where "8 or 16 mask registers" came from.

This thread: https://lists.riscv.org/g/tech-vector-ext/topic/vector_task_group_minutes/78727431?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,78727431 + whatever happened in the meetings.

indeets-vasily commented 3 years ago

@jnk0le Thanks for the link to the thread!

As far as I can see, the "8 or 16" physical mask registers are being pushed as an extension of the current spec's capabilities (i.e. having more mask state than in the current spec). For realizing the current spec's capabilities, only 1 physical mask register is needed.

While extra dedicated mask registers could definitely boost some HPC/graphics workloads, I don't dispute that this should be way outside of the base spec. My worries about having the mask in v0 and not in a single dedicated register are that:

1) A single mask is used in far more workloads than several masks.
2) A mask in v0 leaves only 2ⁿ-1 registers to use for data, which has the following impacts:
- It effectively wastes a lot of register space for small n (which is equivalent to large LMUL).
- Many algorithms were developed on the assumption that platforms have 2ⁿ data registers.
- Some of them can be modified to use one less data register, potentially with a performance loss due to a larger number of operations.
- Some of them can't be modified and will have to resort either to loads/stores (which they likely won't) or to decreasing LMUL, effectively using only half of the available register space and a larger number of smaller vectors.
3) A mask in v0 will potentially cause code to be non-portable when/if some implementation decides to include a separate mask register (or several of them).
4) Because of point 3, the working group will decide to prohibit using separate mask registers in performance cores (which, due to point 2, will make them much less performant).

I would like to make it clear that I'm not pushing or even proposing to include a separate mask reg in the base spec. I just want to know whether this will still be a possibility and whether potential code fragmentation is being taken into account by the WG.

jnk0le commented 3 years ago

As far as I can see, the "8 or 16" physical mask registers are being pushed as an extension of the current spec's capabilities (i.e. having more mask state than in the current spec). For realizing the current spec's capabilities, only 1 physical mask register is needed.

RVV++ will be able to source masks explicitly, currently it's just storage that can be moved into v0/vm0 before use. vmmv + mask consume is definitely a subject for macroop fusion.

jnk0le commented 3 years ago

Also, it just occurred to me that masked FMA with LMUL=8 will need to overwrite either the mask or one of its inputs, which kind of erodes the point of it being 4-operand.

There are only FMA3s available, so the mask cannot be overwritten. FMA4 was available in earlier revisions and then dropped for some reason.

Will separate mask register(s) appear in the following spec revisions/extensions? Will decision to keep mask in v0 preclude or significantly hamper ability for implementations to include separate mask register(s)?

Adding a separate mask regfile after the current design has been frozen/ratified is some x86-level craziness. I don't think it will be done.

@kasanovic

I think we have a similar problem to tail zeroed vs tail undisturbed back then.

Especially since the MLEN=1 encoding, unlike the 0.7.1 layout, doesn't have mask bits "in the right place". vrgather/vslides can tolerate longer latency/skewed pipelines, but is it still simple to perform early masking for power saving or for skipping execution?
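
For readers who did not follow the older drafts, the layout difference can be sketched as follows. This assumes the 0.7.1 draft's definition MLEN = SEW/LMUL, with the mask for element i stored in the least-significant bit of the i-th MLEN-bit field, and the v1.0 rule that the mask for element i is simply bit i. The helper names are hypothetical.

```python
def mask_bit_index_v1(i: int) -> int:
    """v1.0 (MLEN=1): mask for element i lives in bit i of the mask register."""
    return i


def mask_bit_index_v071(i: int, sew: int, lmul: int) -> int:
    """0.7.1 draft (assumed MLEN = SEW/LMUL): LSB of the i-th MLEN-bit field."""
    mlen = sew // lmul
    return i * mlen


def element_bit_offset(i: int, sew: int) -> int:
    """Bit offset of data element i inside a single register (LMUL=1)."""
    return i * sew


if __name__ == "__main__":
    sew = 32
    for i in range(4):
        print(i, mask_bit_index_v1(i), mask_bit_index_v071(i, sew, 1), element_bit_offset(i, sew))
```

At LMUL=1 the 0.7.1 index coincides with the position of the data element it guards, which is one reading of "in the right place"; with the dense layout the two positions differ for every element except the first.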

Furthermore, let's say that just 8 of the proposed vm registers (aka 1 "old fashioned" mask reg) are considered "enough" and the rest will be brought in together with EDIV (and maybe more with RVV++). I'm neutral about the outcome here.

BTW: a decoupled mask engine solves a lot of my doubts about #451 (maybe I'm wrong again)

indeets-vasily commented 3 years ago

@jnk0le

Sorry, but I'm a little bit confused:

In the previous comment you wrote

RVV++ will be able to source masks explicitly, currently it's just storage that can be moved into v0/vm0 before use.

which I've interpreted as "in the future extensions/revisions there will be the option to specify separate mask source".

But now you are writing that

Adding a separate mask regfile after the current design has been frozen/ratified is some x86-level craziness. I don't think it will be done.

So will there be a possibility for implementations with mask(s) outside of data registers or not?

There are only FMA3s available, so the mask cannot be overwritten. FMA4 was available in earlier revisions and then dropped for some reason.

Seems I missed that change. Anyway, this was a side note; my main point is that for some applications you need 4 or 8 data register groups for sources/destinations plus a separate mask, and that holding the mask in v0 "locks away" v1-v7 at various LMUL settings and leaves you with an odd number of data register groups.

Furthermore, let's say that just 8 of the proposed vm registers (aka 1 "old fashioned" mask reg)

You still need 1 "new" physical mask register, because 1 "old fashioned" mask register was holding no more than VLEN bits of information.

@kasanovic

My main worry is that this version of the mask implementation will inhibit image processing/scientific/HPC algorithms on RISC-V. I understand that the current focus seems to be on embedded implementations, but it would be nice to know whether mask registers that are separate from data are still a possibility in the future or whether they won't be allowed.

jnk0le commented 3 years ago

Sorry, but I'm a little bit confused:

In the previous comment you wrote

RVV++ will be able to source masks explicitly, currently it's just storage that can be moved into v0/vm0 before use.

which I've interpreted as "in the future extensions/revisions there will be the option to specify separate mask source".

But now you are writing that

Adding a separate mask regfile after the current design has been frozen/ratified is some x86-level craziness. I don't think it will be done.

So will there be a possibility for implementations with mask(s) outside of data registers or not?

"Source masks explicitly" means letting the instruction select the mask register by number instead of sourcing it implicitly. It's independent of where the masks end up being stored (data regs or a separate regfile).

The main no-gos for later introduction of a separate mask regfile:

- massive opcode duplication
- software fragmentation (everyone wants to be backward compatible, and microarchs need to be performant in both cases)
- a separate mask regfile is also intended to simplify the microarch (maybe even reduce gates/area/power despite the extra bit storage). An architecture able to source masks from data regs as well as from separate mask regs defeats this purpose.

nick-knight commented 3 years ago

The strongest argument I've seen in this thread for having a separate mask register is that the "implicit mask register" (v0) can waste a size-LMUL register group, and the increased register pressure forces use of a smaller LMUL. In my experience on SiFive's designs, increasing LMUL has diminishing returns (intuition: the datapath width isn't changing), so I don't view this as terribly upsetting. To change my mind, I guess I'd need to see a concrete application along with some hardware implementation details (like datapath width relative to VLEN) that help quantify the benefits of LMUL on that machine.

In any event, I agree with Krste that this would be a disruptive change: there are quite a few instructions that read/write masks from registers other than v0, so we'll end up needing multiple forms of these instructions, or auxiliary mask registers, or instructions to copy masks to/from data registers, etc. I think the associated design tradeoffs have been debated in other threads; at some point I hope this rationale is written up in a white paper, because the lack of separate mask registers (and their number) is a notable distinction from, say, SVE.

Is the hybrid approach --- RVV instructions source their masks from v0 while RVV++ instructions source theirs from a separate mask regfile --- really a "no-go"?

I imagine RVV++ will have massive opcode "duplication" anyway: presumably it will at least widen the data regfile, so every RVV instruction would have a longer RVV++ form that can access the new registers.

And you can have backwards compatibility by supporting the implicit mask register (v0) approach for the RVV instructions. I don't have the expertise to estimate the power or area cost for this across all reasonable designs, but I would expect it to be lower order compared to widening the data regfile and adding a mask regfile. (I could be completely wrong!)

kasanovic commented 3 years ago

The implicit mask register v0 does not "waste" an LMUL-size register group. Even at LMUL=8, vector registers v1-v7 can be used to hold other register-allocated values, including other mask values, or LMUL<8 vector register groups (such as narrowed values). In any case, I don't believe that many algorithms need exactly a power of 2 number of data registers, or that many codes run best with only 3-4 register groups using LMUL=8. The current tight (ILEN=32) encoding only has a single active mask register, v0, partly because there are no spare instruction bits to encode a mask specifier in any case. The expanded (ILEN=64) encoding would allow the mask vector register to be specified rather than being implicitly v0, and would allow for a greater number of architectural vector registers.
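
To make the encoding constraint concrete: in the 32-bit vector encoding, masking is controlled by a single `vm` bit (instruction bit 25), which can only choose between "masked by v0" and "unmasked". A minimal sketch with hypothetical helper names, operating on a synthetic instruction word rather than a real assembled instruction:

```python
VM_BIT = 25  # position of the one-bit vm field in 32-bit vector encodings


def is_masked_by_v0(insn: int) -> bool:
    """vm=0: operation is masked by v0; vm=1: operation is unmasked."""
    return ((insn >> VM_BIT) & 1) == 0


def make_unmasked(insn: int) -> int:
    """Return the same instruction word with vm set to 1 (unmasked)."""
    return insn | (1 << VM_BIT)


if __name__ == "__main__":
    word = 0x00000000  # synthetic word, used only to exercise the vm field
    print(is_masked_by_v0(word), is_masked_by_v0(make_unmasked(word)))
```

With one bit there is room for an on/off switch but not for a register number, which is why a mask specifier would have to wait for a longer encoding.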

jnk0le commented 3 years ago

there are quite a few instructions that read/write masks from registers other than v0

That's not that bad. Mask generators can write to any register, and dedicated mask instructions can source/write from/to any register (except their own implicit mask in some cases). Then we can do macro-op fusion of mask writes to v0/vm0 + implicit consumption.

Is the hybrid approach --- RVV instructions source their masks from v0 while RVV++ instructions source theirs from a separate mask regfile --- really a "no-go"?

I imagine RVV++ will have massive opcode "duplication" anyway: presumably it will at least widen the data regfile, so every RVV instruction would have a longer RVV++ form that can access the new registers.

And you can have backwards compatibility by supporting the implicit mask register (v0) approach for the RVV instructions.

I strongly believe that if RVV ends up sharing data registers for masks, then RVV++ should stay the same.

- the gain from extra mask space is questionable when we already have to spend transistors on all the necessary mask-from-data trickery
- treating base vector instructions as compressed forms of RVV++, not as unwanted MMX/x87/SSE/AVX/AVX2/NEON-style legacy, requires RVV++ to still (explicitly) source/write masks from/to data registers
- for the above reason, use of separate masks will be rare
- both types of masking need to be implemented efficiently by microarchs, otherwise we get into a vicious circle: "nobody uses separate masks because microarchs don't optimize use of them because nobody uses separate masks"

indeets-vasily commented 3 years ago

@kasanovic

The implicit mask register v0 does not "waste" an LMUL-size register group. Even at LMUL=8, vector registers v1-v7 can be used to hold other register-allocated values, including other mask values, or LMUL<8 vector register groups (such as narrowed values).

The implicit mask register does "waste" a register group, because you can't use register group 0 in the current loop to hold source/destination values. And in my experience you quite often want your outputs to be the same width as your inputs (having to convert integers to floats at least 2 times wider, or 4 times wider for 8-bit data on x86, to do normalized multiplication is a source of frustration for me; I would appreciate a platform that provides an operation that multiplies two n-bit integers and divides the product by 2ⁿ-1).
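
For reference, the normalized multiplication described above can be sketched in integer arithmetic as follows. This is only an illustration of the operation being asked for, with a hypothetical function name and round-to-nearest chosen as an assumption; note that the intermediate product needs 2n bits, which is where the widening comes from.

```python
def mul_norm(a: int, b: int, n: int = 8) -> int:
    """Multiply two n-bit values and divide by 2**n - 1, rounding to nearest.

    Treats a and b as fixed-point numbers in [0, 1] scaled by 2**n - 1,
    so that mul_norm(full, x) == x for every x.
    """
    full = (1 << n) - 1
    assert 0 <= a <= full and 0 <= b <= full
    product = a * b                       # needs up to 2n bits
    return (product + full // 2) // full  # rounded division by 2**n - 1


if __name__ == "__main__":
    print(mul_norm(255, 128), mul_norm(128, 128), mul_norm(65535, 1234, n=16))
```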

In any case, I don't believe that many algorithms need exactly a power of 2 number of data registers, or that many codes run best with only 3-4 register groups using LMUL=8.

In color images there are 3 color channels + 1 optional alpha channel. So quite often you end up needing 4 or 8 vectors:
- RGB + destination/intermediate for single image processing,
- 2 x (color channel + alpha) for 2 image alpha compositing with source overwriting,
- RGBA top image + RGB background + destination/intermediate for blending with BG,
- 2 x RGBA with source overwriting if you do alpha compositing holding all color channels separately in registers at once.
Widening FMA requires 2 sources and 1 widened accumulator, which is equivalent to 4 single-width register groups.
Algorithms that are written using 8/16 registers to hold data just because x86 provides that many.

Maybe, as you and @knightsifive pointed out, using 4 register groups with LMUL=8 is not the most effective, but:

What's the point of providing it then, if implementations are expected to make LMUL=8 less efficient than 2 x LMUL=4?
Why prevent future implementations from providing efficient operations with large LMUL?
If programmers can do something, they probably will.
The examples I provided include 8 register groups with LMUL=4, which you didn't address.
If using large LMUL is not significantly faster, we would like to do preloading on OoO implementations and use twice the LMUL that is logically needed.

Basically, the current spec will lead to a situation where a lot of computation algorithms use half the LMUL and only half of the registers for the masked versions of the algorithm. This will not only make code maintenance and optimization harder, but can potentially cause code that performs the same calculations on less data to run slower, because in the majority of cases a mask means "skip calculations for these data points" rather than "do different calculations depending on a condition".

And I wouldn't like to rewrite hardest, math-heavy parts of my programs to somehow use lower number of registers to hold my data than is mathematically possible, all in order to avoid performance hit for a masked version of algorithm that I've already written and tested for correctness long time ago.

What about GPU and ML accelerators? I'm not an IC engineer, so I may be totally wrong about how GPUs work internally, but aren't they masking pixels that are not part of a triangle when rendering? In ML you have sparse weights, dropout regularization and strided operations, which are all masked operations. Wouldn't these implementations, due to their large number of cores, be inclined to maximize the datapath and use all of their registers to hold data, without wasting physical space on registers that are unused for their domain-specific calculations?

The current tight (ILEN=32) encoding only has a single active mask register, v0, partly because there are no spare instruction bits to encode a mask specifier in any case. The expanded (ILEN=64) encoding would allow the mask vector register to be specified rather than being implicitly v0, and would allow for a greater number of architectural vector registers.

I believe there's a misunderstanding. I'm not arguing for a larger number of mask registers. In fact, I have consistently disagreed with @jnk0le on that. I am arguing for one separate mask register in the base spec.

Because as I am seeing it, it is "33 VLEN registers (32 for data + 1 for mask) + some changes in a spec" vs "32 registers for both masks and data + lower number of available data registers/register groups for masked versions of algorithms + even more register pressure with additional mask registers".

jnk0le commented 3 years ago

@indeets-vasily

I think they would be happiest if provided with concrete C or SSE/AVX/NEON/shader code, assuming of course that you are not commercially bound. Otherwise, maybe papers or linked OS code would do.

I am arguing for one separate mask register in the base spec.

There needs to be at least two, otherwise many mask instructions (and a few forms of masked mask instructions) become useless.

EDIT: of course there is also

Solmatnikov: One option is to allow mask generating instructions (compares) to write either to regular vector regs or to vmask and to provide move instructions between vector regs and vmask. But mask consuming instructions can use only vmask as a mask. Mask load and store are also only for vmask. This is no worse than current design.

I'm a bit skeptical about this approach.

Because as I am seeing it, it is "33 VLEN registers (32 for data + 1 for mask) + some changes in a spec" vs "32 registers for both masks and data + lower number of available data registers/register groups for masked versions of algorithms + even more register pressure with additional mask registers".

EDIT: proposal is of course to have mask regfile separated from data - no extra reg pressure. The top proposal is (raw bit storage) equivalent to:

33 VLEN registers for 8 vm registers
34 VLEN registers for 16 vm registers
36 VLEN registers for 32 vm registers

+ some logic removed and some added, I'm unable to provide estimates here.

jnk0le commented 3 years ago

One more thing.

Microarchitectures with internal (mask) register rearrangements still have to implement logic for casting masks to data and vice versa. And it has to be done efficiently (e.g. do the mask rearrangement cast on use rather than on write), as that will happen on every context switch. It doesn't sound transistor-efficient.

kasanovic commented 3 years ago

The mask design won't be changing this drastically for v1.0, if ever, so I'm marking this as post-v1.0, where expanding mask capabilities is intended to be discussed (and this should probably be a new issue).

kobalicek commented 2 years ago

I also think that the current masking design is not practical.

Take a look at AVX-512 and SVE/SVE2.

I think especially in SVE, there is also the limitation of having 32-bit instruction words only, and they have solved the problem by having destructive destination/source register. So when you mask some operation, you basically go from 3 operand form to 2 operand form and a mask in most instructions. It's a tradeoff, but it's still very practical and in most cases you won't need non-destructive destination.

Also, treating vector register as a mask is a big waste - you really need the registers, there is a reason why AVX-512 extended the number of registers to 32 and introduced separate register file for masks.

I hope I just misunderstood the spec, I'm not sure it would be even possible to easily port existing AVX-512 or SVE code to RISCV-V.

jnk0le commented 2 years ago

It's now past ratification, so any change here is unlikely.

I hope I just misunderstood the spec, I'm not sure it would be even possible to easily port existing AVX-512 or SVE code to RISCV-V.

Porting from SVE should not be a big issue, except for having fewer total registers (mask + data). AVX porting needs VLA to be applied first, for both RVV and SVE.

cmuratori commented 2 years ago

I would like to add to the general concern here. While it may be possible to demonstrate that the current masked instruction design is the best design, I don't know that it should be considered a settled matter at this point, so:

Is there a reason masked instructions could not be simply omitted from the V1 spec? This seems a more sensible thing to do if there is concern about it. Rather than ratify something that might cause problems later on because it was a bad decision, why not just have "ratified" RISC-V V only support select-style blending (and/and-not or) and hardware that wishes to use masking is considered its own experimental extension?
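
The select-style alternative mentioned above can be sketched like this: compute the operation for all lanes, then merge with the old destination using and / and-not / or. This is a plain-Python illustration with lists standing in for vector lanes; the function names and the 8-bit lane width are assumptions made for the example.

```python
def select_lanes(mask, a, b, width=8):
    """Per lane: take a where the mask is set, b where it is clear."""
    full = (1 << width) - 1
    out = []
    for m, x, y in zip(mask, a, b):
        lane_mask = full if m else 0  # expand mask bit to an all-ones/zero lane
        out.append((x & lane_mask) | (y & ~lane_mask & full))
    return out


def masked_add(mask, dest, a, b, width=8):
    """Emulate a masked add: unmasked add, then blend with the old dest."""
    full = (1 << width) - 1
    computed = [(x + y) & full for x, y in zip(a, b)]
    return select_lanes(mask, computed, dest, width)


if __name__ == "__main__":
    print(masked_add([1, 0, 1], [9, 9, 9], [1, 2, 3], [10, 20, 30]))
```

The blend needs the unmasked result, the old destination and the mask to be live at the same time, so it trades the implicit mask operand for extra instructions and register pressure.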

If the answer is "because we want a standardized version of masking because masking is important for performance", well, it seems unlikely based on existing non-RISC-V architectures that the best instruction set design for performance is going to be to conflate the vector registers with the mask registers. There's a lot of good reasons why you would want to separate them.

One of the difficulties with something like masked vector instructions is that it takes a lot of experimentation, expertise, and analysis to actually be certain of how the instructions are best designed. If there's a question about it, why not give it more time? Many of us are only recently able to start looking at RISC-V for performance-oriented programming because the hardware is only now getting to the point where you might use it for serious HPC or graphics. Giving people more time to come on board, analyze, and report before ratifying seems like a better idea than pushing something through that might be a mistake, especially on leading-edge style instructions like masked vector ops.

- Casey

kobalicek commented 2 years ago

I agree, I really think this will be regretted in the future - this will end up like X86 - there will have to be future encoding that would have to fix the mistakes from the past, and RISC-V is not really that old...