riscv / riscv-fast-interrupt

Proposal for a RISC-V Core-Local Interrupt Controller (CLIC)
https://jira.riscv.org/browse/RVG-63
Creative Commons Attribution 4.0 International

usefulness of PUSHINT/POPINT #108

Open Kevin-Andes opened 3 years ago

Kevin-Andes commented 3 years ago

From David Horner, Oct 19 2020, message #192

Note: cc'ing tech-code-size TG list. Please drop them off as appropriate.

The tech-code-size TG is proposing instructions that store and load relative to sp (R2) not only GPR but also mret, mcause and mepc.

https://github.com/riscv/riscv-code-size-reduction/blob/master/ISA%20proposals/Huawei/riscv_push_pop_extension_RV32_RV64.adoc#pushint-epopint-e

I expect many of the obvious warts can be readily fixed. For example, I believe the mtval and/or mstatus [for legacy interrupt] CSR was intended rather than the mret instruction. The m-, s-, u- (etc.) designation for xtval/xstatus, xcause, xepc will be determined by the x-mode it is executed under. Ditto for inclusion of mtval and mstatus, dependent upon trap/interrupt type/state.
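As an illustration of what such sp-relative store/load instructions do, here is a minimal Python sketch of the micro-op sequence behind a push/pop pair. The saved register list, the frame layout, the RV32 word size, and the function names `pushint`/`popint` are assumptions made for illustration only; they are not taken from the linked proposal.

```python
# Toy model of the micro-op sequence a PUSHINT/POPINT pair might perform.
# ASSUMPTIONS (not from the proposal): the saved register list, the frame
# layout, XLEN = 32, and mcause/mepc as the pushed CSRs.

WORD = 4  # bytes per register on RV32
SAVED_CSRS = ["mcause", "mepc"]
SAVED_GPRS = ["ra", "t0", "t1", "t2", "a0", "a1"]  # hypothetical list
FRAME = SAVED_CSRS + SAVED_GPRS


def pushint(gpr, csr, mem):
    """Allocate a frame below sp and store CSRs, then GPRs, into it."""
    gpr["sp"] -= WORD * len(FRAME)
    for i, name in enumerate(FRAME):
        value = csr[name] if name in SAVED_CSRS else gpr[name]
        mem[gpr["sp"] + WORD * i] = value
    return len(FRAME)  # number of store micro-ops issued


def popint(gpr, csr, mem):
    """Reload CSRs and GPRs from the frame, then release it."""
    for i, name in enumerate(FRAME):
        value = mem[gpr["sp"] + WORD * i]
        if name in SAVED_CSRS:
            csr[name] = value
        else:
            gpr[name] = value
    gpr["sp"] += WORD * len(FRAME)
    return len(FRAME)  # number of load micro-ops issued
```

Each loop iteration corresponds to one store or load that a sequencer or a multi-cycle implementation would have to issue, which is the cost the rest of this message is concerned with.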

However, some fundamental characteristics may be more difficult to address.

The breadth of flexibility in the fast-interrupt design is challenged by advocating prepackaged instructions that might become de facto standards, affecting

     a) minimalist micro-arch designs that would have to incorporate the underlying state machine by

                     i)  performing the micro-ops via a sequencer or

                     ii) a monolithic multi-cycle instruction implementation.

           neither is consistent with our goal of targeting embedded/IOT.

     b) the software ecosystem will be dis-incentivized to improve/optimize the simpler sequences.

Our objectives differ from the code-size TG. Although code size reduction can be achieved using these instructions, there is no general usefulness for them. They are practically valuable only in trap handlers. Granted, reducing the code may help cache locality, and for a vectoring interrupt design that has discrete entry points per vector there would be a proportional [and possibly] significant code size reduction. However, dependence upon these macro-instructions would militate against the consolidation of these vectors, which would provide a net improvement.
Similarly, removing the instruction completely may be a better option: to include the macro-functionality in the vectored interrupt itself.
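The code-size argument can be put into rough numbers. The sketch below is a back-of-envelope estimate only. It assumes RV32 with 32-bit encodings (4 bytes per instruction), one 32-bit encoding each for a hypothetical PUSHINT and POPINT, and illustrative register and CSR counts; none of these figures come from the proposal.

```python
# Back-of-envelope code-size estimate for handler save/restore code.
# ASSUMPTIONS: 4-byte instructions only, one encoding each for
# PUSHINT and POPINT, and caller-chosen register/CSR counts.

INSN_BYTES = 4


def explicit_bytes(n_gprs, n_csrs):
    """Bytes for an open-coded save plus restore sequence."""
    # save: addi sp, one sw per GPR, then csrr + sw per CSR
    save = 1 + n_gprs + 2 * n_csrs
    # restore: lw + csrw per CSR, one lw per GPR, then addi sp
    restore = 2 * n_csrs + n_gprs + 1
    return (save + restore) * INSN_BYTES


def macro_bytes():
    """Bytes for one PUSHINT plus one POPINT."""
    return 2 * INSN_BYTES


def saving(n_vectors, n_gprs, n_csrs, consolidated=False):
    """Bytes saved by the macro instructions across all handlers.

    consolidated=True models vectors that share a single prologue
    and epilogue, so the saving is realised only once.
    """
    copies = 1 if consolidated else n_vectors
    return copies * (explicit_bytes(n_gprs, n_csrs) - macro_bytes())
```

With 6 GPRs and 2 CSRs the open-coded sequence is 88 bytes against 8, so 16 discrete vectors save 1280 bytes, while a consolidated entry saves only 80. Under these assumptions the benefit scales with the number of unconsolidated vectors, which is the dependence the paragraph above warns about.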

Although the functionality currently proposed can certainly be incorporated into some handlers with a net performance boost, they are not tailored to handler requirements.

Notable (and, I expect, obvious to our group) is the xtval/xstatus, xcause and xepc handling.

       The value gained from reading the CSRs is reduced relative to the simpler approach, as
             1) they are only pushed to the stack.
                       Either the value from the stack will need to be subsequently read or the csrr reexecuted.
                       In either case we can expect for many implementations slower handling of a pending same mode interrupt.
                            either as progress to either exit critical code or stall in reading stack, potentially from write queue.
             2) no provision for concurrent csr setting is present in the current formulation. 

Point 1 above can be mostly addressed by a PUSHPOPulateINT that stores xtval/xstatus, xcause and xepc in pushed GPR registers.

     However, for handlers that don't immediately use those values, the wait to load into the GPR is counter to our objectives.
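The difference between the two variants can be sketched as follows. This is a toy Python model under stated assumptions: a two-GPR, two-CSR frame, RV32 word size, and the function names are all invented for illustration, and "PUSHPOPulateINT" is modelled only as described in the text above.

```python
# Toy comparison of a stack-only push and a "populate" push.
# ASSUMPTIONS: frame layout, register choice (a0/a1) and CSR choice
# (mcause/mepc) are illustrative, not taken from any specification.

def pushint_stack_only(gpr, csr, mem):
    """CSR values are written to the stack frame and nowhere else."""
    gpr["sp"] -= 16
    mem[gpr["sp"] + 0] = csr["mcause"]
    mem[gpr["sp"] + 4] = csr["mepc"]
    mem[gpr["sp"] + 8] = gpr["a0"]
    mem[gpr["sp"] + 12] = gpr["a1"]


def pushint_populate(gpr, csr, mem):
    """Same frame, but the freed GPRs are loaded with the CSR values."""
    pushint_stack_only(gpr, csr, mem)
    gpr["a0"] = csr["mcause"]  # a0 is already saved, so it can be reused
    gpr["a1"] = csr["mepc"]


def cause_via_stack(gpr, mem):
    """Extra load a handler needs after the stack-only variant."""
    return mem[gpr["sp"]]
```

After `pushint_stack_only` the handler still has to call something like `cause_via_stack` (or re-execute a csrr) before it can dispatch on the cause. After `pushint_populate` the value is already in `a0`, at the price of the extra register writes that handlers which never inspect the value do not need.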

A potential benefit within the interrupt realm is that, just as other long-running operations (notably divide) can continue processing across an interrupt, so can these macro-ops, especially if they were sufficiently comprehensive to also enable interrupts.

Finally, a suggestion that examining these possibilities can give us insight into avenues of improvement and also provide a test bed for investigations.

Kevin-Andes commented 3 years ago

From: Tariq Kurd, Oct 21 2020, #193

Hi David,

Thanks for this analysis.

I added PUSHINT/POPINT to my proposal specifically for the interrupt handler, so any requests to change the specification from the interrupt group would be gratefully received.

The specifications fit nicely with PUSH/POP, which is why I’ve specified them as part of the code size reduction work, though they certainly will have minimal benefit for code size reduction (so no reason to have 16-bit encodings for them).

However – they lend themselves nicely to clever micro-architectural mechanisms to rapidly context save/restore on handler entry/exit so really they’re probably more relevant for code-speed than code-size, so I’ve CC’d Jeremy.

Tariq

Kevin-Andes commented 3 years ago

From: Allen Baum, Oct 21 2020, #194

I see that there is a bias against microsequenced implementations for microcontrollers, as if they substantially increase... something. I see quite the opposite. IF you don't allow them, you're effectively forcing a multiply to be single cycle. There will be microsequences. Having them can simplify, speed up, and lower the memory footprint of interrupt handling, and reduce silicon area. Reducing that memory footprint can also have a second order effect of lowering icache pressure, thus saving power and increasing performance. Fast and easy interrupt handling may be one of the core characteristics of a microcontroller.

Kevin-Andes commented 3 years ago

From: David Horner, Oct 21 2020, #195

On 2020-10-21 7:08 p.m., Allen Baum wrote:

I see that there is a bias against microsequenced implementations for microcontrollers, as if they  substantially increase... something.
I see quite the opposite. IF you don't allow them, you're effectively forcing a multiply to be single cycle.

Rather than a bias, it is a recognition that adoption of these instructions as central to the fast-interrupt design limits implementation options.

They are heavy-weight instructions.

Some target implementations won't have multiply let alone divide.

And yet I muse that these instructions could operate similarly to many divide implementations, and continue to operate concurrently with a preempting trap.

There will be microsequences.

I considered this a design choice.

Having them

are you speaking of the PUSH/POPINT instructions?

can simplify, speed up, and lower the memory footprint of interrupt handling, and reduce silicon area.
Reducing that memory footprint can also have a second order effect of lowering icache pressure, thus saving power and increasing performance.
Fast and easy interrupt handling may be one of the core characteristics of a microcontroller.

I note that all you have said can apply to PUSH/POPINT instructions, each may be beneficial in those ways to some implementations.

The potential good is inarguable. I am especially concerned about and want to be on guard for the potential bad.

Kevin-Andes commented 3 years ago

From: Allen Baum, Oct 21 2020, #196

We may be arguing different things. If we are finding instructions that use micro sequencing unacceptable, we are limiting implementation choices. If we are saying that a specific op would require micro sequence - we are limiting architectural choices. In this case, we are talking about a code size extension targeting (but not limited to) a micro controller profile. It could easily take advantage of this instruction, but could survive without it as well. It’s an extension - you can implement it or not. The real question is whether a micro controller profile requires it, makes it optional, or will not support it at all while requiring the rest of a code size reduction profile.

If a micro controller doesn’t think it’s getting bang for the buck- don’t implement it. But a micro sequencer implementation is far cheaper than many other optional features for that class of machine. It would be interesting to compare that to the area of RAM or ROM it would replace.

-Allen

Kevin-Andes commented 3 years ago

From: David Horner, Oct 21 2020, #197

On 2020-10-22 1:09 a.m., Allen Baum wrote:

We may be arguing different things. 
If we are finding instructions that use micro sequencing unacceptable, we are limiting implementation choices.

Not that they are unacceptable per se. But MAY be unacceptable to build into the fast-interrupt recommendations.

If we are saying that a specific op would require micro sequence - we are limiting architectural choices.

And we are not saying that. Anyone can implement however they want; a novel implementation may work better in this limited context than micro sequencing.

Advances in "dark silicon" may make a combinatoric logic implementation optimal. No clocking of a full sequencer, just go with the flow.

In this case, we are talking about a code size extension targeting (but not limited to) a micro controller profile. It could easily take advantage of this instruction, but could survive without it as well.
It’s an extension - you can implement it or not.

The real question is whether a micro controller profile requires it, makes it optional, or will not support it at all while requiring the rest of a code size reduction profile.

exactly. I didn't actually think we were arguing. We are discovering the nuances. As I said "insight into avenues of improvement".

If a micro controller doesn’t think it’s getting bang for the buck- don’t implement it. But a micro sequencer implementation is far cheaper than many other optional features for that class of machine. It would be interesting to compare that to the area of RAM or ROM it would replace.

You left out power consumption from the trade-offs, critical for such chips.

I mentioned other trade-offs, including the possibility of eliminating the instruction completely by incorporating it into the interrupt housekeeping [like the populating of the mcause fields].

I greatly appreciate your interaction.

Kevin-Andes commented 3 years ago

From: Tariq Kurd, Oct 22 2020, #198

If an implementation supports PUSH/POP then supporting PUSHINT/POPINT should add little more complexity.

I think the question is how we structure the extension – I wouldn’t personally require PUSHINT/POPINT to be implemented in the fast interrupt spec, for the same reason that I intend to have an option to exclude all multi-step instructions from the future code-size reduction ISA extension (e.g. PUSH/POP) – not everyone will want to implement multi-step sequences.

Adding more and more extensions is certainly bad for delivering libraries as they add more options, but I expect PUSHINT/POPINT to be in __asm inserts only, or even directly written in assembler, so there should be no effect there.

What I propose is an extension something like Zclicpushpopint (not sure the Zclic bit is right…..), and to give the spec to the fast interrupt team to complete.

Does that sound like a sensible approach?

Tariq

Kevin-Andes commented 3 years ago

From: David Horner, Oct 22 2020, #199

Tariq Kurd:

and to give the spec to the fast interrupt team to complete.

Well we have already done that. In part through this email chain.

On 2020-10-22 6:11 a.m., Tariq Kurd wrote:

If an implementation supports PUSH/POP then supporting PUSHINT/POPINT should add little more complexity.

not so. csr may be significantly differently accessible than GPR.

I agree that for many implementations it could be very little add.

I think the question is how we structure the extension – I wouldn’t personally require PUSHINT/POPINT to be implemented in the fast interrupt spec, for the same reason that I intend to have an option to exclude all multi-step instructions from the future code-size reduction ISA extension (e.g. PUSH/POP) – not everyone will want to implement multi-step sequences.

good plan.

Adding more and more extensions is certainly bad for delivering libraries as they add more options, but I expect PUSHINT/POPINT to be in __asm inserts only, or even directly written in assembler,

reasonable for many cases, but fast-interrupt TG is targeting full software stack support.

so there should be no effect there.

Again, not so. The cascade effects could be substantial, especially as more functionality of PUSH/POPINT is incorporated into the trap hardware management.

Coordinating the behaviour of a hart with/without hardware push/pop assist, with/without PUSH/POPINT instruction support, and compiler and loader optimizations begins to explode in complexity.

What I propose is an extension something like Zclicpushpopint (not sure the Zclic bit is right…..), and

There is a thread [that is dropped from github] about profile slice and dice of functionality and how to manage.

[riscv/riscv-v-spec] Names for embedded vector extension (#550)

And this is the issue in github:

https://github.com/riscv/riscv-v-spec/issues/550

The question there is whether such Z* names are better divorced from specification ratification.

It can certainly begin before the draft is submitted, but it may be better, so as not to clutter thinking/prep for submission, to defer finalizing the partitioning to sub- and sub-sub-Znames until after.

I think the success of ZFinX as a placeholder gave false utility to Z-names.

ZFinX's success is not in its Z-formulation, but rather the distinct idea that could have been expressed in other ways including Ψ-extension.

Nor does it fit the Z-formulation model cleanly. ZDinX and ZQinX are not proposed (nor discussed) because, in part,

1) they muddy the waters.

2) Z*inX has little to do with F/D/Q which mandate a float register file.

3) is in reality a ZiFloat support.

And so we see the complexity of adding Z-notation early. It is a distraction in trying to get undefined components properly named.

to give the spec to the fast interrupt team to complete.

As I mentioned earlier, we have done that.

And already received invaluable input from yourself, including this email, from Allen Baum, from Bill Huffman indirectly through V-spec, and from others on other threads.

What is important is to keep the discussion going. [I acknowledge that my intervention style may stifle. Not my intent.]

Does that sound like a sensible approach?

The Z-naming has little value, but contributing the concepts that

1) multi-step ops can/should be considered differently.

2) that, therefore, PUSH/POP are not expected to be mandated within the code-size-reduction standard

3) that therefore perhaps fast-interrupt should not rely on it as central to its proposal.

all of that, the opportunity to raise the awareness of the pitfalls of Z-naming

and the other ideas that I have glossed over or am too dense to perceive,

are all worthwhile contributions to the discussion/contemplations.

Kevin-Andes commented 3 years ago

From: Allen Baum, Oct 23 2020, #200

This discussion is bringing up an issue that needs wider discussion about extensions in general. RISC-V is intended to be an architecture that supports an extremely wide range of implementations, ranging from very low gate count microcontrollers, to high end superscalar out-of-order processors. How do we evaluate an extension that only makes sense at one end or the other?

I don't expect a vector, or even hypervisor, extension in a low gate count system. There are other extensions that are primarily aimed at specific application areas as well.

A micro sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system (e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance. They may have none of those, or less significant, advantages in a high end implementation -- and/or might be very difficult or costly to implement in them, (e.g. for TLB miss, interrupt, & exception handling ) (I am not claiming that these specific ops do, but just pretend there is one like that)

Should we avoid defining instructions and extensions like that? Or just allow that some extensions just don't make sense for some class of implementation? Are there guidelines we can put in place to help make those decisions? This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range based CMOs.

Kevin-Andes commented 3 years ago

From: Greg Favor, Oct 26 2020, #205

It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for. More concretely, it seems like there are a few TG's developing extensions oriented towards embedded real time systems and/or low-cost embedded systems. These are extensions that would probably not be implemented in full-blown Linux-class systems. Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category. And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems). And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

Kevin-Andes commented 3 years ago

From: Robert Chyla, Oct 26 2020, #206

I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class small systems.

Work in other groups (bitmanip) will also benefit code size, but it is not primary focus I think as these will also improve code-speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'. It should not hurt as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement'='logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling same rug (ISA-space) into opposite directions. We must balance it properly - having a rug in one piece is IMO most important.

Regards, /Robert

Kevin-Andes commented 3 years ago

From: Tariq Kurd, Nov 2 2020, #208

Hi Greg/Robert,

Trying to respond to different emails with the same subject line, maybe we should split the threads.

Unless I'm mistaken, this TG falls in that category. And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems). And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

The code-size group is supposed to cover all bases, although we know that it is most important for embedded. The intention is to separate out instructions which may be different to implement on high performance cores (i.e. the multi-step instructions like PUSH/POP). All cores should benefit from improved code size as it reduces I-fetch bandwidth and improves cache utilisation.

IMO 'code-size' and 'code-speed' will be pulling same rug (ISA-space) into opposite directions. We must balance it properly - having a rug in one piece is IMO most important.

Agreed. In general reducing instruction counts will benefit size and speed but I’m sure there will be cases where there are conflicts between the two optimisation points. We need a holistic solution with the two groups working together. It’s unfortunate that I missed the first code-speed meeting, but I will attend in future (and Jeremy/Wei Wu attend the code-size meetings).

For PUSHINT/POPINT, thanks for the detailed email David.

If an implementation supports PUSH/POP then supporting PUSHINT/POPINT should add little more complexity.

not so. csr may be significantly differently accessible than GPR.

Yes that’s possible. I’m thinking that the micro-ops will be issued as a sequence of standard RISC-V instructions, so could be CSR ops, after all POPRET can’t only issue as a state machine in the LSU as it needs to update SP and also RET. But granted CSR ops are different from any sequenced micro-ops specified so far.

Again, not so. The cascade effects could be substantial, especially as more functionality of PUSH/POPINT is incorporated into the trap hardware management.

Coordinating the behaviour of a hart with/without hardware push/pop assist, with/without PUSH/POPINT instruction support, and compiler and loader optimizations begins to explode in complexity.

The question there is whether such Z* names are better divorced from specification ratification.

It’s a good point, we can finish the instruction definitions before dividing into sub-categories, the main thing is to get the instruction definitions correct. Whether PUSHINT/POPINT are mandated is a separate problem.

ZFinX's success is not in its Z-formulation, but rather the distinct idea that could have been expressed in other ways including Ψ-extension.

Nor does it fit the Z-formulation model cleanly. ZDinX and ZQinX are not proposed (nor discussed) because, in part,

The misconception here (and it has come up before) is that the F in Zfinx refers to the F-extension, and it doesn’t. It refers to the F-registers.

The F-registers can be 32/64/128-bit, and whatever width they are they get shared with the X registers (which can also be 32/64/128-bit), so the name ZFinX works for all configurations.

Tariq

kasanovic commented 3 years ago

The TG discussed this during fast-int meeting. Some comments:

Code size is not a major concern with ISRs, but performance could be. Hardware stacking of CSRs might be a feature that is implemented without ISA support as part of the interrupt handling mechanism, and this would seem preferable to adding this CSR form of push/pop instruction.

While push/pop int instructions might allow more flexible handler design, most of the handler flexibility would be outside the save/restore mechanism in any case. It is not clear that the parameterization needed for stacking using the push/pop instruction would be any less than the equivalent parameterization in the hardware stacking mechanism.

General push/pop instructions could be useful in a handler, but we would have to understand the impact on interrupt response time (if the instruction is not pre-emptible) or interrupt throughput (if it uses restart to handle preemption). A design that allowed resumable preemption through many interrupt levels could be difficult to support and require additional exposed state to be saved and restored.
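The trade-off between response time and throughput can be illustrated with a very simple cycle model. Everything in it is an assumption made for the sketch: a fixed cost per stored word, a higher-priority interrupt arriving just after the push starts, and a restart policy that finishes the current word, abandons the push, and repeats it from the beginning later.

```python
# Simplified worst-case model for a multi-word push instruction.
# ASSUMPTIONS: constant cycles per word, interrupt arrives right after
# the instruction starts, restart discards all completed stores.

def non_preemptible(n_words, cycles_per_word):
    """The pending interrupt waits for the whole push to finish."""
    return {"latency": n_words * cycles_per_word, "wasted": 0}


def restartable(n_words, cycles_per_word):
    """The push is abandoned at a word boundary and redone later."""
    return {
        # only the word in flight has to complete
        "latency": cycles_per_word,
        # worst case: all but the last word were done, then thrown away
        "wasted": (n_words - 1) * cycles_per_word,
    }
```

For an 8-word frame at one cycle per word, the non-preemptible form adds up to 8 cycles of response time and wastes nothing, while the restartable form adds 1 cycle of response time but can throw away up to 7 cycles of work per preemption. A resumable design would avoid both costs, at the price of the extra exposed state mentioned above.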

Another proposal is to have a separate push/pop instruction only for interrupt CSRs, which would need less encoding space - the main advantage of this instruction would be to allow greater speed with a secondary advantage being fewer bytes of code to fetch in handler.

We think we can delay this discussion until other issues are resolved in the design.

kasanovic commented 2 years ago

We revisited this topic again but decided to defer to broader discussion on providing hardware stacking of interrupt contexts (possibly post-1.0).