riscv / riscv-debug-spec

Working Draft of the RISC-V Debug Specification Standard
https://jira.riscv.org/browse/RVG-94
Other
451 stars 93 forks source link

Discovery of incompatible base ISAs #975

Open sorear opened 6 months ago

sorear commented 6 months ago

SIG CHERI is extending RISC-V by adding metadata to each integer register, up to a size of 65 or 129 bits. An external debugger must be aware of the full size of the registers so that it can save and restore them when using a hart to run code.

How should a debug module communicate to an external debugger that harts are using an incompatible base ISA? The configuration structure does not seem to be applicable since it may be read using a hart.

rtwfroody commented 6 months ago

The way to access GPRs is through the abstract Access Register command, which supports the following access sizes: 32 bit, 64 bit, and 128 bit. So to support this extension, the Access Register command must be extended to support XLEN of 65 and 129. The spec is frozen and the public review period ended February 17.

What is the process is for making such changes at this stage?

pdonahue-ventana commented 5 months ago

I'm not an expert on the process but I think that it's too late to suggest substantive changes at this point. According to the org chart, the CHERI TG doesn't exist yet. One option is to have the CHERI TG charter specifically say that they're supposed to create any necessary debug hooks. Then this type of thing could be proposed by that group.

sorear commented 5 months ago

To be clear, CHERI already has a way to access GPRs (which kind of sucks, but enhancing it is a job for the CHERI SIG/TG).

The problem is that a debugger written against the frozen 1.0 debug spec can't know not to use integer Access Register operations against CHERI cores, unless the CHERI SIG bumps dmstatus.version. Is this an appropriate thing to use up one of the 13 remaining dmstatus.version values for, or should 1.0 be modified to allow for cores that "aren't controllable without a TBD extension"?

It would also be possible for the CHERI SIG to define Access Register to fail noisily on non-RVnnI cores, but the 1.0 debug spec doesn't require debuggers to have any error handling for missing required features, so that doesn't actually help.

pdonahue-ventana commented 5 months ago

It doesn't seem like you need to modify the DM to add new 65b and 129b abstract commands. You could just add additional registers (in the range 0x1020-0xbfff) that access the 65th bit. Long ago, I worked on a custom architecture with 36b registers and a normal 32b memory system (which is a long story). We had stores that would store the "normal" 32b of the register and stores that would store the "extra" 4b with 28b of padding. You can similarly extract the extra bits in a second step.

Registers 0x1000-0x101f are the GPRs and if you define (for example) 0x1040-0x105f to be the extra bits of the GPRs then you can do an abstract read of 0x1001, an abstract read of 0x1041, then concatenate them to determine the full value of x1. On a non-CHERI hart, the DM would set cmderr to indicate that this isn't supported. This would all work with the existing DM architecture (and possibly with existing DM implementations if the DM just blindly ships commands off to harts and waits for data or error response).

aap-sc commented 5 months ago

@pdonahue-ventana according to the spec, we have:

0x0000 — 0x0fff CSRs. The ``PC'' can be accessed here through dpc.
0x1000 — 0x101f GPRs
0x1020 — 0x103f Floating point registers
0xc000 — 0xffff Reserved for non-standard extensions and internal use

So the suggested range should be 0xc000 — 0xffff, right?

sorear commented 5 months ago

@pdonahue-ventana let's say we do that. A debugger attaches to a system and needs to read memory, but the System Bus Access feature is not implemented, so it picks a hart at random and halts it. Access Register commands are used to read the old value of x1 and write an address into x1, a ld x1, 0(x1) instruction is executed from the program buffer, then Access Register is used again to read the loaded data and restore the old x1 before resuming the hart. As far as the debugger knows, everything is fine. Much like RV64, CHERI defines all instructions which write to general registers to write to the entire register, so the ld cleared the upper 65 bits of x1, which the debugger didn't know to save and restore, so the code running on the hart (picked at random during a debugger memory operation) most likely crashes shortly after resuming.

rtwfroody commented 5 months ago

@sorear How is this solved for an OS that needs to know whether the hardware it's running on has the CHERI extension? If the OS doesn't use the extra bits, but a user program does, then presumably it has the exact same problem.

sorear commented 5 months ago

@rtwfroody U-mode software can't touch the extra bits unless senvcfg.CRE=1. (sstateen0 doesn't work because some of the state needs to be swapped between U-mode and S-mode, not just different U-mode contexts.) I don't think this helps 1.0-standard debuggers, though.

rtwfroody commented 5 months ago

Thanks. I see two possible solutions:

  1. Like Paul suggested, put the extra bits in a different abstract command register, that must be implemented if CRE is supported. The debugger must check whether the extension is supported by attempting to read that register. This would be backwards compatible.
  2. Relay the size of a GPR to the debugger somewhere, explicitly as a number of bits. This would be more flexible in case there are more such extensions in the future (and people start combining them). There are extra bits in hartinfo that might be used. Then we add another aarsize encoding that just means "get all bits".

I think my preference is for the second option.

I don't know enough about process, but I suspect that @pdonahue-ventana is right and the way to get a change like this done is for the CRE task group to do it.

sorear commented 5 months ago

Unless I'm missing something, neither of those proposals is backward compatible in the sense that trying to use an old debugger will safely give an error message without causing data loss or unpredictable behavior. You've made it possible for debuggers to perform XLEN-bit operations on wider-than-XLEN registers, which destroys high data during program buffer execution.

rtwfroody commented 5 months ago

That is right. Do you have an idea that would be backwards compatible? I don't see how to make it happen, given that a debugger has to save some GPRs, and that there is more state than existing debuggers expect. We could bump the debug version when implementing this, which would cause existing debuggers to complain loudly when they encounter such a new hart.

rtwfroody commented 5 months ago

@sorear if you have time, tomorrow's meeting would be a convenient time to talk about this: https://lists.riscv.org/g/tech-debug/message/1636

sorear commented 5 months ago

@rtwfroody If the current debug draft is completely frozen and anything written against it is considered correct, the only option I'm aware of is to increase dmstatus.version for debug modules if any harts are incompatible with run control as currently specified.

I was hoping there might be some per-hart mechanism for communicating this.

I still don't know how precious the remaining values of dmstatus.version are, and how hard we should try to not use one. Do we expect all revisions of the debug specification to increase dmstatus.version or is it a last resort mechanism?