DebugModule.dataX preservation

Dolu1990 commented 2 years ago

Hi,

I'm currently implementing the spec in hardware and there is one behaviour the spec isn't explicit about (or I missed it / implicit).

Let's say you run one abstract command which return/write data0. Can this new value be usable as a argument of the next abstract command ?

Because if it isn't, it allows more flexibility in the hardware implementation.

Location in the spec : https://github.com/riscv/riscv-debug-spec/blob/eeef4d580971d70b86e3fb498a0536654ae90608/xml/dm_registers.xml#L570

Thanks ^^

timsifive commented 2 years ago

Let's say you run one abstract command which return/write data0. Can this new value be usable as a argument of the next abstract command ?

That is certainly what I had imagined, but I cannot think of a good use case for it. (Copy value from one register to another, I suppose, but that's hardly common.)

I cannot imagine what optimization you can make if this isn't the case. How does it help you be require the debugger to always write data0 for an abstract command which will use it?

pdonahue-ventana commented 2 years ago

Let's say you run one abstract command which return/write data0. Can this new value be usable as a argument of the next abstract command ?

Imagine that the answer is no and that data0 is allowed to spontaneously change. At what point is it allowed to spontaneously change?

Take this normal sequence:

abstract read command that writes data0
debugger reads data0
debugger writes data0
abstract write command that reads data0

If data0 changes between 1 and 2 then it defeats the purpose of the abstract read command. If it changes between 3 and 4 then it defeats the purpose of the abstract write command. I don't see much value in trying to track if we're in between steps 2 and 3 just so data0 can spontaneously change. That would also disallow other sequences such as Tim's copy example where there is no step 3.

Dolu1990 commented 2 years ago

First, sorry for the long post XD

@pdonahue-ventana

Imagine that the answer is no and that data0 is allowed to spontaneously change.

There may be a missunderstanding, i didn't mean that it could spontaneously change, i mean : Should the returned value of a abstract command in dataX be reusable for the next abstract command.

In practice, is the following sequance ok or not :

abstract read command that writes data0
abstract write command that reads data0 (that was written by the abstract command 1)

@pdonahue-ventana + @timsifive

Mostly, from a more abstract perspective, I think it is about "Do we consider dataX to be registers/memory for abstract commands" or "Do we consider dataX as data channels for abstract commands"

If we consider dataX as data channel, it means that they are used by the debuger as a way to write abstract command arguements, and read abstract commands returned values.

How does it help you be require the debugger to always write data0 for an abstract command which will use it?

I didn't mean that data0 is lost after each command which used it, i mean that the data0 returned by a abstract command can't be used as arguement of the nexts abstract commands.

I cannot imagine what optimization you can make if this isn't the case.

I think a drawing is better than 1000 words ^^

To give some context, i'm targeting a FPGA softcore (https://github.com/SpinalHDL/NaxRiscv) So the tradeoff compaired to a ASIC can be quite different. Here logic can be "very expensive".

Basicaly, i'm trying to keep the DM and the CPU as much decoupled as possible, so having the dataX storage done where they are readed.

To sum up, do we realy want "coherency A" and "coherency B" link as described in the following diagram :

Coherency B being what my first post was about.
Coherency A being, if you write data0 using the DMI interface, should you be able to read back that value on the DMI interface

Dolu1990 commented 2 years ago

So, in short :

Should values returned by a abstract commands, be visible by in future abstract commands arguments (coherency B)
Should DMI dataX write be visible by future DMI dataX reads (coherency A)

timsifive commented 2 years ago

Thanks. I hadn't considered FPGAs where single port RAM can be much cheaper than a proper register, and the 32-bit wide mux that would be required to support might be expensive.

There's another issue in this design. DMI writes to data0 followed by DMI reads from data0 won't return the value that was just written. Like having abstract commands use the dataX results written by the previous command, this is not something I'd really expect a debugger to do.

It wouldn't be terrible to allow such implementations, but it does make the spec more complicated (what do you call a "register" that acts this way) and it's also not backwards compatible. Might there be abstract commands in the future where this behavior would be a problem? How helpful is this optimization really? Do you have a rough estimate of FPGA resources used by a debug implementation that does/does not use this optimization?

Dolu1990 commented 2 years ago

Thanks ^^ @timsifive

There's another issue in this design. DMI writes to data0 followed by DMI reads from data0 won't return the value that was just written.

Yes right, it is "solved" with the "coherency B" link in the second diagram (i think). It could be named an "issue", but from some perspetive it could be named a "feature", allowing eventualy to save hardware space.

what do you call a "register" that acts this way

Right, it would not be called a register (assuming register is synonym of flip flop), but instead it would more be specified as a interface to :

Write the arguments values
Read abstract command response

Where abstract command response may override some the written arguements (implementation specific)

Might there be abstract commands in the future where this behavior would be a problem?

Hard to tell for me, started reading the spec only a few days ago ^^

How helpful is this optimization really? Do you have a rough estimate of FPGA resources used by a debug implementation that does/does not use this optimization?

Two benefits : 1) For multiple harts configs interfaced from a single DM, it could allow a better decoupling between the DM and the harts, as it remove the notion of centralized memory ( in the current spec, both the harts and the DM have read / write access to that memory ). 2) Area in FPGA softcore. So, if we resume this issue to "Should you implement coherency A and B" Then it is around 80 LUT ( Cyclone IV / ECP5 FPGA)

(see https://github.com/SpinalHDL/VexRiscv#area-usage-and-maximal-frequency for some softcore size)

So, 80 luts in itself may be a bearable cost for a linux capable softcore, but it is noticable for small baremetal ones. Also, if many bearable cost accumulates, it endup producing oversized designs.

Here is an example of an additional element of the spec which may add hardware for little value (as far as i understand it): progbuf0-15 are read/write registers, which in practice (i guess), serve no real purpose ?

timsifive commented 2 years ago

So, 80 luts in itself may be a bearable cost for a linux capable softcore, but it is noticable for small baremetal ones. Also, if many bearable cost accumulates, it endup producing oversized designs.

Thanks for digging that up.

Here is an example of an additional element of the spec which may add hardware for little value (as far as i understand it): progbuf0-15 are read/write registers, which in practice (i guess), serve no real purpose ?

Typical implementations only need 2 of them, and that's allowed by the spec. But it's true that there's not much use for these registers being read/write. Is that what you meant?

What FPGA primitives are you using in the end? Doing some quick Googling, it appears that Xilinx Block RAMs all have 2 read/write ports. With that, can't you implement registers in the block RAM as the spec says?

Dolu1990 commented 2 years ago

Typical implementations only need 2 of them, and that's allowed by the spec. But it's true that there's not much use for these registers being read/write. Is that what you meant?

Yes :)

What FPGA primitives are you using in the end?

Currently i'm targeting Xilinx Artix 7 FPGA, but in the future, Altera Cyclone V and Lattice ECP 5 too. (All the FPGA with distributed ram)

it appears that Xilinx Block RAMs all have 2 read/write ports. With that, can't you implement registers in the block RAM as the spec says?

Yes, you are right, but for a few reasons i'm trying to avoid it :

For instance, with a Artix 7 FPGA, the smallest block ram you can get is 2KB. Also, in true dual port mode, it only allows 18 bits of data width, so, you will need two of them (if you want to avoid adding some additional multi cycle logic). So overall it would be 4 KB of block ram used, which is much, especialy considered we only realy need a few bytes. That's why i had in mind to use distributed ram instead (Lut based ram, no flip-flop/block-ram involved).
Using true dual port ram works in the case you have 1 DM for 1 CPU, but in the case you have 1 DM for multiple CPU, the issue come back.

Once everything works in simulation, i will give some synthesis run, to see how much ressources are used by the whole debug infrastructure, will post it here, as it can be interresting.

Anyway, thanks for the issue tracking ^^ Have a good weekend !

Dolu1990 commented 2 years ago

Got it to work in simulation (GDB >> openocd >> 4 pin jtag from a TCP socket >> verilator testbench >> DTM >> DMI >> DM >> NaxRiscv) Total size of the added hardware = 436 Artix 7 LUT, GDB seems ok, a few things aren't implemented, as reading the progbuff from the DMI and the coherency A/B for dataX (see diagrams in the previous post), as well as most optional features.

To be honnest, i was expecting the whole thing to use more resources ^^

Dolu1990 commented 2 years ago

Got openocd/GDB to work for both rv32/rv64. Here is my "final" config :

Implement access register to x0-x31 only
Implement programm buffer for everything else
Only provide enough dataX storage for abstract command arg0 / return values
No coherency for DM dataX write to DM dataX read
No coherency for abstract commands return to abstract commands arg0
Write only progbuf

Thanks for the help ^^

timsifive commented 2 years ago

I'd like to keep this open until we figure out for sure that what you did is allowed by the spec. It's a good optimization that I'd like to allow, but that is a non-backwards-compatible change. Hopefully we can discuss it at tomorrow's meeting.

pdonahue-ventana commented 2 years ago

This was addressed by #728 and #731.

timsifive commented 1 year ago

The Architecture Review Committee recommends we drop Message Registers from the spec, or do a bunch of extra work . In the interest of finally getting 1.0 ratified, we're dropping Message Registers. Sorry if this leads to problems for your implementation. Current OpenOCD does not assume it can read back any values it wrote to the data argument registers, so it should work well for you.

Dolu1990 commented 1 year ago

Thanks for the update :)

riscv / riscv-debug-spec

DebugModule.dataX preservation #717