[rv_dm] dm/dmi sometimes get stuck when issuing an abstract command

pamaury commented 1 year ago

Description

This issue has been investigated as part of #17729. In short, when issuing an abstract command to read a register, and under circumstances that are not yet understood, the rv_dm module seems to get stuck and the JTAG output is stuck at 1. It is still unclear exactly which component gets stuck: it looks like the DMI might be working while the DM stops responding for a while, so the DMI just sends back 1s. A typical error trace looks like this:

Debug: 12382 1056 riscv-013.c:783 execute_abstract_command(): command=0x22100e; access register, size=32, postexec=0, transfer=1, write=0, regno=0x100e
Debug: 12466 1057 riscv-013.c:397 scan(): 41b w 0022100e @17 -> + fffffffe @7f; 2i
Debug: 12550 1057 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 2i
Debug: 12551 1057 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
Debug: 12661 1058 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12715 1058 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 3i
Debug: 12716 1058 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
Debug: 12826 1059 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12880 1059 riscv-013.c:397 scan(): 41b r 00000000 @16 -> ? 413d5213 @00; 4i
Error: 12881 1059 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=1
Debug: 12991 1060 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
Debug: 12992 1060 riscv-013.c:806 execute_abstract_command(): command 0x22100e failed; abstractcs=0x0
Debug: 13076 1060 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 4i
Debug: 13077 1060 riscv-013.c:407 scan():  cmderr=7 -> 
Debug: 13131 1060 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 4i
Debug: 13132 1060 riscv-013.c:407 scan():  ->  cmderr=7
Debug: 13133 1060 riscv.c:3450 riscv_get_register(): [riscv.tap.0] a4: ffffffffffffffff
Debug: 13134 1060 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
Debug: 13135 1060 gdb_server.c:406 gdb_log_outgoing_packet(): [riscv.tap.0] sending packet: $E0E#ba

See #17729 for more traces.
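
For readers decoding the trace by hand, here is a minimal sketch that splits the command word `0x22100e` and the `abstractcs` value into fields. The bit positions are taken from the Access Register command and `abstractcs` layouts as I understand them from the RISC-V debug specification 0.13; the function names are hypothetical and this is not how OpenOCD itself is structured.

```python
def decode_access_register(command: int) -> dict:
    """Split an abstract 'Access Register' command word into its fields.

    Assumed layout (debug spec 0.13): cmdtype[31:24], aarsize[22:20],
    postexec[18], transfer[17], write[16], regno[15:0].
    """
    aarsize = (command >> 20) & 0x7
    return {
        "cmdtype": (command >> 24) & 0xFF,   # 0 means Access Register
        "size_bits": 8 << aarsize,           # meaningful for aarsize 2, 3, 4
        "postexec": (command >> 18) & 1,
        "transfer": (command >> 17) & 1,
        "write": (command >> 16) & 1,
        "regno": command & 0xFFFF,
    }


def cmderr(abstractcs: int) -> int:
    """Extract the cmderr field, assumed to sit in abstractcs[10:8]."""
    return (abstractcs >> 8) & 0x7


fields = decode_access_register(0x22100E)
print(fields)
print(cmderr(0x700))
```

The decoded fields agree with what OpenOCD prints on the `execute_abstract_command()` line (size=32, postexec=0, transfer=1, write=0, regno=0x100e), and the `0x700` value scanned at address 0x16 covers all three `cmderr` bits.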

When this happens, waiting for the DM to get unstuck and then resetting the DMI appears to allow debugging to continue.
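
As a reading aid for the `DTMCS: 0x10000 -> ...` lines in the trace, the sketch below decodes the DTMCS values that come back. It assumes the field layout from the RISC-V debug specification 0.13 (version[3:0], abits[9:4], dmistat[11:10], idle[14:12], dmireset at bit 16); `decode_dtmcs` is a hypothetical helper, not an OpenOCD function.

```python
DMIRESET = 1 << 16  # assumed position of the dmireset bit in DTMCS


def decode_dtmcs(value: int) -> dict:
    """Decode the read-only DTMCS fields (assumed debug spec 0.13 layout)."""
    return {
        "version": value & 0xF,
        "abits": (value >> 4) & 0x3F,
        "dmistat": (value >> 10) & 0x3,
        "idle": (value >> 12) & 0x7,
    }


stuck = decode_dtmcs(0xFFFFFFFF)    # value seen while the output is all 1s
healthy = decode_dtmcs(0x1071)      # value seen once the DM responds again

# A DMI scan carries abits of address, 32 bits of data and 2 bits of op.
scan_width = healthy["abits"] + 32 + 2

print(hex(DMIRESET), stuck, healthy, scan_width)
```

Under these assumptions, `0x10000` is a write of `dmireset` only, the healthy readback `0x1071` gives abits=7 and idle=1 (consistent with the `41b` scans and `dtmcs_idle=1` in the log), and the all-ones readback is not a plausible DTMCS value since every field, including the version, is saturated.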

andreaskurth commented 1 year ago

Thanks for creating this issue and the workaround in #18051, @pamaury :+1:

I was wondering whether tdo gets sampled while it is undriven by the debug module but pulled up somehow. In all but the Shift states (Shift-IR and Shift-DR), the DMI JTAG TAP indicates that tdo should not be driven by assigning tdo_oe_o = 0. I don't remember off-hand how that is implemented on the FPGA. In any case, OpenOCD should not sample tdo outside the shift states -- do we know if something could be off there?
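
To make the tdo question concrete, here is a small software model of the standard IEEE 1149.1 TAP state machine that reports, for each state reached, whether tdo would be driven under the rule described above (output enable asserted only in Shift-IR and Shift-DR). This is an illustrative sketch only: the state names and the `walk` helper are hypothetical, and it ignores clock-edge details of the real RTL.

```python
# Next state for each TAP state, indexed by the TMS value (0 or 1).
NEXT_STATE = {
    "test_logic_reset": ("run_test_idle", "test_logic_reset"),
    "run_test_idle": ("run_test_idle", "select_dr_scan"),
    "select_dr_scan": ("capture_dr", "select_ir_scan"),
    "capture_dr": ("shift_dr", "exit1_dr"),
    "shift_dr": ("shift_dr", "exit1_dr"),
    "exit1_dr": ("pause_dr", "update_dr"),
    "pause_dr": ("pause_dr", "exit2_dr"),
    "exit2_dr": ("shift_dr", "update_dr"),
    "update_dr": ("run_test_idle", "select_dr_scan"),
    "select_ir_scan": ("capture_ir", "test_logic_reset"),
    "capture_ir": ("shift_ir", "exit1_ir"),
    "shift_ir": ("shift_ir", "exit1_ir"),
    "exit1_ir": ("pause_ir", "update_ir"),
    "pause_ir": ("pause_ir", "exit2_ir"),
    "exit2_ir": ("shift_ir", "update_ir"),
    "update_ir": ("run_test_idle", "select_dr_scan"),
}

SHIFT_STATES = {"shift_dr", "shift_ir"}


def walk(start: str, tms_bits: list) -> list:
    """Return (state, tdo_driven) for each state reached from `start`."""
    state, visited = start, []
    for tms in tms_bits:
        state = NEXT_STATE[state][tms]
        visited.append((state, state in SHIFT_STATES))
    return visited


# One short DR scan: enter Shift-DR, stay one extra cycle, then return to idle.
trace = walk("run_test_idle", [1, 0, 0, 0, 1, 1, 0])
for state, driven in trace:
    print(f"{state:16s} tdo driven: {driven}")
```

In this model tdo is driven only during the two Shift-DR cycles, so any sample taken in Capture, Exit1, Update or Run-Test/Idle would read whatever the undriven line floats to, which would be 1 if the pin is pulled up.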