Observations regarding rom_e2e_debug_test_otp_test_unlocked0
I tried to debug the problem: this test fails consistently on fred. Steps to get more debugging:
- Add `debug_level 3` to `util/openocd/target/lowrisc_earlgrey.cfg` => very verbose output.
- Patch `rules/script/gdb_test_coordinator.py` to keep logging after GDB exits:
# In __init__(): change this line to remove styling:
self.console = rich.console.Console()  # no more colors

# In echo(), add:
if line == "":
    return

# Add this method (flush pending subprocess output for ~1 second):
def flush_all(self) -> None:
    now = time.time_ns()
    while time.time_ns() <= now + 1_000_000_000:
        self.maybe_print_output(1)

# In main:
while not background.empty():
    ....
if returncode != 0:
    # wait a little bit so that we can flush the output of all processes
    background.flush_all()  # <--- this is new
    sys.exit(returncode)
Conclusion: the test fails when reading from RAM (`x/16xb 0x10000000`). OpenOCD output:
Debug: 34789 6184 gdb_server.c:390 gdb_log_incoming_packet(): [riscv.tap.0] received packet: m10000000,1
Debug: 34790 6184 gdb_server.c:1524 gdb_read_memory_packet(): addr: 0x0000000010000000, len: 0x00000001
Debug: 34791 6184 target.c:2475 target_read_buffer(): reading buffer of 1 byte at 0x10000000
Debug: 34792 6184 riscv.c:3306 riscv_set_current_hartid(): setting hartid to 0, was 0
Debug: 34793 6184 riscv-013.c:4077 riscv013_get_register(): [riscv.tap.0] reading register priv
Debug: 34794 6184 riscv.c:3306 riscv_set_current_hartid(): setting hartid to 0, was 0
Debug: 34795 6184 riscv-013.c:782 execute_abstract_command(): command=0x2207b0; access register, size=32, postexec=0, transfer=1, write=0, regno=0x7b0
.... 013.c:407 scan(): -> sberror=7
Debug: 34832 6191 riscv-013.c:2850 log_mem_access_result(): Failed to read memory via system bus.
Error: 34833 6191 riscv-013.c:3598 read_memory(): Target riscv.tap.0: Failed to read memory (addr=0x10000000)
Error: 34834 6191 riscv-013.c:3599 read_memory(): progbuf=disabled, sysbus=failed, abstract=disabled
Debug: 34835 6191 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
Remark: reading a register works fine, for example:
echo ::: Read memory (LC_CTRL_STATE)
x/4xb 0x40140034
More observations: after setting `sp`, GDB seems to want to read the stack all the time, but it fails because `sp` contains a nonsense value. Also, reading the RAM before this step works, but fails after:
[GDB] ::: Read memory (LC_CTRL_STATE)
[GDB] 0x40140034: 0x21 0x84 0x10 0x02
[GDB] ::: Read RAM
[GDB] 0x10000000 <boot_measurements>: 0x10
[GDB] :::: Read and write GPRs.
[GDB] :::: Copy registers.
[GDB] ::: Read RAM 2
[GDB] 0x10000000 <boot_measurements>: 0x10
[GDB] :::: Set registers.
[GDB] :::: Set zero to 0xdeadbeef.
[GDB] :::: Set ra to 0xdeadbeef.
[GDB] :::: Set sp to 0xdeadbeef.
[OPENOCD] Warn : Failed to read memory via system bus.
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf00)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf0b)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf00)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf0b)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf00)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf0b)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf00)
[OPENOCD] Error: progbuf=disabled, sysbus=failed, abstract=disabled
[OPENOCD] Error: Target riscv.tap.0: Failed to read memory (addr=0xdeadbf0b)
[GDB] :::: Set gp to 0xdeadbeef.
I was able to confirm independently (using the OpenOCD jtag struct in opentitanlib) that once a read using the sysbus fails, all subsequent reads seem to fail. I don't know yet if this is an OpenOCD problem or a hardware problem.
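For illustration, here is roughly how such a check can be scripted against OpenOCD's TCL RPC server (port 6666, messages terminated by a 0x1a byte). This is a minimal sketch, not the actual opentitanlib code; it assumes OpenOCD is already running with the Earl Grey config, and the `read_memory` commands are the same ones that appear in the traces below:

import socket

TERM = b"\x1a"  # OpenOCD's TCL RPC message terminator

def ocd_cmd(sock: socket.socket, cmd: str) -> str:
    # Send one command to OpenOCD's TCL server and read the reply.
    sock.sendall(cmd.encode() + TERM)
    buf = b""
    while not buf.endswith(TERM):
        buf += sock.recv(4096)
    return buf[:-1].decode()

with socket.create_connection(("localhost", 6666)) as s:
    print(ocd_cmd(s, "read_memory 0x10000000 8 4 phys"))   # works initially
    print(ocd_cmd(s, "read_memory 0xdeadbeef 32 1 phys"))  # bad read sets sberror
    print(ocd_cmd(s, "read_memory 0x10000000 8 4 phys"))   # now fails as well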
Couple of things:
- Something else could be causing the similar issues in `e2e_chip_specific_startup_rma_fpga_cw310_rom`, since that test does not use the JTAG/OpenOCD/GDB infra.
- I have only investigated `rom_e2e_debug_test_otp_test_unlocked0` so far. In my tests, it has not worked a single time yet (i.e. no flakiness at all, it just plain does not work), both for `rom_e2e_debug_test_otp_test_unlocked0` and with openocd+jtag+opentitanlib. I have also tried with OpenOCD 0.12 (we are still using -rc1 until #17646 gets merged), same result.
Here is an interesting trace using jtag+openocd directly:
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 480 1431 command.c:155 script_debug(): command - read_memory 0x10000000 8 4 phys
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 481 1431 riscv.c:2186 riscv_openocd_poll(): polling all harts
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 482 1431 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 483 1431 riscv.c:2091 riscv_poll_hart(): polling hart 0, target->state=2
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 484 1432 riscv-013.c:397 scan(): 41b r 00000000 @11 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 485 1433 riscv-013.c:397 scan(): 41b - 00000000 @11 -> + 00030382 @11; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 486 1433 riscv-013.c:407 scan(): -> allresumeack anyresumeack allhalted anyhalted authenticated version=2
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 487 1433 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 488 1434 riscv-013.c:397 scan(): 41b w 00118000 @38 -> + 00000000 @11; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 489 1434 riscv-013.c:407 scan(): sbreadonaddr sbautoincrement sbreadondata ->
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 490 1434 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00118000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 491 1434 riscv-013.c:407 scan(): -> sbreadonaddr sbautoincrement sbreadondata
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 492 1435 riscv-013.c:397 scan(): 41b w 10000000 @39 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 493 1435 riscv-013.c:397 scan(): 41b - 00000000 @39 -> + 10000000 @39; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 494 1435 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @39; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 495 1436 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 30aad985 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 496 1436 riscv-013.c:2507 log_memory_access(): M[0x10000000] reads 0x85
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 497 1436 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 0030aad9 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 498 1436 riscv-013.c:2507 log_memory_access(): M[0x10000001] reads 0xd9
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 499 1436 riscv-013.c:397 scan(): 41b - 00000000 @00 -> + 000030aa @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 500 1436 riscv-013.c:2507 log_memory_access(): M[0x10000002] reads 0xaa
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 501 1437 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @00; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 502 1437 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20118407 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 503 1437 riscv-013.c:407 scan(): -> sbversion=1 sbreadonaddr sbautoincrement sbreadondata sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 504 1438 riscv-013.c:397 scan(): 41b w 00110000 @38 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 505 1438 riscv-013.c:407 scan(): sbreadonaddr sbautoincrement ->
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 506 1438 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00110000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 507 1438 riscv-013.c:407 scan(): -> sbreadonaddr sbautoincrement
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 508 1438 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 509 1439 riscv-013.c:397 scan(): 41b - 00000000 @3c -> + 00000030 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 510 1439 riscv-013.c:2507 log_memory_access(): M[0x10000003] reads 0x30
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 511 1439 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 512 1440 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20110407 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 513 1440 riscv-013.c:407 scan(): -> sbversion=1 sbreadonaddr sbautoincrement sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 514 1440 riscv-013.c:2848 log_mem_access_result(): Succeeded to read memory via system bus.
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] User : 515 1440 options.c:52 configuration_output_handler(): 0x85 0xd9 0xaa 0x30
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] User : 516 1440 options.c:52 configuration_output_handler():
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 517 1483 command.c:155 script_debug(): command - read_memory 0xdeadbeef 32 1 phys
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 518 1484 riscv.c:2186 riscv_openocd_poll(): polling all harts
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 519 1484 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 520 1484 riscv.c:2091 riscv_poll_hart(): polling hart 0, target->state=2
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 521 1484 riscv-013.c:397 scan(): 41b r 00000000 @11 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 522 1484 riscv-013.c:397 scan(): 41b - 00000000 @11 -> + 00030382 @11; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 523 1484 riscv-013.c:407 scan(): -> allresumeack anyresumeack allhalted anyhalted authenticated version=2
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 524 1484 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 525 1485 riscv-013.c:397 scan(): 41b w 00150000 @38 -> + 00000000 @11; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 526 1485 riscv-013.c:407 scan(): sbreadonaddr sbaccess=2 sbautoincrement ->
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 527 1485 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00150000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 528 1485 riscv-013.c:407 scan(): -> sbreadonaddr sbaccess=2 sbautoincrement
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 529 1486 riscv-013.c:397 scan(): 41b w deadbeef @39 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 530 1486 riscv-013.c:397 scan(): 41b - 00000000 @39 -> + deadbeef @39; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 531 1486 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @39; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 532 1487 riscv-013.c:397 scan(): 41b - 00000000 @3c -> + 00000030 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 533 1487 riscv-013.c:2507 log_memory_access(): M[0xdeadbeef] reads 0x00000030
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 534 1487 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @3c; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 535 1488 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20153407 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 536 1488 riscv-013.c:407 scan(): -> sbversion=1 sbreadonaddr sbaccess=2 sbautoincrement sberror=3 sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 537 1488 riscv-013.c:397 scan(): 41b w 00007000 @38 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 538 1488 riscv-013.c:407 scan(): sberror=7 ->
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 539 1488 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00007000 @38; 2i
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Debug: 540 1488 riscv-013.c:407 scan(): -> sberror=7
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Warn : 541 1488 riscv-013.c:2846 log_mem_access_result(): Failed to read memory via system bus.
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Error: 542 1488 riscv-013.c:3596 read_memory(): Target riscv.tap.0: Failed to read memory (addr=0xdeadbeef)
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Error: 543 1488 riscv-013.c:3597 read_memory(): progbuf=disabled, sysbus=failed, abstract=disabled
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] Error: 544 1488 target.c:4707 target_jim_read_memory(): read_memory: read at 0xdeadbeef with width=32 and count=1 failed
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr] User : 545 1488 command.c:608 command_run_line(): read_memory: failed to read memory
[2023-03-29T16:13:10Z INFO opentitanlib::util::openocd::stderr]
We can see that right after the bad read, `sberror=3` (alignment error, which is correct). After that, OpenOCD tries to clear the error with `sberror=7` (`sberror` is r/w1c), but then `sberror=7` sticks, which is clearly wrong...
Looking at `/home/pamaury/project/opentitan/hw/vendor/pulp_riscv_dbg/src/dm_csrs.sv`, line 460:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'd1)}};
This seems suspicious: it clears the errors when writing 1 instead of 7 (which is what the spec requires). The spec says:

> When the Debug Module's system bus master encounters an error, this field gets set. The bits in this field remain set until they are cleared by writing 1 to them. While this field is non-zero, no more system bus accesses can be initiated by the Debug Module.

Need a Verilog expert to confirm.
Looking at the current RTL, the update mechanism for the SBCS CSR is:
- check if the `sberror` field of the DMI request is equal to `001`, and
- if so, AND the `sberror` field of the SBCS CSR with `000` to clear it.

I think the issue is that the DMI request data bits expected should be `111` (i.e., 7), as this is what OpenOCD sends, and is what matches the spec (as @pamaury points out).

So I think this line should be:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'b111)}};

instead of:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'd1)}};
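To make the mismatch concrete, here is a small Python model of the two update rules (purely illustrative, not the RTL itself):

def sberror_update_rtl(current: int, written: int) -> int:
    # Current RTL: the whole field is cleared only when the value
    # written to sberror is exactly 3'd1.
    return 0b000 if written == 0b001 else current

def sberror_update_spec(current: int, written: int) -> int:
    # Spec behaviour: per-bit write-1-to-clear.
    return current & ~written & 0b111

# OpenOCD writes 0b111 to clear sberror=7:
assert sberror_update_rtl(0b111, 0b111) == 0b111   # error sticks (the bug)
assert sberror_update_spec(0b111, 0b111) == 0b000  # error cleared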
Or perhaps just...
sbcs_d.sberror = sbcs_q.sberror & ~sbcs.sberror;
The original author may have misinterpreted the meaning of "write 1 to clear" :)
Well, the problem with:
sbcs_d.sberror = sbcs_q.sberror & ~sbcs.sberror;
is that it could cause partial clearing of the `sberror` field.
That may not be an issue; the spec is unclear from what I can tell, so perhaps this is OK? The reason I doubt it would be OK, though, is that the error code uses all bits, so a partial clear would change the meaning of the error code.
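A quick worked example of that concern:

# sberror = 3 (0b011, alignment error); the host writes only bit 0.
# With per-bit W1C the field becomes 0b010, i.e. it now reads as a
# different error code (2) instead of "no error".
current, written = 0b011, 0b001
assert (current & ~written & 0b111) == 0b010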
I submitted an issue on the matter here: https://github.com/pulp-platform/riscv-dbg/issues/154#issue-1646309814
CC: @msfschaffner @andreaskurth
It's a funky trade-off, and the spec should probably clarify that... Writing `3'b111` is really the only right thing to do from the host side, since this isn't a bit map or one-hot encoding, but on the OT side, rejecting writes that only explicitly clear the set bits also doesn't seem right.
A different approach might be...
// If we clear one, we clear all
sbcs_d.sberror = sbcs_q.sberror & ~{3{|sbcs.sberror}};
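Modelled in Python like the sketches above (again just an illustration), this variant behaves as:

def sberror_update_clear_all(current: int, written: int) -> int:
    # "If we clear one, we clear all": any nonzero write clears the field.
    return 0b000 if written != 0 else current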
Yep, agreed, either works; I think whichever mechanism is used just needs to clear all bits of the field.
Thanks for flagging and investigating this; I'll look into this on Friday (I'm OoO tomorrow)
sg, thanks @andreaskurth !
@andreaskurth Should the spec be clarified in this respect? I don't know if one can just do a PR on this spec repo.
I don't think the spec should be changed but the RTL should, see my response on the upstream issue. Let's give the upstream maintainers a bit of time to consider this. If we don't get a reply, we can patch it in this repo
More tests:
- `debug_disallowed_in_prod_fpga_cw310_test_otp_rma`, `rom_e2e_asm_watchdog_bark_fpga_cw310_test_otp_rma` and `rom_e2e_asm_watchdog_bark_fpga_cw310_test_otp_rma` are currently passing on fred every time.
- `rom_e2e_asm_watchdog_bite_fpga_cw310_test_otp_dev` had one failure followed by a success. The logs are not enough to really understand the failure. It was not a USB error; rather, the GDB script set up a breakpoint that was never triggered. It was not possible to tell from the logs whether this was a hardware problem or whether the software just followed an unexpected execution flow.
- `sram_program_fpga_cw310_test_otp_test_unlocked0` seems to fail because of the OpenOCD issue above, but there is a twist worth investigating: it sets `sp` to `0x10020000`, so GDB will read the last 64 bytes of stack by issuing a read of 64 bytes at `0x1001ffc0`, which should be fine. However, this fails when reaching address `0x1001ffcc` with `sberror=7`:
[OPENOCD] Debug: 31258 5441 gdb_server.c:390 gdb_log_incoming_packet(): [riscv.tap.0] received packet: m1001ffc0,40
[OPENOCD] Debug: 31259 5441 gdb_server.c:1524 gdb_read_memory_packet(): addr: 0x000000001001ffc0, len: 0x00000040
[OPENOCD] Debug: 31260 5441 target.c:2475 target_read_buffer(): reading buffer of 64 byte at 0x1001ffc0
[OPENOCD] Debug: 31261 5441 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[OPENOCD] Debug: 31262 5441 riscv-013.c:4075 riscv013_get_register(): [riscv.tap.0] reading register priv
[OPENOCD] Debug: 31263 5441 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[OPENOCD] Debug: 31264 5441 riscv-013.c:782 execute_abstract_command(): command=0x2207b0; access register, size=32, postexec=0, transfer=1, write=0, regno=0x7b0
[OPENOCD] Debug: 31265 5441 riscv-013.c:397 scan(): 41b w 002207b0 @17 -> + 00000000 @38; 2i
[OPENOCD] Debug: 31266 5442 riscv-013.c:397 scan(): 41b r 00000000 @16 -> + 002207b0 @17; 2i
[OPENOCD] Debug: 31267 5442 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 08000002 @16; 2i
[OPENOCD] Debug: 31268 5442 riscv-013.c:407 scan(): -> progbufsize=8 datacount=2
[OPENOCD] Debug: 31269 5443 riscv-013.c:397 scan(): 41b r 00000000 @04 -> + 00000000 @16; 2i
[OPENOCD] Debug: 31270 5443 riscv-013.c:397 scan(): 41b - 00000000 @04 -> + 400000c3 @04; 2i
[OPENOCD] Debug: 31271 5443 riscv-013.c:1492 register_read_direct(): {0} dcsr = 0x400000c3
[OPENOCD] Debug: 31272 5443 riscv.c:3450 riscv_get_register(): [riscv.tap.0] priv: 3
[OPENOCD] Debug: 31273 5443 riscv.c:3433 riscv_get_register(): [riscv.tap.0] mstatus: 80 (cached)
[OPENOCD] Debug: 31274 5443 riscv.c:1546 riscv_mmu(): SATP/MMU ignored in Machine mode (mstatus=0x80).
[OPENOCD] Debug: 31275 5443 riscv-013.c:397 scan(): 41b w 00158000 @38 -> + 00000000 @04; 2i
[OPENOCD] Debug: 31276 5443 riscv-013.c:407 scan(): sbreadonaddr sbaccess=2 sbautoincrement sbreadondata ->
[OPENOCD] Debug: 31277 5444 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00158000 @38; 2i
[OPENOCD] Debug: 31278 5444 riscv-013.c:407 scan(): -> sbreadonaddr sbaccess=2 sbautoincrement sbreadondata
[OPENOCD] Debug: 31279 5444 riscv-013.c:397 scan(): 41b w 1001ffc0 @39 -> + 00000000 @38; 2i
[OPENOCD] Debug: 31280 5444 riscv-013.c:397 scan(): 41b - 00000000 @39 -> + 1001ffc0 @39; 2i
[OPENOCD] Debug: 31281 5445 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @39; 2i
[OPENOCD] Debug: 31282 5445 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31283 5445 riscv-013.c:2507 log_memory_access(): M[0x1001ffc0] reads 0xe58d9c61
[OPENOCD] Debug: 31284 5445 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31285 5445 riscv-013.c:2507 log_memory_access(): M[0x1001ffc4] reads 0xe58d9c61
[OPENOCD] Debug: 31286 5446 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31287 5446 riscv-013.c:2507 log_memory_access(): M[0x1001ffc8] reads 0xe58d9c61
[OPENOCD] Debug: 31288 5446 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31289 5446 riscv-013.c:2507 log_memory_access(): M[0x1001ffcc] reads 0xe58d9c61
[OPENOCD] Debug: 31290 5446 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31291 5446 riscv-013.c:2507 log_memory_access(): M[0x1001ffd0] reads 0xe58d9c61
[OPENOCD] Debug: 31292 5447 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31293 5447 riscv-013.c:2507 log_memory_access(): M[0x1001ffd4] reads 0xe58d9c61
[OPENOCD] Debug: 31294 5447 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31295 5447 riscv-013.c:2507 log_memory_access(): M[0x1001ffd8] reads 0xe58d9c61
[OPENOCD] Debug: 31296 5447 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31297 5447 riscv-013.c:2507 log_memory_access(): M[0x1001ffdc] reads 0xe58d9c61
[OPENOCD] Debug: 31298 5448 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31299 5448 riscv-013.c:2507 log_memory_access(): M[0x1001ffe0] reads 0xe58d9c61
[OPENOCD] Debug: 31300 5448 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31301 5448 riscv-013.c:2507 log_memory_access(): M[0x1001ffe4] reads 0xe58d9c61
[OPENOCD] Debug: 31302 5448 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31303 5448 riscv-013.c:2507 log_memory_access(): M[0x1001ffe8] reads 0xe58d9c61
[OPENOCD] Debug: 31304 5449 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31305 5449 riscv-013.c:2507 log_memory_access(): M[0x1001ffec] reads 0xe58d9c61
[OPENOCD] Debug: 31306 5449 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31307 5449 riscv-013.c:2507 log_memory_access(): M[0x1001fff0] reads 0xe58d9c61
[OPENOCD] Debug: 31308 5449 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31309 5449 riscv-013.c:2507 log_memory_access(): M[0x1001fff4] reads 0xe58d9c61
[OPENOCD] Debug: 31310 5450 riscv-013.c:397 scan(): 41b - 00000000 @00 -> + e58d9c61 @3c; 2i
[OPENOCD] Debug: 31311 5450 riscv-013.c:2507 log_memory_access(): M[0x1001fff8] reads 0xe58d9c61
[OPENOCD] Debug: 31312 5450 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @00; 2i
[OPENOCD] Debug: 31313 5450 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 2015f407 @38; 2i
[OPENOCD] Debug: 31314 5450 riscv-013.c:407 scan(): -> sbversion=1 sbreadonaddr sbaccess=2 sbautoincrement sbreadondata sberror=7 sbasize=32 sbaccess32 sbaccess16 sbaccess8
[OPENOCD] Debug: 31315 5451 riscv-013.c:397 scan(): 41b w 00150000 @38 -> + 00000000 @38; 2i
[OPENOCD] Debug: 31316 5451 riscv-013.c:407 scan(): sbreadonaddr sbaccess=2 sbautoincrement ->
[OPENOCD] Debug: 31317 5451 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00150000 @38; 2i
[OPENOCD] Debug: 31318 5451 riscv-013.c:407 scan(): -> sbreadonaddr sbaccess=2 sbautoincrement
[OPENOCD] Debug: 31319 5451 riscv-013.c:397 scan(): 41b w 00007000 @38 -> + 00000000 @38; 2i
[OPENOCD] Debug: 31320 5451 riscv-013.c:407 scan(): sberror=7 ->
[OPENOCD] Debug: 31321 5452 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00007000 @38; 2i
[OPENOCD] Debug: 31322 5452 riscv-013.c:407 scan(): -> sberror=7
[OPENOCD] Warn : 31323 5452 riscv-013.c:2846 log_mem_access_result(): Failed to read memory via system bus.
[OPENOCD] Error: 31324 5452 riscv-013.c:3596 read_memory(): Target riscv.tap.0: Failed to read memory (addr=0x1001ffc0)
[OPENOCD] Error: 31325 5452 riscv-013.c:3597 read_memory(): progbuf=disabled, sysbus=failed, abstract=disabled
Interestingly, just reading at `0x1001ffcc` works on its own when done with OpenOCD only and without the SRAM/ePMP setup.
New finding on `sram_program_fpga_cw310_test_otp_test_unlocked0`: the following fails:
echo :::: Strange memory test\\n
monitor write_memory 0x10000420 32 { 0xdeadbeef } phys
# gap at 0x10000424
monitor write_memory 0x10000428 32 { 0xbeefdead } phys
echo :::: Read at 0x10000420\\n
monitor read_memory 0x10000420 32 1 phys
echo :::: Read at 0x10000428\\n
monitor read_memory 0x10000428 32 1 phys
echo :::: Read at 0x10000424\\n
monitor read_memory 0x10000424 32 1 phys
Conclusion: reading a memory location in the main RAM that was never written to fails!
This explains the test failure: GDB tries to read the top of the stack, but `sp` is pointing to the top of the RAM and the stack is not initialised.
This could be due to ECC check failures caused by memory scrambling (thanks @engdoreis for the suggestion). To be confirmed.
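If that hypothesis holds, one hypothetical workaround sketch (not necessarily what the PR below does) is to initialize the region before GDB reads it, reusing the `ocd_cmd()` TCL-RPC helper sketched earlier:

# Zero-fill the last 64 bytes below the top of RAM (0x10020000) so the
# never-written, scrambled locations get valid integrity bits before
# GDB reads the top of the stack.
top_of_ram = 0x10020000
words = 16
zeros = " ".join(["0x0"] * words)
ocd_cmd(s, f"write_memory {top_of_ram - 4 * words:#x} 32 {{{zeros}}} phys")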
PR to solve `sram_program_fpga_cw310_test_otp_test_unlocked0`: #17815
I have done several manual runs of the nightly regression tests with more logging.
Summary:
- `sram_program_fpga_cw310_test_otp_dev` and `rom_e2e_debug_test_otp_rma` failed
- `asm_interrupt_handler_fpga_cw310_test_otp_test_unlocked0` and `rom_e2e_debug_test_otp_test_unlocked0` are more random

The failure looks random; there isn't a single cause: sometimes the hart is unexpectedly reset, sometimes OpenOCD simply doesn't manage to talk to the core.
Assuming this is not resolved yet, can you summarize the logs?
I ran a special "nightly" CI test where I repeated a limited selection of tests 10 times each to see what happens, and I managed to get a fair number of failures: https://dev.azure.com/lowrisc/opentitan/_build/results?buildId=112397&view=logs&j=ca4fe378-ea04-5a89-5894-29913db80385&t=05ba4405-1aa2-5a0c-7d47-3e8a33bd2fee
Interestingly, many failures are similar and show this kind of pattern:
[OPENOCD] Error: failed read at 0x16, status=1
[OPENOCD] Info : dropped 'gdb' connection
[GDB] sw/device/silicon_creator/rom/e2e/rom_e2e_asm_watchdog_bark_fpga_cw310_test_otp_test_unlocked0.gdb:2: Error in sourced command file:
GDB exited with code 1
[GDB] Could not read registers; remote failure reply 'E0E'
I am not sure yet what it means, but it seems to happen with registers 0x16 and 0x17, and the remote failure reply is also quite unusual. I plan to investigate this in more detail tomorrow.
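For reference, the DMI register addresses appearing in these logs map to the following names (per the RISC-V Debug Spec 0.13), shown here as a small Python table:

DMI_REGS = {
    0x04: "data0",       # abstract command data
    0x11: "dmstatus",
    0x16: "abstractcs",  # abstract command control/status (holds cmderr)
    0x17: "command",     # abstract command
    0x38: "sbcs",        # system bus control/status (holds sberror)
    0x39: "sbaddress0",
    0x3c: "sbdata0",
}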
EDIT: I re-ran the test with a lot more debug, output is here and here is the relevant part:
2023-04-11T09:51:16.1727711Z [OPENOCD] Debug: 943 315 riscv-013.c:782 execute_abstract_command(): command=0x221009; access register, size=32, postexec=0, transfer=1, write=0, regno=0x1009
2023-04-11T09:51:16.1728045Z [OPENOCD] Debug: 944 316 riscv-013.c:397 scan(): 41b w 00221009 @17 -> + 00000000 @04; 2i
2023-04-11T09:51:16.1728343Z [OPENOCD] Debug: 945 316 riscv-013.c:397 scan(): 41b r 00000000 @16 -> F ffffffff @7f; 2i
2023-04-11T09:51:16.1728633Z [OPENOCD] Error: 946 316 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=2
2023-04-11T09:51:16.1728953Z [OPENOCD] Debug: 947 316 riscv-013.c:805 execute_abstract_command(): command 0x221009 failed; abstractcs=0x0
2023-04-11T09:51:16.1729264Z [OPENOCD] Debug: 948 316 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 2i
2023-04-11T09:51:16.1729538Z [OPENOCD] Debug: 949 316 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1729845Z [OPENOCD] Debug: 950 316 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
2023-04-11T09:51:16.1730163Z [OPENOCD] Debug: 951 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1730461Z [OPENOCD] Debug: 952 317 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 3i
2023-04-11T09:51:16.1730775Z [OPENOCD] Debug: 953 317 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1731157Z [OPENOCD] Debug: 954 317 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
2023-04-11T09:51:16.1731480Z [OPENOCD] Debug: 955 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1731884Z [OPENOCD] Debug: 956 318 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 4i
2023-04-11T09:51:16.1732156Z [OPENOCD] Debug: 957 318 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1732465Z [OPENOCD] Debug: 958 318 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=5, ac_busy_delay=0
2023-04-11T09:51:16.1732782Z [OPENOCD] Debug: 959 318 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1733080Z [OPENOCD] Debug: 960 318 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 5i
2023-04-11T09:51:16.1733352Z [OPENOCD] Debug: 961 318 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1733657Z [OPENOCD] Debug: 962 318 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=6, ac_busy_delay=0
2023-04-11T09:51:16.1733970Z [OPENOCD] Debug: 963 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1734270Z [OPENOCD] Debug: 964 319 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 6i
2023-04-11T09:51:16.1734543Z [OPENOCD] Debug: 965 319 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1734845Z [OPENOCD] Debug: 966 319 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=7, ac_busy_delay=0
2023-04-11T09:51:16.1735159Z [OPENOCD] Debug: 967 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
2023-04-11T09:51:16.1735457Z [OPENOCD] Debug: 968 320 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 7i
2023-04-11T09:51:16.1735729Z [OPENOCD] Debug: 969 320 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-11T09:51:16.1736013Z [OPENOCD] Debug: 970 320 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 7i
2023-04-11T09:51:16.1736285Z [OPENOCD] Debug: 971 320 riscv-013.c:407 scan(): -> cmderr=7
2023-04-11T09:51:16.1736576Z [OPENOCD] Debug: 972 320 riscv.c:3452 riscv_get_register(): [riscv.tap.0] s1: ffffffffffffffff
2023-04-11T09:51:16.1736884Z [OPENOCD] Debug: 973 320 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
Analysis: a command fails for some reason and sets `cmderr` to `7`; then OpenOCD tries to clear the error by writing `7` to it as per the spec, but this does nothing. @andreaskurth could you have a look at this?
Line 1: Abstract command for a register read of 32 bit from GPR `x9` = `s1`.
(Lines 2 and 3: I don't know the code for `scan()`, what exactly do those mean?)
Line 4: Read from Abstract Control and Status (`abstractcs`) register fails in `dmi_op_timeout()` -- what does `status=2` mean? That `dtmcs.dmistat` is `2`? If so, `dmi.op` will be stuck at 2 until the DMI is reset with W1 to `dtmcs.dmireset`. Does that happen?
Lines 6 ff. are repeated tries to W1C the `cmderr` field of `abstractcs`, which fail. I think that happens because the abstract command is still busy, in which case clearing the `cmderr` register is not allowed:
https://github.com/lowRISC/opentitan/blob/ae6e5adf94567db9a0233e96acbf8b7e869b096f/hw/vendor/pulp_riscv_dbg/src/dm_csrs.sv?plain=1#L397-L404
Do we observe the `dm::DTM_BUSY` (`0x3`) response?
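To spell out the gating I mean, a minimal Python model (my reading of the linked lines; an assumption rather than verified behaviour):

def cmderr_w1c(current: int, written: int, cmdbusy: bool) -> int:
    # While an abstract command is busy, the W1C write is dropped and
    # cmderr stays unchanged; otherwise it is per-bit write-1-to-clear.
    if cmdbusy:
        return current
    return current & ~written & 0b111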
So I think:
- The abstract command (register read of GPR `x9`) on line 1 fails, but that can happen. We may want to investigate this further, but I think that's second priority.
- The DMI should then be reset with W1 to `dmireset`. Does that happen? If not, why not?

The code is essentially here and there. The `scan()` function pushes the bits through the JTAG chain. Looking at it and the log, my understanding is:
Debug: 944 316 riscv-013.c:397 scan(): 41b w 00221009 @17 -> + 00000000 @04; 2i
write `0x00221009` to register `0x17` (`command`, Abstract Command)
Debug: 945 316 riscv-013.c:397 scan(): 41b r 00000000 @16 -> F ffffffff @7f; 2i
Error: 946 316 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=2
Debug: 947 316 riscv-013.c:805 execute_abstract_command(): command 0x221009 failed; abstractcs=0x0
read register `0x16` (`abstractcs`), get `0xffffffff` in register `0x7f`. On success we would have expected to see the `- 00221009 @17` notifying success of the previous command, but instead we get `op=2` (error) in the `dmi` register of the DTM, which means that the previous command failed. Importantly, OpenOCD does not seem to set `dmireset` at this point.
Debug: 948 316 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 2i
Debug: 949 316 riscv-013.c:407 scan(): cmderr=7 ->
Debug: 950 316 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
Debug: 951 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
OpenOCD tries to write `0x700` to `abstractcs`, probably to clear any error. The DTM reports a busy status (`b`), so OpenOCD increases its timeout (the `2i` will now become `3i`) and sets `dmireset` in `dtmcs`. At this point the DTM answers with `0xffffffff`, which I am not sure is expected.
Debug: 952 317 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 3i
Debug: 953 317 riscv-013.c:407 scan(): cmderr=7 ->
Debug: 954 317 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
Debug: 955 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Same routine: OpenOCD tries to clear `cmderr` but the DTM is still busy. This process is repeated a few times, and eventually the DTM responds to the reset:
Debug: 967 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
and OpenOCD tries to clear the error again
Debug: 968 320 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 7i
Debug: 969 320 riscv-013.c:407 scan(): cmderr=7 ->
Debug: 970 320 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 7i
Debug: 971 320 riscv-013.c:407 scan(): -> cmderr=7
but this seems to have no effect. Note that the DM answered `+ 00000700 @16;`, so it is not busy, but `cmderr` is still 7.
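Since decoding these `scan()` lines by hand is tedious, here is a small parser sketch (a hypothetical helper; the line format is inferred purely from the logs above):

import re

# Example input: "41b w 00221009 @17 -> + 00000000 @04; 2i"
SCAN_RE = re.compile(
    r"(?P<bits>\d+)b (?P<op>[wr-]) (?P<wdata>[0-9a-f]{8}) @(?P<waddr>[0-9a-f]+)"
    r" -> (?P<status>[+bF?]) (?P<rdata>[0-9a-f]{8}) @(?P<raddr>[0-9a-f]+); (?P<idle>\d+)i"
)

def decode_scan(line: str):
    m = SCAN_RE.search(line)
    if m is None:
        return None
    return {
        # '-' scans appear to just collect the previous response
        "op": {"w": "write", "r": "read", "-": "nop"}[m["op"]],
        "request": (int(m["waddr"], 16), int(m["wdata"], 16)),
        # '+' = previous op OK, 'b' = DTM busy, 'F'/'?' = failed scan
        "status": m["status"],
        "response": (int(m["raddr"], 16), int(m["rdata"], 16)),
        "idle_cycles": int(m["idle"]),
    }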
There are a few unclear points though:
- I would expect OpenOCD to set `dmireset` after the first failure, but it only seems to do it after the first busy.
- Why does clearing `cmderr` have no effect, even though the DM does not seem to report a busy state?

I did another run here: https://dev.azure.com/lowrisc/opentitan/_build/results?buildId=112650&view=results with #17840, which contains an OpenOCD patch to `dmireset` on error (could be useful, not sure if we want to upstream that). Interestingly, I got a "clean" failure without the error but just the busy, again on the same command:
2023-04-12T11:34:46.9256871Z [OPENOCD] Debug: 1135 303 riscv-013.c:783 execute_abstract_command(): command=0x221019; access register, size=32, postexec=0, transfer=1, write=0, regno=0x1019
2023-04-12T11:34:46.9257207Z [OPENOCD] Debug: 1136 304 riscv-013.c:397 scan(): 41b w 00221019 @17 -> + 00000000 @04; 2i
2023-04-12T11:34:46.9257509Z [OPENOCD] Debug: 1137 304 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 2i
2023-04-12T11:34:46.9257827Z [OPENOCD] Debug: 1138 304 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
2023-04-12T11:34:46.9258145Z [OPENOCD] Debug: 1139 305 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9258446Z [OPENOCD] Debug: 1140 305 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 3i
2023-04-12T11:34:46.9258767Z [OPENOCD] Debug: 1141 305 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
2023-04-12T11:34:46.9259086Z [OPENOCD] Debug: 1142 305 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9259381Z [OPENOCD] Debug: 1143 305 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 4i
2023-04-12T11:34:46.9259701Z [OPENOCD] Debug: 1144 305 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=5, ac_busy_delay=0
2023-04-12T11:34:46.9260015Z [OPENOCD] Debug: 1145 306 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9260315Z [OPENOCD] Debug: 1146 306 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 5i
2023-04-12T11:34:46.9260633Z [OPENOCD] Debug: 1147 306 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=6, ac_busy_delay=0
2023-04-12T11:34:46.9260945Z [OPENOCD] Debug: 1148 306 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9261245Z [OPENOCD] Debug: 1149 307 riscv-013.c:397 scan(): 41b r 00000000 @16 -> ? 413d5213 @00; 6i
2023-04-12T11:34:46.9261602Z [OPENOCD] Error: 1150 307 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=1
2023-04-12T11:34:46.9261894Z [OPENOCD] Debug: 1151 307 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
2023-04-12T11:34:46.9262216Z [OPENOCD] Debug: 1152 307 riscv-013.c:806 execute_abstract_command(): command 0x221019 failed; abstractcs=0x0
2023-04-12T11:34:46.9262568Z [OPENOCD] Debug: 1153 307 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 6i
2023-04-12T11:34:46.9262843Z [OPENOCD] Debug: 1154 307 riscv-013.c:407 scan(): cmderr=7 ->
2023-04-12T11:34:46.9263126Z [OPENOCD] Debug: 1155 308 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 6i
2023-04-12T11:34:46.9263402Z [OPENOCD] Debug: 1156 308 riscv-013.c:407 scan(): -> cmderr=7
2023-04-12T11:34:46.9263701Z [OPENOCD] Debug: 1157 308 riscv.c:3450 riscv_get_register(): [riscv.tap.0] s9: ffffffffffffffff
2023-04-12T11:34:46.9264005Z [OPENOCD] Debug: 1158 308 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
> which contains an OpenOCD patch to dmireset on error (could be useful, not sure if we want to upstream that)

Nice, I think it would be good to upstream that (though it's not the most urgent thing here).
As `41b r 00000000 @16 -> b ffffffff @7f` repeats multiple times, I have the impression that the JTAG `tdo` wire is stuck at `1` for a while. I don't know the mechanism behind it, though. Would it be possible to get even more visibility into `scan()` and which bits are scanned in and out? Alternatively, the debugger could do a JTAG Test-Logic-Reset when it gets all ones and retry the command.
OpenOCD indeed has another level of logs, but it is too verbose to use with Azure. I modified my PR to run every test with a custom runner that saves the logs to files and uploads them as an artefact:
Since the logs are quite large, they can be found by downloading the archive and looking at one of the failed tests:
sw/device/silicon_creator/rom/e2e:asm_interrupt_handler_fpga_cw310_test_otp_test_unlocked0.2.log
sw/device/silicon_creator/rom/e2e:asm_interrupt_handler_fpga_cw310_test_otp_test_unlocked0.3.log
EDIT: I did another run because the previous one had incomplete output.
Analysis: the log is quite long; the entire relevant part can be found in this gist, and below I am putting only the semi-high-level error messages (the super low-level JTAG ones are in the gist):
Debug: 12382 1056 riscv-013.c:783 execute_abstract_command(): command=0x22100e; access register, size=32, postexec=0, transfer=1, write=0, regno=0x100e
Debug: 12466 1057 riscv-013.c:397 scan(): 41b w 0022100e @17 -> + fffffffe @7f; 2i
Debug: 12550 1057 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 2i
Debug: 12551 1057 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
Debug: 12661 1058 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12715 1058 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 3i
Debug: 12716 1058 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
Debug: 12826 1059 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12880 1059 riscv-013.c:397 scan(): 41b r 00000000 @16 -> ? 413d5213 @00; 4i
Error: 12881 1059 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=1
Debug: 12991 1060 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
Debug: 12992 1060 riscv-013.c:806 execute_abstract_command(): command 0x22100e failed; abstractcs=0x0
Debug: 13076 1060 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 4i
Debug: 13077 1060 riscv-013.c:407 scan(): cmderr=7 ->
Debug: 13131 1060 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 4i
Debug: 13132 1060 riscv-013.c:407 scan(): -> cmderr=7
Debug: 13133 1060 riscv.c:3450 riscv_get_register(): [riscv.tap.0] a4: ffffffffffffffff
Debug: 13134 1060 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
Debug: 13135 1060 gdb_server.c:406 gdb_log_outgoing_packet(): [riscv.tap.0] sending packet: $E0E#ba
This is awesome work @pamaury ! Another thing we could do is just apply a patch to the openocd code at the bazel level since we vendor it in.
I have an update on the flaky tests, and it unfortunately points to a potentially worrying issue. This time I simply ran, 20 times, a test where GDB connects to OpenOCD and does nothing. The GDB script is literally:
target extended-remote :3333
echo :::: Done.\\n
The bitstream used disables ROM execution, meaning that the ROM only executes ~10 instructions when booting and then loops:
_rom_start_boot:
// Set up the global pointer and re-enable linker relaxations.
la gp, __global_pointer$
.option pop
LABEL_FOR_TEST(kRomStartBootMaybeHalt)
// Check if we should halt here.
li a0, (TOP_EARLGREY_OTP_CTRL_CORE_BASE_ADDR + \
OTP_CTRL_SW_CFG_WINDOW_REG_OFFSET)
lw t0, OTP_CTRL_PARAM_CREATOR_SW_CFG_ROM_EXEC_EN_OFFSET(a0)
bnez t0, .L_exec_en
LABEL_FOR_TEST(kRomStartBootHalted)
.L_halt_loop:
wfi
j .L_halt_loop
Just doing this already triggers 4 failures out of 20 runs. The Azure DevOps run contains an artefact with the logs. The important thing to note here is that simply connecting GDB to OpenOCD already triggers many operations, like halting the core and reading registers.
Can this issue be closed now, i.e., did https://github.com/lowRISC/opentitan/pull/18051 solve the remaining issues? Or are there tests still failing sporadically? If so which ones are still failing?
Closing this issue for now, as the issues found were fixed, which substantially improved the GDB tests. There is still some flakiness, but I'll triage it and open a new issue if needed.
Just to confirm @engdoreis: is there still flakiness with any of the GDB tests? Or are different tests flaky?
Here are the latest 5 runs:
In the last 2 runs all the tests passed, but 4 tests failed before that. Here are the logs for the March 26th failure:
[OPENOCD] Open On-Chip Debugger 0.12.0
[OPENOCD] Licensed under GNU GPL v2
[OPENOCD] For bug reports, read
[OPENOCD] http://openocd.org/doc/doxygen/bugs.html
[OPENOCD] adapter srst delay: 10
[OPENOCD] Info : Hardware thread awareness created
[OPENOCD] force hard breakpoints
[OPENOCD] Info : Listening on port 6666 for tcl connections
[OPENOCD] Info : Listening on port 4444 for telnet connections
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Info : clock speed 200 kHz
[OPENOCD] Info : JTAG tap: riscv.tap tap/device found: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)
[OPENOCD] Warn : JTAG tap: riscv.tap UNEXPECTED: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)
[OPENOCD] Error: JTAG tap: riscv.tap expected 1 of 1: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[OPENOCD] Info : JTAG tap: auto0.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[OPENOCD] Error: Trying to use configured scan chain anyway...
[OPENOCD] Warn : AUTO auto0.tap - use "jtag newtap auto0 tap -irlen 2 -expected-id 0x04f5484d"
[OPENOCD] Error: auto0.tap: IR capture error; saw 0x0003 not 0x0001
[OPENOCD] Warn : Bypassing JTAG setup events due to errors
[OPENOCD] Error: dtmcontrol is 0. Check JTAG connectivity/board power.
[OPENOCD] Warn : target riscv.tap.0 examination failed
[OPENOCD] Info : starting gdb server for riscv.tap.0 on 3333
Error: OpenOCD failed to examine the core.
I think we should watch the next runs and open issues if they keep failing.
Thanks @engdoreis for the data! Let's keep this open and follow up with @pamaury, @andreaskurth, and @msfschaffner next week (I added a meeting to everyone's calendar).
The error messages
[OPENOCD] Warn : JTAG tap: riscv.tap UNEXPECTED: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)
and
[OPENOCD] Error: auto0.tap: IR capture error; saw 0x0003 not 0x0001
seem to fall into the 'stuck-at-1' pattern. I assume OpenOCD puts the JTAG TAP into Test-Logic-Reset before it scans out IDCODE (though we should confirm that to be sure), which makes this failure special because I think it's the first time we're seeing this problem outside the Abstract Command mechanism (that problem is described in #18087). We don't know yet where the problem root lies, but I think this provides evidence that it's more likely in the JTAG (incl TAP implementation) part (or the I/O circuitry) than in the RISC-V-specific part of the debug module (and even less likely that it has something to do with Ibex). This is very valuable for root-causing the problem.
To get around this problem in the short term, we should be able to use the same technique as in #18051, though now in a different part of OpenOCD code. @pamaury WDYT?
I think you are right; the IDCODE is just handled by the TAP, and it's quite a strange output here. It also comes to mind that we haven't really tried to eliminate the JTAG adapter as a source of error: is it possible that some (all?) of the adapters have a problem? One idea to check this theory would be to log the ID of the adapter on every run, so that we can see if there is a correlation between errors and adapters.
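A sketch of what that logging could look like with pyusb (the Olimex vendor ID `0x15ba` is an assumption; adjust for the adapters actually in the lab):

import usb.core
import usb.util

# Log the serial number of every connected JTAG adapter at the start of
# a run, so failures can later be correlated with a specific adapter.
for dev in usb.core.find(find_all=True, idVendor=0x15BA):
    serial = usb.util.get_string(dev, dev.iSerialNumber)
    print(f"JTAG adapter {dev.idVendor:04x}:{dev.idProduct:04x} serial={serial}")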
I agree; we've reached a point where it makes sense to check whether the problem is correlated with one (or a few) specific JTAG adapters.
For everyone's knowledge: we are getting a CW310 up and running that will do continual tests for this. If there are no failures over a day or so, then it points more towards a fault with individual JTAG adapters and boards. In parallel, the failed runs will be examined to see if a common board can be identified with the failures. This is possible given how jobs are allocated in the same order (without randomisation across the boards), so quite frequently the same runs will land on the same board. If these two activities lead to a JTAG adapter being identified, then it will be swapped out; otherwise further investigation will be needed. Per discussions on the thread, the mitigation / software workaround seems practical, although not ideal. With more time this can be debugged further. If there are alternative suggestions then please continue to list / detail them. @luismarques / @jprendes for visibility
Initial investigations suggest that it is not just a single board that is affected, but there is one board that is affected more often than the others.
I'm working on getting an Integrated Logic Analyzer (ILA) core connected to the relevant JTAG signals on an FPGA. That's not trivial because synthesis and implementation algorithms mangle signals. Setting signal-specific don't touch and mark debug attributes didn't help, but marking the signals as debug via TCL commands in XDC files seems to help. A long turnaround time due to high resource utilization on the FPGA doesn't allow fast progress, but I'm slowly getting there ...
Note that there is also the option of bringing out signals to pins for an external logic analyzer, and unlike with an ILA, this can be done without disturbing the placement and routing of an existing design checkpoint. Since it needs unused, accessible pins on the board, it isn't able to inspect as many signals as you can with an ILA, but if you find adding the ILA makes the problem go away...
In the past, I did this for pwrmgr debug with Tcl commands like create_port, create_net, connect_net, etc. Once you have a script set up, turnaround time becomes significantly shorter (assuming you are able to readily find the signals of interest after the tool's renaming).
One other note: You can generate an ILA core that you can instantiate in RTL pre-synthesis. I've used this in the past when synthesis mangled signals too much.
While working on this, one issue I have identified (I think), and which is quite problematic, is that if the ROM tries to boot but does not succeed, it will reset the chip and try again. This can be the case in some tests where we do not bootstrap but do not disable ROM execution either, and it could also be the case for a chip in production with its flash erased. The problem is that there is then an inherent race between doing JTAG operations and the reset, since the reset will also reset the JTAG connection. With the RISC-V TAP, one can at least reset the core and halt immediately, but this still leaves the watchdog to disable, I think, hence it is still racy. Am I missing something? Is there a good solution to this?
ndmreset is supposed to reset the watchdog as well. If it doesn't, that would be a bug, I think.
Hm... but maybe JTAG could be interrupted while trying to issue commands to halt and reset. I guess that might still need a retry loop to verify that we have halted at the intended PC.
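Something along these lines might work, reusing the `ocd_cmd()` helper sketched earlier (`reset halt` and the target's `curstate` are standard OpenOCD commands; the loop itself is hypothetical):

import time

def halt_after_reset(sock, retries: int = 10) -> bool:
    # Try to win the race against the ROM boot loop: reset with a halt
    # request, then check whether the core actually stopped.
    for _ in range(retries):
        ocd_cmd(sock, "reset halt")
        if ocd_cmd(sock, "riscv.tap.0 curstate").strip() == "halted":
            return True
        time.sleep(0.1)
    return False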
Yes, ndmreset does reset the watchdog. But as you pointed out, the issue is that the chip could reset before we have the time to even issue this command. When that happens, the JTAG connection will most likely fail because OpenOCD will get confused, so we need to retry the whole connection to OpenOCD and try to stop things again. If the ROM is in a boot loop, it will reset after a very short time, so the window to stop the core could be quite small. Furthermore, if we connect to the LC TAP, then we have no way to halt the core. Another failure I have observed is quite nasty:
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd] Waiting for OpenOCD to be ready to accept a TCL connection...
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Open On-Chip Debugger 0.12.0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Licensed under GNU GPL v2
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] For bug reports, read
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] http://openocd.org/doc/doxygen/bugs.html
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] adapter speed: 200 kHz
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] jtag
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] TapName Enabled IdCode Expected IrLen IrCap IrMask
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] -- ------------------- -------- ---------- ---------- ----- ----- ------
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Hardware thread awareness created
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] force hard breakpoints
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] debug_level: 2
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : clock speed 200 kHz
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : JTAG tap: riscv.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : datacount=2 progbufsize=8
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Examined RISC-V core; found 1 harts
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : hart 0: XLEN=32, misa=0x40901104
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : starting gdb server for riscv.tap.0 on 3333
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Listening on port 3333 for gdb connections
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Listening on port 6666 for tcl connections
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd] Connecting to OpenOCD tcl interface...
[2023-05-08T14:54:34Z WARN opentitanlib::util::openocd] Attempting to make tcp connection...
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Listening on port 4444 for telnet connections
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : accepting 'tcl' connection on tcp/6666
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Open On-Chip Debugger 0.12.0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd] OpenOCD version: Open On-Chip Debugger 0.12.0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : Hart 0 unexpectedly reset!
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : JTAG tap: riscv.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] halted
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd] status: halted
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] Info : [riscv.tap.0] Found 4 triggers
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] hart.xlen 32
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] hart.trigger_count 4
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.read_while_running8 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.write_while_running8 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.read_while_running16 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.write_while_running16 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.read_while_running32 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.write_while_running32 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.read_while_running64 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.write_while_running64 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.read_while_running128 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] target.memory.write_while_running128 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.abits 7
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.progbufsize 8
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbversion 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbasize 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbaccess128 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbaccess64 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbaccess32 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbaccess16 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.sbaccess8 0
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr] dm.authenticated 1
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd::stderr]
[2023-05-08T14:54:34Z INFO opentitanlib::util::openocd] info: hart.xlen 32
hart.trigger_count 4
target.memory.read_while_running8 0
target.memory.write_while_running8 0
target.memory.read_while_running16 0
target.memory.write_while_running16 0
target.memory.read_while_running32 0
target.memory.write_while_running32 0
target.memory.read_while_running64 0
target.memory.write_while_running64 0
target.memory.read_while_running128 0
target.memory.write_while_running128 0
dm.abits 7
dm.progbufsize 8
dm.sbversion 0
dm.sbasize 0
dm.sbaccess128 0
dm.sbaccess64 0
dm.sbaccess32 0
dm.sbaccess16 0
dm.sbaccess8 0
dm.authenticated 1
Here we might have a troublesome collision between debug (and manufacturing access?) and security.
Is there something ROM should do (maybe together with TAP straps or SW straps?) to present a synchronization point for access? Or perhaps should hardware behavior change to prevent resetting the JTAG TAP by anything but the TAP controller when access is authorized (and the TAP straps select one)?
If we can successfully get through rom_init(), then setting the SW straps to enable bootstrapping does present such a synchronization point. But that doesn't help if you want to actually debug the early parts of the ROM.
Maybe we could use the RMA software straps here, because I think the purpose of those was essentially to give software time to connect over JTAG. One thing we already brought up, though, is that the default spin count (CREATOR_SW_CFG_RMA_SPIN_CYCLES, set to 10) is so low that it is actually impossible to connect over JTAG at all: with the adapter running at 200 kHz (as in the logs above), a single TCK period is already 5 µs, while 10 cycles of a core clock in the MHz range close the window in roughly a microsecond, before even one JTAG bit can be shifted.
Your point about (ab)using the ROM bootstrap is great though: if we set those pins, that gives us plenty of time to halt the core, I think. One downside, indeed, is that if the low-level init fails for some reason, the chip will probably still reset early.
Another option for the tests is to just bootstrap a signed "empty test", i.e., one that just returns true, so that this reset loop doesn't happen. Right now the plan is for the OTP setting for ROM execution to default to disabled, so this race condition shouldn't happen in production, but currently the OTP image for the test_unlocked LC states defaults to ROM exec enabled.
Maybe the test_unlocked* OTP images should be changed to reflect reality (i.e., have ROM execution set to disabled)? CC: @moidx
I prefer to have an empty test for now (in the interest of time), and a focused test that has ROM_EXEC disabled in OTP. Changing the default for TestUnlocked* would require adding more overhead across e2e and top-level tests. I am OK with it, but not sure we have enough time to make that change this week.
Update on stress tests after applying #18554: we still have failures but many follow a particular, very suspicious pattern:
[GDB] :::: Send OpenOCD the 'reset halt' command.
Debug: 814 1085 riscv-013.c:397 scan(): 41b w 80000001 @10 -> + 00000000 @00; 2i
Debug: 815 1085 riscv-013.c:407 scan(): haltreq dmactive ->
[GDB] JTAG tap: riscv.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
Debug: 816 1086 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + 80000001 @10; 2i
Debug: 817 1086 riscv-013.c:407 scan(): -> haltreq dmactive
Debug: 818 1086 riscv-013.c:397 scan(): 41b w 80000003 @10 -> + 00000000 @10; 2i
Debug: 819 1086 riscv-013.c:407 scan(): haltreq ndmreset dmactive ->
[CONSOLE] [2023-05-12T10:53:28Z INFO opentitanlib::util::usb] Could not open device at bus=1 address=108: No such device (it may have been disconnected)
Debug: 820 1086 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + 80000003 @10; 2i
Debug: 821 1086 riscv-013.c:407 scan(): -> haltreq ndmreset dmactive
Debug: 822 1086 command.c:155 script_debug(): command - riscv.tap.0 invoke-event reset-assert-post
Debug: 823 1086 command.c:155 script_debug(): command - riscv.tap.0 invoke-event reset-deassert-pre
Debug: 824 1086 command.c:155 script_debug(): command - transport select
Debug: 825 1086 command.c:155 script_debug(): command - expr [ string first "jtag" $_TRANSPORT ] != -1
Debug: 826 1086 command.c:155 script_debug(): command - riscv.tap.0 cget -chain-position
Debug: 827 1086 command.c:155 script_debug(): command - jtag tapisenabled riscv.tap
Debug: 828 1086 command.c:155 script_debug(): command - riscv.tap.0 arp_reset deassert 1
Debug: 829 1086 target.c:2199 target_free_all_working_areas_restore(): freeing all working areas
Debug: 830 1086 riscv.c:1283 riscv_deassert_reset(): [0]
[CONSOLE] [2023-05-12T10:53:28Z INFO opentitanlib::util::usb] Could not open device at bus=1 address=12: No such device (it may have been disconnected)
[CONSOLE] [2023-05-12T10:53:28Z INFO opentitanlib::util::usb] Could not open device at bus=1 address=52: No such device (it may have been disconnected)
Debug: 831 1087 riscv-013.c:397 scan(): 41b w 80000001 @10 -> + 00000000 @10; 2i
Debug: 832 1087 riscv-013.c:407 scan(): haltreq dmactive ->
Debug: 833 1087 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + fff00001 @7f; 2i
[CONSOLE] [2023-05-12T10:53:28Z INFO opentitanlib::util::usb] Could not open device at bus=1 address=10: No such device (it may have been disconnected)
Debug: 834 1087 riscv-013.c:397 scan(): 41b w 80000001 @10 -> b ffffffff @7f; 2i
Debug: 835 1087 riscv-013.c:407 scan(): haltreq dmactive ->
Debug: 836 1087 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [99, 2]
[CONSOLE] Debug: 837 1088 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
[2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [100, 2]
Debug: 838 1088 riscv-013.c:397 scan(): 41b w 80000001 @10 -> b ffffffff @7f; 3i
Debug: 839 1088 riscv-013.c:407 scan(): haltreq dmactive ->
Debug: 840 1088 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [104, 2]
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [105, 2]
Debug: 841 1088 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
Here it looks like, even though OpenOCD was told not to use srst, something gets reset; or maybe the strap pins are touched by opentitantool because the USB adapter is disconnected and reconnected. I am investigating this.
Hierarchy of regression failure
Chip Level
Failure Description
The ROM E2E tests that use the rv_dm + openocd + gdb stack have been flaky lately. The reason should be investigated, and the tests should be improved.
Steps to Reproduce
Tests with similar or related failures