lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage,rom_e2e,opentitanlib] gdb/openocd/jtag tests flaky/failing #17729

Closed engdoreis closed 1 year ago

engdoreis commented 1 year ago

Hierarchy of regression failure

Chip Level

Failure Description

The ROM E2E tests that use the rv_dm + OpenOCD + GDB combination have been flaky lately (screenshot attached).

The reason should be investigated, and the tests should be improved.

Steps to Reproduce

Tests with similar or related failures

pamaury commented 1 year ago

Observations regarding rom_e2e_debug_test_otp_test_unlocked0

I tried to debug the problem: this test fails consistently on fred. Steps to get more debugging information:

pamaury commented 1 year ago

I was able to confirm independently (using the OpenOCD jtag struct in opentitanlib) that once a read using the sysbus fails, all subsequent reads seem to fail. I don't know yet if this is an OpenOCD problem or a hardware problem.

timothytrippel commented 1 year ago

Couple things:

  1. not sure how e2e_chip_specific_startup_rma_fpga_cw310_rom could be causing similar issues, since this test does not use the JTAG/OpenOCD/GDB infra.
  2. @pamaury, when you used the OpenOCD/JTAG infra in opentitanlib to perform sysbus ops (i.e., taking GDB out of the picture), are you seeing flakiness with the operations? or are the failures more consistent?
pamaury commented 1 year ago

I have only investigated rom_e2e_debug_test_otp_test_unlocked0 so far. In my tests it has not worked a single time yet (i.e., no flakiness at all, it just plain does not work), both for rom_e2e_debug_test_otp_test_unlocked0 and with openocd+jtag+opentitanlib. I have also tried with OpenOCD 0.12 (we are still using -rc1 until #17646 gets merged), same result.

pamaury commented 1 year ago

Here is an interesting trace using jtag+openocd directly:

[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 480 1431 command.c:155 script_debug(): command - read_memory 0x10000000 8 4 phys
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 481 1431 riscv.c:2186 riscv_openocd_poll(): polling all harts
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 482 1431 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 483 1431 riscv.c:2091 riscv_poll_hart(): polling hart 0, target->state=2
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 484 1432 riscv-013.c:397 scan(): 41b r 00000000 @11 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 485 1433 riscv-013.c:397 scan(): 41b - 00000000 @11 -> + 00030382 @11; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 486 1433 riscv-013.c:407 scan():  ->  allresumeack anyresumeack allhalted anyhalted authenticated version=2
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 487 1433 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 488 1434 riscv-013.c:397 scan(): 41b w 00118000 @38 -> + 00000000 @11; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 489 1434 riscv-013.c:407 scan():  sbreadonaddr sbautoincrement sbreadondata -> 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 490 1434 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00118000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 491 1434 riscv-013.c:407 scan():  ->  sbreadonaddr sbautoincrement sbreadondata
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 492 1435 riscv-013.c:397 scan(): 41b w 10000000 @39 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 493 1435 riscv-013.c:397 scan(): 41b - 00000000 @39 -> + 10000000 @39; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 494 1435 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @39; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 495 1436 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 30aad985 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 496 1436 riscv-013.c:2507 log_memory_access(): M[0x10000000] reads 0x85
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 497 1436 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 0030aad9 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 498 1436 riscv-013.c:2507 log_memory_access(): M[0x10000001] reads 0xd9
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 499 1436 riscv-013.c:397 scan(): 41b - 00000000 @00 -> + 000030aa @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 500 1436 riscv-013.c:2507 log_memory_access(): M[0x10000002] reads 0xaa
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 501 1437 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @00; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 502 1437 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20118407 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 503 1437 riscv-013.c:407 scan():  ->  sbversion=1 sbreadonaddr sbautoincrement sbreadondata sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 504 1438 riscv-013.c:397 scan(): 41b w 00110000 @38 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 505 1438 riscv-013.c:407 scan():  sbreadonaddr sbautoincrement -> 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 506 1438 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00110000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 507 1438 riscv-013.c:407 scan():  ->  sbreadonaddr sbautoincrement
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 508 1438 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 509 1439 riscv-013.c:397 scan(): 41b - 00000000 @3c -> + 00000030 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 510 1439 riscv-013.c:2507 log_memory_access(): M[0x10000003] reads 0x30
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 511 1439 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 512 1440 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20110407 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 513 1440 riscv-013.c:407 scan():  ->  sbversion=1 sbreadonaddr sbautoincrement sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 514 1440 riscv-013.c:2848 log_mem_access_result(): Succeeded to read memory via system bus.
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] User : 515 1440 options.c:52 configuration_output_handler(): 0x85 0xd9 0xaa 0x30User : 516 1440 options.c:52 configuration_output_handler(): 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 517 1483 command.c:155 script_debug(): command - read_memory 0xdeadbeef 32 1 phys
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 518 1484 riscv.c:2186 riscv_openocd_poll(): polling all harts
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 519 1484 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 520 1484 riscv.c:2091 riscv_poll_hart(): polling hart 0, target->state=2
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 521 1484 riscv-013.c:397 scan(): 41b r 00000000 @11 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 522 1484 riscv-013.c:397 scan(): 41b - 00000000 @11 -> + 00030382 @11; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 523 1484 riscv-013.c:407 scan():  ->  allresumeack anyresumeack allhalted anyhalted authenticated version=2
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 524 1484 riscv.c:3304 riscv_set_current_hartid(): setting hartid to 0, was 0
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 525 1485 riscv-013.c:397 scan(): 41b w 00150000 @38 -> + 00000000 @11; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 526 1485 riscv-013.c:407 scan():  sbreadonaddr sbaccess=2 sbautoincrement -> 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 527 1485 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00150000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 528 1485 riscv-013.c:407 scan():  ->  sbreadonaddr sbaccess=2 sbautoincrement
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 529 1486 riscv-013.c:397 scan(): 41b w deadbeef @39 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 530 1486 riscv-013.c:397 scan(): 41b - 00000000 @39 -> + deadbeef @39; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 531 1486 riscv-013.c:397 scan(): 41b r 00000000 @3c -> + 00000000 @39; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 532 1487 riscv-013.c:397 scan(): 41b - 00000000 @3c -> + 00000030 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 533 1487 riscv-013.c:2507 log_memory_access(): M[0xdeadbeef] reads 0x00000030
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 534 1487 riscv-013.c:397 scan(): 41b r 00000000 @38 -> + 00000000 @3c; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 535 1488 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 20153407 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 536 1488 riscv-013.c:407 scan():  ->  sbversion=1 sbreadonaddr sbaccess=2 sbautoincrement sberror=3 sbasize=32 sbaccess32 sbaccess16 sbaccess8
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 537 1488 riscv-013.c:397 scan(): 41b w 00007000 @38 -> + 00000000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 538 1488 riscv-013.c:407 scan():  sberror=7 -> 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 539 1488 riscv-013.c:397 scan(): 41b - 00000000 @38 -> + 00007000 @38; 2i
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Debug: 540 1488 riscv-013.c:407 scan():  ->  sberror=7
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Warn : 541 1488 riscv-013.c:2846 log_mem_access_result(): Failed to read memory via system bus.
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Error: 542 1488 riscv-013.c:3596 read_memory(): Target riscv.tap.0: Failed to read memory (addr=0xdeadbeef)
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Error: 543 1488 riscv-013.c:3597 read_memory():   progbuf=disabled, sysbus=failed, abstract=disabled
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] Error: 544 1488 target.c:4707 target_jim_read_memory(): read_memory: read at 0xdeadbeef with width=32 and count=1 failed
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] 
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr] User : 545 1488 command.c:608 command_run_line(): read_memory: failed to read memory
[2023-03-29T16:13:10Z INFO  opentitanlib::util::openocd::stderr]

We can see that right after the bad read, sberror=3 (alignment error, which is correct). After that, OpenOCD tries to clear the error by writing sberror=7 (sberror is r/w1c), but sberror=7 sticks, which is clearly wrong...

pamaury commented 1 year ago

Looking at /home/pamaury/project/opentitan/hw/vendor/pulp_riscv_dbg/src/dm_csrs.sv, line 460

sbcs_d.sberror     = sbcs_q.sberror     & {3{~(sbcs.sberror == 3'd1)}};

This seems suspicious: it seems to clear the errors only when 1 is written, instead of 7 (as per the spec). The spec says:

When the Debug Module’s system bus master encounters an error, this field gets set. The bits in this field remain set until they are cleared by writing 1 to them. While this field is non-zero, no more system bus accesses can be initiated by the Debug Module.

Need a Verilog expert to confirm.

timothytrippel commented 1 year ago

Looking at the current RTL, the update mechanism for the SBCS CSR is:

  1. check if the sberror field of the DMI request is equal to 001, and
  2. if so, AND the sberror field of the SBCS CSR with 000 to clear it.

I think the issue is that the DMI request data bits expected should be 111 (i.e., 7), as this is what OpenOCD sends, and is what matches the spec (as @pamaury points out).

So I think this line should be:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'b111)}};

instead of:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'd1)}};
a-will commented 1 year ago

Looking at the current RTL, the update mechanism for the SBCS CSR is:

  1. check if the sberror field of the DMI request is equal to 001, and
  2. if so, AND the sberror field of the SBCS CSR with 000 to clear it.

I think the issue is that the DMI request data bits expected should be 111 (i.e., 7), as this is what OpenOCD sends, and is what matches the spec (as @pamaury points out).

So I think this line should be:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'b111)}};

instead of:

sbcs_d.sberror = sbcs_q.sberror & {3{~(sbcs.sberror == 3'd1)}};

Or perhaps just...

sbcs_d.sberror = sbcs_q.sberror & ~sbcs.sberror;

The original author may have misinterpreted the meaning of "write 1 to clear" :)

timothytrippel commented 1 year ago

Well, the problem with:

sbcs_d.sberror = sbcs_q.sberror & ~sbcs.sberror;

is that it could cause partial clearing of the sberror field.

That may not be an issue (the spec is unclear, from what I can tell), so perhaps this is ok? The reason I doubt it would be ok, though, is that the error code uses all of the bits, so a partial clear would change the meaning of the error code.
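
To make that concrete, here is a small illustration of the difference (plain C, purely for the bit arithmetic; the written value is hypothetical, since OpenOCD itself writes 3'b111):

#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint8_t sberror_q = 0x3;  // current 3-bit error code (the alignment error seen above)
  uint8_t write_val = 0x1;  // a hypothetical debugger write of 3'b001

  // Per-bit write-1-to-clear, as in "sberror_q & ~sberror_write":
  uint8_t per_bit = (uint8_t)(sberror_q & ~write_val) & 0x7;  // -> 0x2
  // "If we clear one, we clear all", as in "& ~{3{|sberror_write}}":
  uint8_t clear_all = (write_val != 0) ? 0x0 : sberror_q;     // -> 0x0

  // The per-bit variant leaves 3'b010 behind, a different, still-nonzero
  // error code; the clear-all variant returns the field to "no error".
  printf("per-bit clear: 0x%x, clear-all: 0x%x\n", per_bit, clear_all);
  return 0;
}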

timothytrippel commented 1 year ago

I submitted an issue on the matter here: https://github.com/pulp-platform/riscv-dbg/issues/154#issue-1646309814

CC: @msfschaffner @andreaskurth

a-will commented 1 year ago

Well, the problem with:

sbcs_d.sberror = sbcs_q.sberror & ~sbcs.sberror;

is that it could cause partial clearing of the sberror field.

That may not be an issue (the spec is unclear, from what I can tell), so perhaps this is ok?

It's a funky trade-off, and the spec should probably clarify that... Writing 3'b111 is really the only right thing to do from the host side, since this isn't a bit map or one-hot encoding, but on the OT side, rejecting writes that only explicitly clear the set bits also doesn't seem right.

A different approach might be...

// If we clear one, we clear all
sbcs_d.sberror = sbcs_q.sberror & ~{3{|sbcs.sberror}};
timothytrippel commented 1 year ago

Yep, agreed, either works; I think whichever mechanism is chosen just needs to clear all bits of the field.

andreaskurth commented 1 year ago

Thanks for flagging and investigating this; I'll look into this on Friday (I'm OoO tomorrow)

timothytrippel commented 1 year ago

sg, thanks @andreaskurth !

pamaury commented 1 year ago

@andreaskurth Should the spec be clarified in this respect? I don't know if one can just do a PR on this spec repo.

andreaskurth commented 1 year ago

I don't think the spec should be changed, but the RTL should; see my response on the upstream issue. Let's give the upstream maintainers a bit of time to consider this. If we don't get a reply, we can patch it in this repo.

pamaury commented 1 year ago

More tests:

pamaury commented 1 year ago

New finding on sram_program_fpga_cw310_test_otp_test_unlocked0: the following fails:

Conclusion: reading a memory location in the main RAM that was never written to fails!

This explains the test failure: GDB tries to read the top of the stack but sp is pointing to the top of the RAM and the stack is not initialised.

This could be due to ECC check failures caused by memory scrambling (thanks @engdoreis for the suggestion). To be confirmed.
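
For illustration, one conceivable workaround (an assumption on my part, not a description of the actual fix) would be for the SRAM program to write its stack region once before the debugger reads it, so that every word has matching integrity bits under scrambling. The linker symbol names below are made up:

#include <stdint.h>

// Hypothetical linker symbols bounding the stack region; the real names in
// the SRAM program's linker script may differ.
extern uint32_t _stack_start[];
extern uint32_t _stack_end[];

// Writing every word once stores valid integrity/ECC bits through the
// scrambler, so a later read by the debugger no longer faults.
static void init_stack_for_debug(void) {
  for (volatile uint32_t *p = _stack_start; p < _stack_end; ++p) {
    *p = 0;
  }
}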

pamaury commented 1 year ago

PR to solve sram_program_fpga_cw310_test_otp_test_unlocked0: #17815

pamaury commented 1 year ago

I have done several manual runs of the nightly regression tests with more logging:

Summary:

The failures look random; there isn't a single cause: sometimes the hart is unexpectedly reset, sometimes OpenOCD simply doesn't manage to talk to the core.

andreaskurth commented 1 year ago

I have done several manual runs of the nightly regression tests with more logging:

Assuming this is not resolved yet, can you summarize the logs?

pamaury commented 1 year ago

I ran a special "nightly" CI test where I repeated a limited selection of tests 10 times each to see what happens, and I managed to get a fair number of failures: https://dev.azure.com/lowrisc/opentitan/_build/results?buildId=112397&view=logs&j=ca4fe378-ea04-5a89-5894-29913db80385&t=05ba4405-1aa2-5a0c-7d47-3e8a33bd2fee

Interestingly, many failures are similar and show this kind of pattern:

[OPENOCD] Error: failed read at 0x16, status=1
[OPENOCD] Info : dropped 'gdb' connection
[GDB] sw/device/silicon_creator/rom/e2e/rom_e2e_asm_watchdog_bark_fpga_cw310_test_otp_test_unlocked0.gdb:2: Error in sourced command file:
GDB exited with code 1
[GDB] Could not read registers; remote failure reply 'E0E'

I am not sure yet what it means, but it seems to happen with registers 0x16 and 0x17, and the remote failure reply is also quite unusual. I plan to investigate this in more detail tomorrow.

EDIT: I re-ran the test with a lot more debug; the output is here and here is the relevant part:

2023-04-11T09:51:16.1727711Z [OPENOCD] Debug: 943 315 riscv-013.c:782 execute_abstract_command(): command=0x221009; access register, size=32, postexec=0, transfer=1, write=0, regno=0x1009
2023-04-11T09:51:16.1728045Z [OPENOCD] Debug: 944 316 riscv-013.c:397 scan(): 41b w 00221009 @17 -> + 00000000 @04; 2i
2023-04-11T09:51:16.1728343Z [OPENOCD] Debug: 945 316 riscv-013.c:397 scan(): 41b r 00000000 @16 -> F ffffffff @7f; 2i
2023-04-11T09:51:16.1728633Z [OPENOCD] Error: 946 316 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=2
2023-04-11T09:51:16.1728953Z [OPENOCD] Debug: 947 316 riscv-013.c:805 execute_abstract_command(): command 0x221009 failed; abstractcs=0x0
2023-04-11T09:51:16.1729264Z [OPENOCD] Debug: 948 316 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 2i
2023-04-11T09:51:16.1729538Z [OPENOCD] Debug: 949 316 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1729845Z [OPENOCD] Debug: 950 316 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
2023-04-11T09:51:16.1730163Z [OPENOCD] Debug: 951 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1730461Z [OPENOCD] Debug: 952 317 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 3i
2023-04-11T09:51:16.1730775Z [OPENOCD] Debug: 953 317 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1731157Z [OPENOCD] Debug: 954 317 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
2023-04-11T09:51:16.1731480Z [OPENOCD] Debug: 955 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1731884Z [OPENOCD] Debug: 956 318 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 4i
2023-04-11T09:51:16.1732156Z [OPENOCD] Debug: 957 318 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1732465Z [OPENOCD] Debug: 958 318 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=5, ac_busy_delay=0
2023-04-11T09:51:16.1732782Z [OPENOCD] Debug: 959 318 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1733080Z [OPENOCD] Debug: 960 318 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 5i
2023-04-11T09:51:16.1733352Z [OPENOCD] Debug: 961 318 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1733657Z [OPENOCD] Debug: 962 318 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=6, ac_busy_delay=0
2023-04-11T09:51:16.1733970Z [OPENOCD] Debug: 963 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-11T09:51:16.1734270Z [OPENOCD] Debug: 964 319 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 6i
2023-04-11T09:51:16.1734543Z [OPENOCD] Debug: 965 319 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1734845Z [OPENOCD] Debug: 966 319 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=7, ac_busy_delay=0
2023-04-11T09:51:16.1735159Z [OPENOCD] Debug: 967 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
2023-04-11T09:51:16.1735457Z [OPENOCD] Debug: 968 320 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 7i
2023-04-11T09:51:16.1735729Z [OPENOCD] Debug: 969 320 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-11T09:51:16.1736013Z [OPENOCD] Debug: 970 320 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 7i
2023-04-11T09:51:16.1736285Z [OPENOCD] Debug: 971 320 riscv-013.c:407 scan():  ->  cmderr=7
2023-04-11T09:51:16.1736576Z [OPENOCD] Debug: 972 320 riscv.c:3452 riscv_get_register(): [riscv.tap.0] s1: ffffffffffffffff
2023-04-11T09:51:16.1736884Z [OPENOCD] Debug: 973 320 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error

Analysis: a command fails for some reason, sets cmderr to 7, then OpenOCD tries to clear the error by writing 7 to it as per the spec but this does nothing. @andreaskurth could you have a look at this?

andreaskurth commented 1 year ago

Line 1: Abstract command for a 32-bit register read from GPR x9 = s1.

(Lines 2 and 3: I don't know the code for scan(), what exactly do those mean?)

Line 4: Read from Abstract Control and Status (abstractcs) register fails in dmi_op_timeout() -- what does status=2 mean? That dtmcs.dmistat is 2? If so, dmi.op will be stuck at 2 until the DMI is reset with W1 to dtmcs.dmireset. Does that happen?

Lines 6 ff are repeated tries to W1C the cmderr field of abstractcs, which fail. I think that happens because the abstract command is still busy, in which case clearing the cmderr register is not allowed: https://github.com/lowRISC/opentitan/blob/ae6e5adf94567db9a0233e96acbf8b7e869b096f/hw/vendor/pulp_riscv_dbg/src/dm_csrs.sv?plain=1#L397-L404

Do we observe the dm::DTM_BUSY (0x3) response?

So I think:

pamaury commented 1 year ago

The code is essentially here and there. A scan pushes the bits through the JTAG chain. Looking at it and the log, my understanding is:

Debug: 944 316 riscv-013.c:397 scan(): 41b w 00221009 @17 -> + 00000000 @04; 2i

write 0x00221009 to register 0x17 (command, Abstract Command)

Debug: 945 316 riscv-013.c:397 scan(): 41b r 00000000 @16 -> F ffffffff @7f; 2i
Error: 946 316 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=2
Debug: 947 316 riscv-013.c:805 execute_abstract_command(): command 0x221009 failed; abstractcs=0x0

read register 0x16 (abstractcs), get 0xffffffff at register 0x7f. On success we would have expected to see - 00221009 @17 acknowledging success of the previous command, but instead we get op=2 (error) in the DTM's dmi register, which means that the previous command failed. Importantly, OpenOCD does not seem to set dmireset at this point.

Debug: 948 316 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 2i
Debug: 949 316 riscv-013.c:407 scan():  cmderr=7 ->
Debug: 950 316 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
Debug: 951 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff

OpenOCD tries to write 0x700 to abstractcs, probably to clear any error. The DTM reports a busy status (b), so OpenOCD increases its timeout (the 2i will now become 3i) and sets dmireset in dtmcs. At this point the DTM answers with 0xffffffff, which I am not sure is expected.

Debug: 952 317 riscv-013.c:397 scan(): 41b w 00000700 @16 -> b ffffffff @7f; 3i
Debug: 953 317 riscv-013.c:407 scan():  cmderr=7 ->
Debug: 954 317 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
Debug: 955 317 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff

Same routine: OpenOCD tries to clear cmderr but the DTM is still busy. This process is repeated a few times and eventually the DTM responds to the reset:

Debug: 967 319 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071

and OpenOCD tries to clear the error again

Debug: 968 320 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 7i
Debug: 969 320 riscv-013.c:407 scan():  cmderr=7 ->
Debug: 970 320 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 7i
Debug: 971 320 riscv-013.c:407 scan():  ->  cmderr=7

but this seems to have no effect. Note that the DM answered + 00000700 @16, so it is not busy, but cmderr is still 7.

There are a few unclear points though:

pamaury commented 1 year ago

I did another run here: https://dev.azure.com/lowrisc/opentitan/_build/results?buildId=112650&view=results with #17840 which contains an OpenOCD patch to dmireset on error (could be useful, not sure if we want to upstream that). Interestingly, I got a "clean" failure without the error but just the busy, again on the same command:

2023-04-12T11:34:46.9256871Z [OPENOCD] Debug: 1135 303 riscv-013.c:783 execute_abstract_command(): command=0x221019; access register, size=32, postexec=0, transfer=1, write=0, regno=0x1019
2023-04-12T11:34:46.9257207Z [OPENOCD] Debug: 1136 304 riscv-013.c:397 scan(): 41b w 00221019 @17 -> + 00000000 @04; 2i
2023-04-12T11:34:46.9257509Z [OPENOCD] Debug: 1137 304 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 2i
2023-04-12T11:34:46.9257827Z [OPENOCD] Debug: 1138 304 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
2023-04-12T11:34:46.9258145Z [OPENOCD] Debug: 1139 305 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9258446Z [OPENOCD] Debug: 1140 305 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 3i
2023-04-12T11:34:46.9258767Z [OPENOCD] Debug: 1141 305 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
2023-04-12T11:34:46.9259086Z [OPENOCD] Debug: 1142 305 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9259381Z [OPENOCD] Debug: 1143 305 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 4i
2023-04-12T11:34:46.9259701Z [OPENOCD] Debug: 1144 305 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=5, ac_busy_delay=0
2023-04-12T11:34:46.9260015Z [OPENOCD] Debug: 1145 306 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9260315Z [OPENOCD] Debug: 1146 306 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 5i
2023-04-12T11:34:46.9260633Z [OPENOCD] Debug: 1147 306 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=6, ac_busy_delay=0
2023-04-12T11:34:46.9260945Z [OPENOCD] Debug: 1148 306 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
2023-04-12T11:34:46.9261245Z [OPENOCD] Debug: 1149 307 riscv-013.c:397 scan(): 41b r 00000000 @16 -> ? 413d5213 @00; 6i
2023-04-12T11:34:46.9261602Z [OPENOCD] Error: 1150 307 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=1
2023-04-12T11:34:46.9261894Z [OPENOCD] Debug: 1151 307 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
2023-04-12T11:34:46.9262216Z [OPENOCD] Debug: 1152 307 riscv-013.c:806 execute_abstract_command(): command 0x221019 failed; abstractcs=0x0
2023-04-12T11:34:46.9262568Z [OPENOCD] Debug: 1153 307 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 6i
2023-04-12T11:34:46.9262843Z [OPENOCD] Debug: 1154 307 riscv-013.c:407 scan():  cmderr=7 ->
2023-04-12T11:34:46.9263126Z [OPENOCD] Debug: 1155 308 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 6i
2023-04-12T11:34:46.9263402Z [OPENOCD] Debug: 1156 308 riscv-013.c:407 scan():  ->  cmderr=7
2023-04-12T11:34:46.9263701Z [OPENOCD] Debug: 1157 308 riscv.c:3450 riscv_get_register(): [riscv.tap.0] s9: ffffffffffffffff
2023-04-12T11:34:46.9264005Z [OPENOCD] Debug: 1158 308 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
andreaskurth commented 1 year ago

which contains an OpenOCD patch to dmireset on error (could be useful, not sure if we want to upstream that)

Nice, I think it would be good to upstream that (though it's not the most urgent thing here).

As 41b r 00000000 @16 -> b ffffffff @7f repeats multiple times, I have the impression that the JTAG tdo wire is stuck at 1 for a while. I don't know the mechanism behind it, though. Would it be possible to get even more visibility into scan() and which bits are scanned in and out? Alternatively, the debugger could do a JTAG Test-Logic-Reset when it gets all ones and retry the command.

pamaury commented 1 year ago

OpenOCD indeed has another, even more verbose level of logging, but it is too verbose to use with Azure. I modified my PR to run every test with a custom runner that saves the logs to files and uploads them as an artefact:

Since the logs are quite large, they can be found by downloading the archive and looking at one of the failed tests:

EDIT: I did another run because the previous one had incomplete output.

Analysis: the log is quite long; the entire relevant part can be found in this gist, and below I am including only the semi-high-level error messages (the very low-level JTAG ones are in the gist):

Debug: 12382 1056 riscv-013.c:783 execute_abstract_command(): command=0x22100e; access register, size=32, postexec=0, transfer=1, write=0, regno=0x100e
Debug: 12466 1057 riscv-013.c:397 scan(): 41b w 0022100e @17 -> + fffffffe @7f; 2i
Debug: 12550 1057 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 2i
Debug: 12551 1057 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
Debug: 12661 1058 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12715 1058 riscv-013.c:397 scan(): 41b r 00000000 @16 -> b ffffffff @7f; 3i
Debug: 12716 1058 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
Debug: 12826 1059 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
Debug: 12880 1059 riscv-013.c:397 scan(): 41b r 00000000 @16 -> ? 413d5213 @00; 4i
Error: 12881 1059 riscv-013.c:606 dmi_op_timeout(): failed read at 0x16, status=1
Debug: 12991 1060 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071
Debug: 12992 1060 riscv-013.c:806 execute_abstract_command(): command 0x22100e failed; abstractcs=0x0
Debug: 13076 1060 riscv-013.c:397 scan(): 41b w 00000700 @16 -> + 00000000 @00; 4i
Debug: 13077 1060 riscv-013.c:407 scan():  cmderr=7 -> 
Debug: 13131 1060 riscv-013.c:397 scan(): 41b - 00000000 @16 -> + 00000700 @16; 4i
Debug: 13132 1060 riscv-013.c:407 scan():  ->  cmderr=7
Debug: 13133 1060 riscv.c:3450 riscv_get_register(): [riscv.tap.0] a4: ffffffffffffffff
Debug: 13134 1060 gdb_server.c:1483 gdb_error(): Reporting -4 to GDB as generic error
Debug: 13135 1060 gdb_server.c:406 gdb_log_outgoing_packet(): [riscv.tap.0] sending packet: $E0E#ba
timothytrippel commented 1 year ago

which contains an OpenOCD patch to dmireset on error (could be useful, not sure if we want to upstream that)

Nice, I think it would be good to upstream that (though it's not the most urgent thing here).

This is awesome work @pamaury ! Another thing we could do is just apply a patch to the openocd code at the bazel level since we vendor it in.

pamaury commented 1 year ago

I have an update on the flaky tests, and it unfortunately points to a potentially worrying issue. In this run I simply repeated, 20 times, a test where GDB connects to OpenOCD and does nothing. The GDB script is literally:

target extended-remote :3333
echo :::: Done.\\n

The bitstream used disables ROM execution, meaning that the ROM only executes ~10 instructions when booting and then loops:

_rom_start_boot:
  // Set up the global pointer and re-enable linker relaxations.
  la gp, __global_pointer$
  .option pop

LABEL_FOR_TEST(kRomStartBootMaybeHalt)
  // Check if we should halt here.
  li   a0, (TOP_EARLGREY_OTP_CTRL_CORE_BASE_ADDR + \
            OTP_CTRL_SW_CFG_WINDOW_REG_OFFSET)
  lw   t0, OTP_CTRL_PARAM_CREATOR_SW_CFG_ROM_EXEC_EN_OFFSET(a0)
  bnez t0, .L_exec_en
LABEL_FOR_TEST(kRomStartBootHalted)
.L_halt_loop:
  wfi
  j .L_halt_loop

Just doing this already triggers 4 failures out of 20 runs. The Azure DevOps run contains an artefact with the logs. The important thing to note here is that simply connecting GDB to OpenOCD actually triggers many operations, like halting the core and reading registers.

timothytrippel commented 1 year ago

Can this issue be closed now, i.e., did https://github.com/lowRISC/opentitan/pull/18051 solve the remaining issues? Or are there tests still failing sporadically? If so which ones are still failing?

engdoreis commented 1 year ago

Closing this issue for now, as the issues found were fixed, which substantially improved the GDB tests. There is still some flakiness, but I'll triage it and open a new issue if needed.

timothytrippel commented 1 year ago

Just to confirm, @engdoreis: is there still flakiness with any of the GDB tests, or are different tests flaky?

engdoreis commented 1 year ago

Here are the latest 5 runs (screenshot attached).

So in the last 2 runs all the tests passed, but 4 tests failed before that. Here are the logs for the March 26th failure.

[OPENOCD] Open On-Chip Debugger 0.12.0
[OPENOCD] Licensed under GNU GPL v2
[OPENOCD] For bug reports, read
[OPENOCD]   http://openocd.org/doc/doxygen/bugs.html
[OPENOCD] adapter srst delay: 10
[OPENOCD] Info : Hardware thread awareness created
[OPENOCD] force hard breakpoints
[OPENOCD] Info : Listening on port 6666 for tcl connections
[OPENOCD] Info : Listening on port 4444 for telnet connections
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Error: libusb_open() failed with LIBUSB_ERROR_NO_DEVICE
[OPENOCD] Info : clock speed 200 kHz
[OPENOCD] Info : JTAG tap: riscv.tap tap/device found: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)
[OPENOCD] Warn : JTAG tap: riscv.tap       UNEXPECTED: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)
[OPENOCD] Error: JTAG tap: riscv.tap  expected 1 of 1: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[OPENOCD] Info : JTAG tap: auto0.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
[OPENOCD] Error: Trying to use configured scan chain anyway...
[OPENOCD] Warn : AUTO auto0.tap - use "jtag newtap auto0 tap -irlen 2 -expected-id 0x04f5484d"
[OPENOCD] Error: auto0.tap: IR capture error; saw 0x0003 not 0x0001
[OPENOCD] Warn : Bypassing JTAG setup events due to errors
[OPENOCD] Error: dtmcontrol is 0. Check JTAG connectivity/board power.
[OPENOCD] Warn : target riscv.tap.0 examination failed
[OPENOCD] Info : starting gdb server for riscv.tap.0 on 3333
Error: OpenOCD failed to examine the core.

I think we should watch the next runs and open issues if they keep failing.

timothytrippel commented 1 year ago

Thanks @engdoreis for the data! Let's keep this open and follow up with @pamaury, @andreaskurth, and @msfschaffner next week (I added a meeting to everyone's calendar).

andreaskurth commented 1 year ago

The error messages

[OPENOCD] Warn : JTAG tap: riscv.tap       UNEXPECTED: 0x0afaffff (mfg: 0x7ff (<invalid>), part: 0xafaf, ver: 0x0)

and

[OPENOCD] Error: auto0.tap: IR capture error; saw 0x0003 not 0x0001

seem to fall into the 'stuck-at-1' pattern. I assume OpenOCD puts the JTAG TAP into Test-Logic-Reset before it scans out IDCODE (though we should confirm that to be sure), which makes this failure special, because I think it's the first time we're seeing this problem outside the Abstract Command mechanism (that problem is described in #18087). We don't know yet where the root of the problem lies, but I think this provides evidence that it's more likely in the JTAG part (incl. the TAP implementation) or the I/O circuitry than in the RISC-V-specific part of the debug module (and even less likely that it has something to do with Ibex). This is very valuable for root-causing the problem.

To get around this problem in the short term, we should be able to use the same technique as in #18051, though now in a different part of OpenOCD code. @pamaury WDYT?

pamaury commented 1 year ago

I think you are right: the IDCODE is handled just by the TAP, and it's quite a strange output here. It also comes to mind that we haven't really tried to eliminate the JTAG adapter as a source of error: is it possible that some (or all?) of the adapters have a problem? One idea to check this theory would be to log the ID of the adapter on every run so that we can see whether there is a correlation between errors and adapters.

andreaskurth commented 1 year ago

It also comes to mind that we haven't really tried to eliminate the JTAG adapter as a source of error: is it possible that some (or all?) of the adapters have a problem? One idea to check this theory would be to log the ID of the adapter on every run so that we can see whether there is a correlation between errors and adapters.

I agree; we've reached a point where it makes sense to check whether the problem is correlated with one (or a few) specific JTAG adapters.

johngt commented 1 year ago

For everyone's knowledge: we are getting a CW310 up and running that will run continual tests for this. If there are no failures over a day or so, that pushes more towards a fault with individual JTAG adapters and boards. In parallel, the failed runs will be examined to see whether a common board can be identified across the failures. This is possible given how jobs are allocated in the same order (without randomisation across the boards), so quite frequently the same runs will land on the same board. If these two activities lead to a JTAG adapter being identified, it will be swapped out; otherwise further investigation will be needed. Per discussions on the thread, the mitigation / software workaround seems practical, although not ideal. With more time this can be debugged further. If there are alternative suggestions, please continue to list / detail them. @luismarques / @jprendes for visibility.

johngt commented 1 year ago

Initial investigations suggest that more than one board might be affected, but there is one board that is affected more than the others.

andreaskurth commented 1 year ago

I'm working on getting an Integrated Logic Analyzer (ILA) core connected to the relevant JTAG signals on an FPGA. That's not trivial because synthesis and implementation algorithms mangle signals. Setting signal-specific don't touch and mark debug attributes didn't help, but marking the signals as debug via TCL commands in XDC files seems to help. A long turnaround time due to high resource utilization on the FPGA doesn't allow fast progress, but I'm slowly getting there ...

a-will commented 1 year ago

I'm working on getting an Integrated Logic Analyzer (ILA) core connected to the relevant JTAG signals on an FPGA. That's not trivial because synthesis and implementation algorithms mangle signals. Setting signal-specific don't touch and mark debug attributes didn't help, but marking the signals as debug via TCL commands in XDC files seems to help. A long turnaround time due to high resource utilization on the FPGA doesn't allow fast progress, but I'm slowly getting there ...

Note that there is also the option of bringing out signals to pins for an external logic analyzer, and unlike with an ILA, this can be done without disturbing the placement and routing of an existing design checkpoint. Since it needs unused, accessible pins on the board, it isn't able to inspect as many signals as you can with an ILA, but if you find adding the ILA makes the problem go away...

In the past, I did this for pwrmgr debug with Tcl commands like create_port, create_net, connect_net, etc. Once you have a script set up, turnaround time becomes significantly shorter (assuming you are able to readily find the signals of interest after the tool's renaming).

msfschaffner commented 1 year ago

One other note: You can generate an ILA core that you can instantiate in RTL pre-synthesis. I've used this in the past when synthesis mangled signals too much.

pamaury commented 1 year ago

While working on this, one issue I have identified (I think), and which is quite problematic, is that if the ROM tries to boot but does not succeed, it will reset the chip and try again. This can be the case in some tests where we do not bootstrap but do not disable ROM execution either. It could also be the case for a chip in production with its flash erased. The problem is that there is then an inherent race between doing JTAG operations and the reset, since the reset will also reset the JTAG connection. With the RISC-V TAP, one can at least reset the core and halt it immediately, but this still leaves the watchdog to disable, I think, hence it is still racy. Am I missing something? Is there a good solution to this?

a-will commented 1 year ago

While working on this, one issue I have identified (I think), and which is quite problematic, is that if the ROM tries to boot but does not succeed, it will reset the chip and try again. This can be the case in some tests where we do not bootstrap but do not disable ROM execution either. It could also be the case for a chip in production with its flash erased. The problem is that there is then an inherent race between doing JTAG operations and the reset, since the reset will also reset the JTAG connection. With the RISC-V TAP, one can at least reset the core and halt it immediately, but this still leaves the watchdog to disable, I think, hence it is still racy. Am I missing something? Is there a good solution to this?

ndmreset is supposed to reset the watchdog as well. If it doesn't, that would be a bug, I think.

Hm... but maybe JTAG could be interrupted while trying to issue commands to halt and reset. I guess that might still need a retry loop to verify that we have halted at the intended PC.

pamaury commented 1 year ago

Yes, ndmreset does reset the watchdog. But as you pointed out, the issue is that the chip could reset before we even have time to issue this command. When that happens, the JTAG connection will most likely fail because OpenOCD will get confused, so we need to retry the whole connection to OpenOCD and try to stop things again. If the ROM is in a boot loop, it will reset after a very short time, so the window to stop the core could be quite small. Furthermore, if we connect to the LC TAP, then we have no way to halt the core. Another failure I have observed is quite nasty:

a-will commented 1 year ago

Here we might have a troublesome collision between debug (and manufacturing access?) vs. security.

Is there something ROM should do (maybe together with TAP straps or SW straps?) to present a synchronization point for access? Or perhaps should hardware behavior change to prevent resetting the JTAG TAP by anything but the TAP controller when access is authorized (and the TAP straps select one)?

If we can successfully get through rom_init(), then setting the SW straps to enable bootstrapping does present such a synchronization point. But that doesn't help if you want to actually debug the early parts of the ROM.

pamaury commented 1 year ago

Maybe we could use the RMA software straps here, because I think the purpose of those was essentially to give the software time to connect over JTAG. One thing we already brought up, though, is that the default number of cycles to spin (CREATOR_SW_CFG_RMA_SPIN_CYCLES, set to 10) is so low that it's actually impossible to connect over JTAG at all.

Your point about (ab)using the ROM bootstrap is great though: if we set those pins, that gives us plenty of time to halt the core, I think. One downside indeed is that if the low-level init fails for some reason, the chip will probably reset early.

timothytrippel commented 1 year ago

Another option for the tests is to just bootstrap a signed "empty test", i.e., one that just returns true, so that this reset loop doesn't happen. Right now the plan is for the OTP setting for ROM execution to default to disabled, so this race condition shouldn't happen in production, but currently the OTP image for the test_unlocked LC states defaults to ROM exec enabled.

Maybe the test_unlocked* OTP images should be changed to reflect reality (i.e., have ROM execution set to disabled)? CC: @moidx
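
For reference, such an "empty test" can be tiny. A minimal sketch using the on-device test framework (header path and macro as I believe they exist in the tree today; treat the exact names as an assumption):

#include <stdbool.h>

#include "sw/device/lib/testing/test_framework/ottf_main.h"

OTTF_DEFINE_TEST_CONFIG();

// Intentionally does nothing: bootstrapping this (signed) image keeps the ROM
// from entering its boot-fail reset loop while the debugger attaches.
bool test_main(void) {
  return true;
}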

moidx commented 1 year ago

I prefer to have an empty test for now (in the interest of time), and have a focus test that has ROM_EXEC disabled in OTP. Changing the default for TestUnlocked* would require adding more overhead across e2e and top-level tests. I am ok with it, but not sure we have enough time to make that change this week.

pamaury commented 1 year ago

Update on stress tests after applying #18554: we still have failures but many follow a particular, very suspicious pattern:

[GDB] :::: Send OpenOCD the 'reset halt' command.
Debug: 814 1085 riscv-013.c:397 scan(): 41b w 80000001 @10 -> + 00000000 @00; 2i
Debug: 815 1085 riscv-013.c:407 scan(): haltreq dmactive -> 
[GDB] JTAG tap: riscv.tap tap/device found: 0x04f5484d (mfg: 0x426 (Google Inc), part: 0x4f54, ver: 0x0)
Debug: 816 1086 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + 80000001 @10; 2i
Debug: 817 1086 riscv-013.c:407 scan():  -> haltreq dmactive
Debug: 818 1086 riscv-013.c:397 scan(): 41b w 80000003 @10 -> + 00000000 @10; 2i
Debug: 819 1086 riscv-013.c:407 scan(): haltreq ndmreset dmactive -> 
[CONSOLE] [2023-05-12T10:53:28Z INFO  opentitanlib::util::usb] Could not open device at bus=1 address=108: No such device (it may have been disconnected)
Debug: 820 1086 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + 80000003 @10; 2i
Debug: 821 1086 riscv-013.c:407 scan():  -> haltreq ndmreset dmactive
Debug: 822 1086 command.c:155 script_debug(): command - riscv.tap.0 invoke-event reset-assert-post
Debug: 823 1086 command.c:155 script_debug(): command - riscv.tap.0 invoke-event reset-deassert-pre
Debug: 824 1086 command.c:155 script_debug(): command - transport select
Debug: 825 1086 command.c:155 script_debug(): command - expr  [ string first "jtag" $_TRANSPORT ] != -1 
Debug: 826 1086 command.c:155 script_debug(): command - riscv.tap.0 cget -chain-position
Debug: 827 1086 command.c:155 script_debug(): command - jtag tapisenabled riscv.tap
Debug: 828 1086 command.c:155 script_debug(): command - riscv.tap.0 arp_reset deassert 1
Debug: 829 1086 target.c:2199 target_free_all_working_areas_restore(): freeing all working areas
Debug: 830 1086 riscv.c:1283 riscv_deassert_reset(): [0]
[CONSOLE] [2023-05-12T10:53:28Z INFO  opentitanlib::util::usb] Could not open device at bus=1 address=12: No such device (it may have been disconnected)
[CONSOLE] [2023-05-12T10:53:28Z INFO  opentitanlib::util::usb] Could not open device at bus=1 address=52: No such device (it may have been disconnected)
Debug: 831 1087 riscv-013.c:397 scan(): 41b w 80000001 @10 -> + 00000000 @10; 2i
Debug: 832 1087 riscv-013.c:407 scan(): haltreq dmactive -> 
Debug: 833 1087 riscv-013.c:397 scan(): 41b - 00000000 @10 -> + fff00001 @7f; 2i
[CONSOLE] [2023-05-12T10:53:28Z INFO  opentitanlib::util::usb] Could not open device at bus=1 address=10: No such device (it may have been disconnected)
Debug: 834 1087 riscv-013.c:397 scan(): 41b w 80000001 @10 -> b ffffffff @7f; 2i
Debug: 835 1087 riscv-013.c:407 scan(): haltreq dmactive -> 
Debug: 836 1087 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=3, ac_busy_delay=0
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [99, 2]
[CONSOLE] Debug: 837 1088 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0xffffffff
[2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [100, 2]
Debug: 838 1088 riscv-013.c:397 scan(): 41b w 80000001 @10 -> b ffffffff @7f; 3i
Debug: 839 1088 riscv-013.c:407 scan(): haltreq dmactive -> 
Debug: 840 1088 riscv-013.c:460 increase_dmi_busy_delay(): dtmcs_idle=1, dmi_busy_delay=4, ac_busy_delay=0
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [104, 2]
[CONSOLE] [2023-05-12T10:53:28Z DEBUG opentitanlib::transport::cw310::usb] WRITE_CTRL: bmRequestType: 41, bRequest: 34, wValue: 00a0, wIndex: 0000, data: [105, 2]
Debug: 841 1088 riscv-013.c:451 dtmcontrol_scan(): DTMCS: 0x10000 -> 0x1071

Here it looks like, even though OpenOCD was told not to use srst, something is reset; or maybe the strap pins are touched by opentitantool because the USB adapter is disconnected/reconnected. I am investigating this.