[rv_dm] `rv_dm_access_after_wakeup` FPGA failures

jwnrt commented 4 months ago

Description

This test (which runs in rma, dev, and test_unlocked1) has been failing on FPGAs since commit b2239fc38e0725f17b1155d7e48ec6403facf7f6.

That commit is almost certainly not the cause of the RV DM error, but the size change seems to have triggered a change in the FPGA routing and broken something.

The error comes from OpenOCD failing to connect to the debug module after the chip wakes from deep sleep.

Here are the parts of the test where the failure triggers:

Here's what OpenOCD says:

CMSIS-DAP: JTAG supported
CMSIS-DAP: FW Version = 2.1.1
CMSIS-DAP: Serial# = 204437845853
CMSIS-DAP: Interface Initialised (JTAG)
SWCLK/TCK = 0 SWDIO/TMS = 0 TDI = 0 TDO = 0 nTRST = 0 nRESET = 0
CMSIS-DAP: Interface ready
clock speed 1000 kHz
cmsis-dap JTAG TLR_RESET
cmsis-dap JTAG TLR_RESET
JTAG scan chain interrogation failed: all ones
Check JTAG interface, timings, target power, etc.
Trying to use configured scan chain anyway...
riscv.tap: IR capture error; saw 0x1f not 0x01
cmsis-dap JTAG TLR_RESET
Bypassing JTAG setup events due to errors
Unsupported DTM version: 15
target riscv.tap.0 examination failed
gdb port disabled
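
A note on reading this log: the three error values are all what you would expect if TDO were stuck at logic 1, which fits the "all ones" message. The sketch below just checks that arithmetic. It assumes a 5-bit instruction register (inferred from the `0x1f` in the log, not confirmed here) and that the DTM version is taken from the low four bits of a 32-bit `dtmcs` read, as in the RISC-V debug specification. The names are illustrative and are not taken from OpenOCD or the test.

```python
# Sketch: what a JTAG host would read if TDO were constantly 1.
# IR_LEN is an assumption inferred from the 0x1f value in the log.
IR_LEN = 5
DTMCS_WIDTH = 32
DTMCS_VERSION_MASK = 0xF  # dtmcs.version occupies bits 3:0


def all_ones(width: int) -> int:
    """Return the value of `width` bits that are all set to 1."""
    return (1 << width) - 1


# IR capture: a healthy TAP returns 0b00001 here; a stuck-high TDO returns all ones.
ir_capture = all_ones(IR_LEN)

# dtmcs read: every bit is 1, so the version field is also all ones.
dtmcs = all_ones(DTMCS_WIDTH)
dtm_version = dtmcs & DTMCS_VERSION_MASK

print(f"IR capture: {ir_capture:#x}")
print(f"DTM version: {dtm_version}")
```

So the IR capture error and the unsupported DTM version are probably not separate faults. Both point to the TAP not driving TDO at all after wakeup.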

@a-will reports that this issue does not appear in FPGA bitstreams built with Vivado 2023, but it does with the 2021 version that our CI uses. The lifecycle controller TAP is also not working.

andreaskurth commented 4 months ago

Thx for reporting this issue @jwnrt. Adding to M4 to ensure we resolve this in time.

> This test (which runs in rma, dev, and test_unlocked1) has been failing on FPGAs since commit b2239fc.

Do you know if the parent commit (https://github.com/lowRISC/opentitan/commit/d77a3a32f975da034b29c9b56391b670eec46af8) is known good, i.e., the test passes in all LC states for that parent commit?

a-will commented 4 months ago

> Thx for reporting this issue @jwnrt. Adding to M4 to ensure we resolve this in time.
>
> > This test (which runs in rma, dev, and test_unlocked1) has been failing on FPGAs since commit b2239fc.
>
> Do you know if the parent commit (d77a3a3) is known good, i.e., the test passes in all LC states for that parent commit?

It does.

The commit where failures first appear seems to have merely triggered a latent bug, likely either in Vivado's synthesis / layout tools or in the timing of the JTAG enablement pathways.

andreaskurth commented 4 months ago

Ok, I'll take a closer look at the JTAG enablement pathways.

jwnrt commented 4 months ago

The test has started passing again with some recent RTL changes today, but it doesn't look like it was intentionally fixed. This could mean the issue still exists but is masked by a different routing on FPGAs?

a-will commented 4 months ago

> The test has started passing again with some recent RTL changes today, but it doesn't look like it was intentionally fixed. This could mean the issue still exists but is masked by a different routing on FPGAs?

Yes, that's right. We don't know if it is just a tool bug or some timing problem, though.

timothytrippel commented 4 months ago

FYI: the following tests need to be re-activated in CI once this is addressed: https://github.com/lowRISC/opentitan/pull/22744/commits/bd3e4ed7969274eb9009fa413d3dbaa01b79b10c

vogelpi commented 3 months ago

@andreaskurth has been able to reproduce this but couldn't root-cause it. The concern is that it could also be a problem on ASIC, but so far there is no indication of that: DV is fine. He is prioritizing other P0s and P1s.

@a-will if timing related, it could be that stuff is handled better on ASIC because the SDCs are not the same.

@moidx: we do have test coverage in GLS.

Agreed to leave the priority as is, but to work on other P0s and P1s first.

vogelpi commented 3 months ago

@moidx it would be nice to capture the findings so that someone else can pick up the work when they become available. @andreaskurth, would you be able to document the steps taken, please?

moidx commented 3 months ago

This may or may not be relevant:

--build-seed 104714960319679935410420483500971829136303708457300037460974663680452494898918

GitHub Revision: b29ffbb03c

VCS

UVM_FATAL @ * us: (chip_sw_rv_dm_access_after_wakeup_vseq.sv:56) [chip_sw_rv_dm_access_after_wakeup_vseq] Timed out waiting for device to enter normal sleep. has 3 failures:

Test chip_sw_rv_dm_access_after_wakeup has 3 failures.
0.chip_sw_rv_dm_access_after_wakeup.77787982882959533724642802343103680401343926437350432772420472162649361881555
Line 802, in log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/0.chip_sw_rv_dm_access_after_wakeup/latest/run.log

  UVM_FATAL @ 4575.453826 us: (chip_sw_rv_dm_access_after_wakeup_vseq.sv:56) [uvm_test_top.env.virtual_sequencer.chip_sw_rv_dm_access_after_wakeup_vseq] Timed out waiting for device to enter normal sleep.
  UVM_INFO @ 4575.453826 us: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
  --- UVM Report catcher Summary ---

1.chip_sw_rv_dm_access_after_wakeup.42045925832267773038863112318651299469133308811198817911363044455600557074244
Line 780, in log /container/opentitan-public/scratch/os_regression/chip_earlgrey_asic-sim-vcs/1.chip_sw_rv_dm_access_after_wakeup/latest/run.log

  UVM_FATAL @ 3673.546356 us: (chip_sw_rv_dm_access_after_wakeup_vseq.sv:56) [uvm_test_top.env.virtual_sequencer.chip_sw_rv_dm_access_after_wakeup_vseq] Timed out waiting for device to enter normal sleep.
  UVM_INFO @ 3673.546356 us: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
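
To compare failures like these across seeds, the `UVM_FATAL` lines can be pulled out of a `run.log` with a short script. This is a minimal sketch, assuming the line layout seen in the two excerpts above (`UVM_FATAL @ <time> us: (<file>:<line>) [<scope>] <message>`). The function name and the sample text are illustrative and not part of the OpenTitan tooling.

```python
import re

# Matches lines shaped like the UVM_FATAL entries quoted above.
UVM_FATAL_RE = re.compile(
    r"UVM_FATAL @ (?P<time_us>[\d.]+) us: "
    r"\((?P<file>[^:()]+):(?P<line>\d+)\) "
    r"\[(?P<scope>[^\]]+)\] (?P<msg>.*)"
)


def parse_uvm_fatals(log_text):
    """Return (time_us, file, line, message) for each UVM_FATAL line."""
    results = []
    for raw in log_text.splitlines():
        m = UVM_FATAL_RE.search(raw)
        if m:
            results.append(
                (float(m["time_us"]), m["file"], int(m["line"]), m["msg"].strip())
            )
    return results


# Sample text copied from the two failing seeds in this comment.
SAMPLE = """\
  UVM_FATAL @ 4575.453826 us: (chip_sw_rv_dm_access_after_wakeup_vseq.sv:56) [uvm_test_top.env.virtual_sequencer.chip_sw_rv_dm_access_after_wakeup_vseq] Timed out waiting for device to enter normal sleep.
  UVM_INFO @ 4575.453826 us: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
  UVM_FATAL @ 3673.546356 us: (chip_sw_rv_dm_access_after_wakeup_vseq.sv:56) [uvm_test_top.env.virtual_sequencer.chip_sw_rv_dm_access_after_wakeup_vseq] Timed out waiting for device to enter normal sleep.
"""

fatals = parse_uvm_fatals(SAMPLE)
for time_us, src_file, src_line, msg in fatals:
    print(f"{time_us:>12.6f} us  {src_file}:{src_line}  {msg}")
```

Both seeds fail at the same source line (`chip_sw_rv_dm_access_after_wakeup_vseq.sv:56`) with the same message. Only the timestamps differ.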

moidx commented 3 months ago

Moving to P2 as most critical use cases for rv_dm don't involve power transitions.

timothytrippel commented 3 months ago

They do involve software-initiated resets and PORs, but no sleep / wake functionality. Can we remove the broken tags here and here then (to get these running in presubmit again)?