[test-triage] chip_sw_sram_ctrl_main_scrambled_access_jitter_en

tjaychen commented 2 years ago

Hierarchy of regression failure

Chip Level

Failure Description

UVM_ERROR @ 4510.502520 us: (sw_logger_if.sv:521) [sram_ctrl_main_scrambled_access_test_prog_sim_dv(sw/device/tests/sim_dv/sram_ctrl_main_scrambled_access_test.c:251)] CHECK-fail: mmio_region_read32(mmio_region_from_addr( TOP_EARLGREY_SRAM_CTRL_RET_AON_RAM_BASE_ADDR), INTEGRITY_EXCEPTION_COUNT_OFFSET) == SRAM_CTRL_TESTUTILS_DATA_NUM_WORDS UVM_INFO @ 4510.502520 us: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]

Steps to Reproduce

https://github.com/lowrisc/opentitan/tree/51417ec839826e2ca585eac843f25bcbb7f8ec9f
./util/dvsim/dvsim.py -t vcs -w fsdb hw/top_earlgrey/dv/chip_sim_cfg.hjson -i -s 2199609583 -r 1 --build-seed 3859254937

Tests with similar or related failures

tjaychen commented 2 years ago

assign to @tjaychen for initial triage.

tjaychen commented 2 years ago

After some investigation, it appears the test flow is roughly as follows:

provision an error counting offset in the retention sram.
request a new scrambling key (across reboots), and read data that has been backdoor written into the main sram using the previous key
since the scrambling key is now new, the de-scrambled data is incorrect and should fail integrity checks
on each integrity failure (internal exception to the ibex), increment the error counting offset in retention sram
check there are expected number of errors in the retention sram.

This works most of the time. In this particular seed, 1 out of 4 incorrectly de-scrambled data happens to pass integrity checks (this is expected eventually and has been observed by @GregAC elsewhere). As a result, instead of 4 mismatched counts we get 3, and thus the test fails.

Relying on integrity failures like this across randomized keys is probably not going to work 100% of the time. It might be a better idea, for the purposes of this test, to silence the integrity errors via testbench forcing, and just rely on the data itself being incorrect (since the de-scrambled results will essentially never match). That might make the test more robust.

I'll leave it to the next on-call to decide the next best steps.

tjaychen commented 2 years ago

unassign @tjaychen for next on-call.

johngt commented 2 years ago

Assigning to @engdoreis to look into this. This may be a task for @andreaskurth at some point but it might be best to see if @abdullahvarici and @engdoreis can perform a first pass on this. If it will take more than a couple of hours of work, best to track as an issue; if that is the case ping me and we will create a separate issue for tracking this effort.

engdoreis commented 2 years ago

I tested again with the HEAD of master and the test passed with the following command:

./util/dvsim/dvsim.py hw/top_earlgrey/dv/chip_sim_cfg.hjson -i chip_sw_sram_ctrl_main_scrambled_access_jitter_en -s 2199609583 -r 1 --build-seed 3859254937

I didn't go dip to discover which commit fixed it. But I beleive it was one of Greg's commits. @johngt should we close this issue?

johngt commented 2 years ago

Thanks @engdoreis - yes this looks like this was resolved after https://reports.opentitan.org/hw/top_earlgrey/dv/2022.08.15_16.33.18/results.html That was the last failed result and since then it has been passing on the nightlies. Closing out as this error no longer applies.

tjaychen commented 2 years ago

wait, i don't think we should just okay this because it was passing in nightly. Please see my comment above, this will fail again if we don't think about how to address it. It's more or less just a matter of probability.

engdoreis commented 2 years ago

Sorry, I missed that. I proposing two changes in this PR #14733 to make the test more robust.

Silence the integrity errors via testbench as @tjaychen proposed and was done in the chip_sw_otbn_mem_scramble test.
Expect that some random keys may not generate ECC errors for all data, so tolerating some false positives.

tjaychen commented 2 years ago

thanks @engdoreis i'll have a look.

johngt commented 2 years ago

Unassigning from lowRISC / myself as our on-call has ended. @tjaychen may want to revisit @engdoreis proposed PR.

lowRISC / opentitan