lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.53k stars 754 forks source link

[flash_ctrl,dv] Lots of broken block-level tests in flash_ctrl #22879

Closed rswarbrick closed 2 months ago

rswarbrick commented 5 months ago

Hierarchy of regression failure

Block level

Failure Description

The nightly DV runs report lots of failing tests in flash_ctrl. They seem to have started breaking on 28th April, including the following tests:

@matutem: I think you landed a large flash_ctrl PR a few days ago. Would you mind taking a look at some of these: I think your PR might well be responsible for the failures.

Steps to Reproduce

Tests with similar or related failures

No response

vogelpi commented 4 months ago

Those failures are all due to enabling ECC more widely in block-level DV some time ago. Before doing this, the nightlies all looked good. The last RTL change was already before that.

In the meantime, @matutem has been adjusting more block-level DV sequences to let them deal with ECC errors correctly. There are still some to do but I am not to worried about this.

johngt commented 4 months ago

@vogelpi to update test list

vogelpi commented 4 months ago

The current status (May 19 nightly regression) is as follows:

andreaskurth commented 4 months ago

Discussed in triage: moving to M5 as DV issue because RTL hasn't been changed in a while, many tests passed before enabling ECC in DV, and the remaining DV fails seem to be caused by DV problems rather than RTL problems

moidx commented 3 months ago

@matutem is still pending regression results for the following test cases:

  1. flash_ctrl_read_word_sweep_derr
  2. flash_ctrl_integrity
andreaskurth commented 2 months ago

Moving this issue to M7, as just discussed in triage meeting. Those tests fail because ECC checks have recently been enabled, but flash_ctrl was already signed off at V2S without those checks. It is out of scope of M5 to bring flash_ctrl DV to a state in which all tests pass for more than 90% of the seeds with ECC checks enabled.

We think the risk for critical RTL bugs in flash_ctrl is lower than when it was signed off at V2S because DV now covers the ECC feature and most tests pass. Also, flash_ctrl is getting exercised in almost all top-level tests as well as in almost all SiVal tests, and no critical bugs have been found.

vogelpi commented 2 months ago

Based on last week's discussion, I decided it was useful to document the history of this and the current state in terms of regression failures and how all this aligns with the V2S signoff, as well as with the RTL and DV changes done for Earlgrey-PROD.

Below, you can see an overview of the regressions results of the last couple of months. Every column corresponds to one regression run, every row corresponds to one test (note that I had to split the many tests into four views, i.e., the stacked images). I've annotated the most important RTL and DV PRs to corresponding regression runs.

flash_ctrl_regression_v2s_1_prs flash_ctrl_regression_v2s_2_prs flash_ctrl_regression_v2s_3 flash_ctrl_regression_v2s_4

What we can see:

So to summarize:

Why ECC was switched off in so many tests is unclear and wasn't documented. We doubt that there was a good reason to disable this core feature in so many tests. Ideally, we would clean up all these failures of course but we shouldn't gate on this. Overall, I am convinced we are in a better shape on all sides compared to ES (DV, RTL, security) and I believe we shouldn't worry about Flash regarding Earlgrey-PROD.

FYI @andreaskurth , @matutem , @johngt, @moidx , @jonmichelson

johngt commented 2 months ago

Thanks for the summary @vogelpi

hayleynewton commented 2 months ago

FYI, Look at the flash_ctrl tests based off the recent results.

Tests included in this ticket which fall below 100%:

80: Flash_ctrl_serr_counter Flash_ctrl_oversize_error Flash_ctrl_rw_evict Flash_ctrl_rw_derr Flash_ctrl_rw_serr

0%: flash_ctrl_derr_detect

We also have some outside of the ones flagged here:

Tests which fall below 100%:

80: Flash_ctrl_ro Flash_ctrl_rw Flash_ctrl_phy_ack_consistency