Closed rswarbrick closed 2 months ago
Those failures are all due to enabling ECC more widely in block-level DV some time ago. Before doing this, the nightlies all looked good. The last RTL change was already before that.
In the meantime, @matutem has been adjusting more block-level DV sequences to let them deal with ECC errors correctly. There are still some to do but I am not to worried about this.
@vogelpi to update test list
The current status (May 19 nightly regression) is as follows:
Discussed in triage: moving to M5 as DV issue because RTL hasn't been changed in a while, many tests passed before enabling ECC in DV, and the remaining DV fails seem to be caused by DV problems rather than RTL problems
@matutem is still pending regression results for the following test cases:
flash_ctrl_read_word_sweep_derr
flash_ctrl_integrity
Moving this issue to M7, as just discussed in triage meeting. Those tests fail because ECC checks have recently been enabled, but flash_ctrl was already signed off at V2S without those checks. It is out of scope of M5 to bring flash_ctrl DV to a state in which all tests pass for more than 90% of the seeds with ECC checks enabled.
We think the risk for critical RTL bugs in flash_ctrl is lower than when it was signed off at V2S because DV now covers the ECC feature and most tests pass. Also, flash_ctrl is getting exercised in almost all top-level tests as well as in almost all SiVal tests, and no critical bugs have been found.
Based on last week's discussion, I decided it was useful to document the history of this and the current state in terms of regression failures and how all this aligns with the V2S signoff, as well as with the RTL and DV changes done for Earlgrey-PROD.
Below, you can see an overview of the regressions results of the last couple of months. Every column corresponds to one regression run, every row corresponds to one test (note that I had to split the many tests into four views, i.e., the stacked images). I've annotated the most important RTL and DV PRs to corresponding regression runs.
What we can see:
So to summarize:
Why ECC was switched off in so many tests is unclear and wasn't documented. We doubt that there was a good reason to disable this core feature in so many tests. Ideally, we would clean up all these failures of course but we shouldn't gate on this. Overall, I am convinced we are in a better shape on all sides compared to ES (DV, RTL, security) and I believe we shouldn't worry about Flash regarding Earlgrey-PROD.
FYI @andreaskurth , @matutem , @johngt, @moidx , @jonmichelson
Thanks for the summary @vogelpi
FYI, Look at the flash_ctrl tests based off the recent results.
Tests included in this ticket which fall below 100%:
80: Flash_ctrl_serr_counter Flash_ctrl_oversize_error Flash_ctrl_rw_evict Flash_ctrl_rw_derr Flash_ctrl_rw_serr
0%: flash_ctrl_derr_detect
We also have some outside of the ones flagged here:
Tests which fall below 100%:
80: Flash_ctrl_ro Flash_ctrl_rw Flash_ctrl_phy_ack_consistency
Hierarchy of regression failure
Block level
Failure Description
The nightly DV runs report lots of failing tests in
flash_ctrl
. They seem to have started breaking on 28th April, including the following tests:@matutem: I think you landed a large flash_ctrl PR a few days ago. Would you mind taking a look at some of these: I think your PR might well be responsible for the failures.
Steps to Reproduce
Tests with similar or related failures
No response