Open matutem opened 1 week ago
To discuss in triage meeting on next steps for this.
By @vogelpi : This behavior is similar to what we've seen on the controller interface. On the first error, the controller continues to read and expects data. This hasn't changed recently, though, so it shouldn't be a critical bug. Also it's only triggered by a double ECC error, so in normal operating conditions this should never occur.
Not sure if this is the same error I have seen when working on #23564.
In my case, when a data_err_o
was detected inside the flash_phy_rd module, this error gets routed up to the flash_rd_err_i
inside flash_ctrl_rd.
Inside this module, a FSM is responsible for the error handling. When this error is raised, the FSM enters the StErr
state:
https://github.com/lowRISC/opentitan/blob/c301d77bc1619dd7c24e8756c80926b387eb5512/hw/top_earlgrey/ip_autogen/flash_ctrl/rtl/flash_ctrl_rd.sv#L150-L157
In my opinion, this FSM state contains a bug:
If we have multiple read requests and the error happened somewhere in between, it enters this error FSM state and just raises the data_wr_o
signal. When raising this signal, two things happen:
However, was we are not raising the flash_req_o
signal inside flash_ctrl_rd
, no flash request is transmitted to the flash_phy_rd
module. This means whatever data is in the flash controller, it gets pushed into u_sw_rd_fifo
- this could be some old read data for example.
Thanks @nasahlpa for chiming in. To me it sounds that both these FSMs simply abort any following requests on double or single bit errors and return whatever data they currently get from the Flash macro. This behavior is certainly not ideal but I don't think it's a critical bug. Because the behavior is so similar, I think it was a deliberate decision to implement it like that but we've failed to document this.
Fixing this for Earlgrey-PROD is out of scope in my view. Even if we could ECO this, there is a high risk of introducing other bugs. Moreover, since neither the current nor the intended behavior seems to be documented, we don't know what the fix should look like.
To me, this sounds like a FutureRelease / Backlog thing.
Just discussed in triage: @vogelpi Can you please document this for M5 and link this issue from the PR? We can then move this issue to Backlog.
I've investigated this and was able to reproduce another such failure using the following command (the command provided above didn't work for me):
./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -v m -i flash_ctrl_rw_derr -mp 8 -r 1 -s 7783262009652151728048086876149679274750904586653904727972518016973989462944 --waves vpd
The result is:
* `UVM_ERROR (flash_ctrl_otf_scoreboard.sv:289) m_otf_scb [process_read] unexpected double bit error *` has 1 failures:
* Test flash_ctrl_rw_derr has 1 failures.
* 0.flash_ctrl_rw_derr.7783262009652151728048086876149679274750904586653904727972518016973989462944\
Line 1183794, in log /home/pirmin/ot/opentitan/scratch/master/flash_ctrl-sim-vcs/0.flash_ctrl_rw_derr/latest/run.log
UVM_ERROR @ 2759328.0 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00003800
UVM_INFO @ 2759328.0 ns: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
--- UVM Report catcher Summary ---
I've now been investigating the waves and I can see exactly the thing happening described by @nasahlpa above. I'll now break for lunch and will continue afterwards to see why exactly the double bit error is happening.
After some more debugging of additional cases, I can confirm the following:
flash_phy_rd
. This can be
flash_rd_mp_err_i
)flash_rd_err_i
(a read error reported by the read pipeline) if the read pipeline remains empty.flash_rd_err_i
if the read pipeline keeps on doing reads, e.g. for the host interface (code fetches).Based on this, I think we should document the observed behavior (flash_ctrl_rd
aborting on the first error). I've now filed a PR here to document this behavior: https://github.com/lowRISC/opentitan/pull/23974 .
Description
In this case a flash read request for 9 bus words at address 0x04a00 generates an expected double ecc error near time 773657 * 100 ps. However, the next read for 0x04a08 is not really sent to the flash core, and
Looking at the flash prim IO,
The problem is probably in flash_phy_rd.
To reproduce this failure: ./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -r 1 -v m -ro +max_quit_count=40 --waves fsdb -i flash_ctrl_rw_derr -ru -s 72066066308054948587157916832882385885089238640190272990405519719418074966861
The log of the error. UVM_ERROR @ 77905.7 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00004a08
Read address 0x00004a08 (word addr 0x941) @77375.7