lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.44k stars 730 forks source link

[rtl,flash_ctrl] Fix squashed read request #23891

Open matutem opened 1 week ago

matutem commented 1 week ago

Description

In this case a flash read request for 9 bus words at address 0x04a00 generates an expected double ecc error near time 773657 * 100 ps. However, the next read for 0x04a08 is not really sent to the flash core, and

Looking at the flash prim IO,

The problem is probably in flash_phy_rd.

To reproduce this failure: ./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -r 1 -v m -ro +max_quit_count=40 --waves fsdb -i flash_ctrl_rw_derr -ru -s 72066066308054948587157916832882385885089238640190272990405519719418074966861

The log of the error. UVM_ERROR @ 77905.7 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00004a08

Read address 0x00004a08 (word addr 0x941) @77375.7

johngt commented 5 days ago

To discuss in triage meeting on next steps for this.

andreaskurth commented 5 days ago

By @vogelpi : This behavior is similar to what we've seen on the controller interface. On the first error, the controller continues to read and expects data. This hasn't changed recently, though, so it shouldn't be a critical bug. Also it's only triggered by a double ECC error, so in normal operating conditions this should never occur.

nasahlpa commented 4 days ago

Not sure if this is the same error I have seen when working on #23564.

In my case, when a data_err_o was detected inside the flash_phy_rd module, this error gets routed up to the flash_rd_err_i inside flash_ctrl_rd.

Inside this module, a FSM is responsible for the error handling. When this error is raised, the FSM enters the StErr state: https://github.com/lowRISC/opentitan/blob/c301d77bc1619dd7c24e8756c80926b387eb5512/hw/top_earlgrey/ip_autogen/flash_ctrl/rtl/flash_ctrl_rd.sv#L150-L157 In my opinion, this FSM state contains a bug: If we have multiple read requests and the error happened somewhere in between, it enters this error FSM state and just raises the data_wr_o signal. When raising this signal, two things happen:

However, was we are not raising the flash_req_o signal inside flash_ctrl_rd, no flash request is transmitted to the flash_phy_rd module. This means whatever data is in the flash controller, it gets pushed into u_sw_rd_fifo - this could be some old read data for example.

vogelpi commented 4 days ago

Thanks @nasahlpa for chiming in. To me it sounds that both these FSMs simply abort any following requests on double or single bit errors and return whatever data they currently get from the Flash macro. This behavior is certainly not ideal but I don't think it's a critical bug. Because the behavior is so similar, I think it was a deliberate decision to implement it like that but we've failed to document this.

Fixing this for Earlgrey-PROD is out of scope in my view. Even if we could ECO this, there is a high risk of introducing other bugs. Moreover, since neither the current nor the intended behavior seems to be documented, we don't know what the fix should look like.

To me, this sounds like a FutureRelease / Backlog thing.

andreaskurth commented 4 days ago

Just discussed in triage: @vogelpi Can you please document this for M5 and link this issue from the PR? We can then move this issue to Backlog.

vogelpi commented 17 hours ago

I've investigated this and was able to reproduce another such failure using the following command (the command provided above didn't work for me):

./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -v m -i flash_ctrl_rw_derr -mp 8 -r 1 -s 7783262009652151728048086876149679274750904586653904727972518016973989462944 --waves vpd

The result is:

* `UVM_ERROR (flash_ctrl_otf_scoreboard.sv:289) m_otf_scb [process_read] unexpected double bit error *` has 1 failures:
    * Test flash_ctrl_rw_derr has 1 failures.
        * 0.flash_ctrl_rw_derr.7783262009652151728048086876149679274750904586653904727972518016973989462944\
          Line 1183794, in log /home/pirmin/ot/opentitan/scratch/master/flash_ctrl-sim-vcs/0.flash_ctrl_rw_derr/latest/run.log

                UVM_ERROR @ 2759328.0 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00003800
                UVM_INFO @ 2759328.0 ns: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
                --- UVM Report catcher Summary ---

I've now been investigating the waves and I can see exactly the thing happening described by @nasahlpa above. I'll now break for lunch and will continue afterwards to see why exactly the double bit error is happening.

vogelpi commented 13 hours ago

After some more debugging of additional cases, I can confirm the following:

Based on this, I think we should document the observed behavior (flash_ctrl_rd aborting on the first error). I've now filed a PR here to document this behavior: https://github.com/lowRISC/opentitan/pull/23974 .