[rtl,flash_ctrl] Fix squashed read request

matutem commented 1 week ago

Description

In this case a flash read request for 9 bus words at address 0x04a00 generates an expected double ecc error near time 773657 * 100 ps. However, the next read for 0x04a08 is not really sent to the flash core, and

Looking at the flash prim IO,

The address is not asserted with rd_req.
The data is still the prior read's data.
Descrambling depends on the address, so the mask is wrong and the data is most likely trash.
The testbench predicts a successful read, but the hardware is confused.

The problem is probably in flash_phy_rd.

To reproduce this failure: ./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -r 1 -v m -ro +max_quit_count=40 --waves fsdb -i flash_ctrl_rw_derr -ru -s 72066066308054948587157916832882385885089238640190272990405519719418074966861

The log of the error. UVM_ERROR @ 77905.7 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00004a08

Read address 0x00004a08 (word addr 0x941) @77375.7

johngt commented 5 days ago

To discuss in triage meeting on next steps for this.

andreaskurth commented 5 days ago

By @vogelpi : This behavior is similar to what we've seen on the controller interface. On the first error, the controller continues to read and expects data. This hasn't changed recently, though, so it shouldn't be a critical bug. Also it's only triggered by a double ECC error, so in normal operating conditions this should never occur.

nasahlpa commented 4 days ago

Not sure if this is the same error I have seen when working on #23564.

In my case, when a data_err_o was detected inside the flash_phy_rd module, this error gets routed up to the flash_rd_err_i inside flash_ctrl_rd.

Inside this module, a FSM is responsible for the error handling. When this error is raised, the FSM enters the StErr state: https://github.com/lowRISC/opentitan/blob/c301d77bc1619dd7c24e8756c80926b387eb5512/hw/top_earlgrey/ip_autogen/flash_ctrl/rtl/flash_ctrl_rd.sv#L150-L157 In my opinion, this FSM state contains a bug: If we have multiple read requests and the error happened somewhere in between, it enters this error FSM state and just raises the data_wr_o signal. When raising this signal, two things happen:

The counter get incremented and we eventually leave this error state when we reached the counter limit.
The u_sw_rd_fifo gets pushed.

However, was we are not raising the flash_req_o signal inside flash_ctrl_rd, no flash request is transmitted to the flash_phy_rd module. This means whatever data is in the flash controller, it gets pushed into u_sw_rd_fifo - this could be some old read data for example.

vogelpi commented 4 days ago

Thanks @nasahlpa for chiming in. To me it sounds that both these FSMs simply abort any following requests on double or single bit errors and return whatever data they currently get from the Flash macro. This behavior is certainly not ideal but I don't think it's a critical bug. Because the behavior is so similar, I think it was a deliberate decision to implement it like that but we've failed to document this.

Fixing this for Earlgrey-PROD is out of scope in my view. Even if we could ECO this, there is a high risk of introducing other bugs. Moreover, since neither the current nor the intended behavior seems to be documented, we don't know what the fix should look like.

To me, this sounds like a FutureRelease / Backlog thing.

andreaskurth commented 4 days ago

Just discussed in triage: @vogelpi Can you please document this for M5 and link this issue from the PR? We can then move this issue to Backlog.

vogelpi commented 17 hours ago

I've investigated this and was able to reproduce another such failure using the following command (the command provided above didn't work for me):

./util/dvsim/dvsim.py hw/top_earlgrey/ip_autogen/flash_ctrl/dv/flash_ctrl_sim_cfg.hjson -v m -i flash_ctrl_rw_derr -mp 8 -r 1 -s 7783262009652151728048086876149679274750904586653904727972518016973989462944 --waves vpd

The result is:

* `UVM_ERROR (flash_ctrl_otf_scoreboard.sv:289) m_otf_scb [process_read] unexpected double bit error *` has 1 failures:
    * Test flash_ctrl_rw_derr has 1 failures.
        * 0.flash_ctrl_rw_derr.7783262009652151728048086876149679274750904586653904727972518016973989462944\
          Line 1183794, in log /home/pirmin/ot/opentitan/scratch/master/flash_ctrl-sim-vcs/0.flash_ctrl_rw_derr/latest/run.log

                UVM_ERROR @ 2759328.0 ns: (flash_ctrl_otf_scoreboard.sv:289) uvm_test_top.env.m_otf_scb [process_read] unexpected double bit error 0x00003800
                UVM_INFO @ 2759328.0 ns: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
                --- UVM Report catcher Summary ---

I've now been investigating the waves and I can see exactly the thing happening described by @nasahlpa above. I'll now break for lunch and will continue afterwards to see why exactly the double bit error is happening.

vogelpi commented 13 hours ago

After some more debugging of additional cases, I can confirm the following:

In all cases investigated, a read of multiple words is happening via controller interface that results in a behavior similar to what @matutem and @nasahlpa described above: upon the first error for the controller read, the controller stops forwarding requests to the read pipeline and just returns the data currently output by flash_phy_rd. This can be
- An all-one vector in case of a memory property error (flash_rd_mp_err_i)
- All-zeros in case of a flash_rd_err_i (a read error reported by the read pipeline) if the read pipeline remains empty.
- Any other values in case of a flash_rd_err_i if the read pipeline keeps on doing reads, e.g. for the host interface (code fetches).
I haven't been able to produce exactly the case reported by @matutem, where the unexpected double bit error is reported on the second read of the sequence. The unexpected errors I observed were always reported on the first read.
I did check the scrambler inputs and FIFO fill levels and outputs inside the read pipeline in those cases. I couldn't see anything suspicious. The scrambler got the right address and descrambling was then done using the obtained mask and the data returned from the Flash primitive.

Based on this, I think we should document the observed behavior (flash_ctrl_rd aborting on the first error). I've now filed a PR here to document this behavior: https://github.com/lowRISC/opentitan/pull/23974 .

lowRISC / opentitan

[rtl,flash_ctrl] Fix squashed read request #23891

Description