lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[otbn,dv] The `otbn_sec_cm` test is broken for all seeds #23578

Closed: rswarbrick closed this issue 3 months ago

rswarbrick commented 3 months ago

Description

A bit of bisection shows that this error was caused by:

574ece386f * [flash_ctrl/rtl] Enable SecFifoPtr for TL-UL FIFOs

To run a test: `util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123`. This will fail with a message like:

* `xmsim: *E,ASRTST (/home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/default/src/lowrisc_prim_fifo_*/rtl/prim_fifo_sync.sv,151): Assertion DataKnown_A has failed` has 1 failures:
    * Test otbn_sec_cm has 1 failures.
        * 0.otbn_sec_cm.123\
          Line 150, in log /home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/0.otbn_sec_cm/latest/run.log

                xmsim: *E,ASRTST (/home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/default/src/lowrisc_prim_fifo_0/rtl/prim_fifo_sync.sv,151): (time 1014314 PS) Assertion tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.DataKnown_A has failed
                UVM_ERROR @   1014314 ps: (prim_fifo_sync.sv:151) [ASSERT FAILED] DataKnown_A
                UVM_INFO @   1014314 ps: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
                --- UVM Report catcher Summary ---

Taking a quick look at a log, this error appears just after:

UVM_INFO @   1009051 ps: (prim_count_if.sv:38) [tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.u_prim_count_if] Forcing tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.cnt_q[0] from 0 to 1

so presumably we're telling a FIFO that it has entries, but they actually turn out to be 'X.

My guess is that the assertion needs turning off in this situation with an $assertoff call.
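
For reference, the sort of thing I mean, as a sketch only (the hierarchical path is taken from the failure log above; exactly where to hook this into the testbench is an open question):

    // Sketch: temporarily disable assertions in the affected FIFO instance
    // (path from the failure log) while the fault is injected, then re-enable.
    $assertoff(0, tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo);
    // ... inject the fault and let the alert propagate ...
    $asserton(0, tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo);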

@nasahlpa: I think this change is yours, and I notice that you've tweaked the TLUL adapter. Does my analysis above sound plausible to you?

rswarbrick commented 3 months ago

I've put some effort into debugging this, but haven't got to the bottom of things yet. I've tried sprinkling "ignore this assertion" flags all over the place and I've got as far as... an alert coming out and no change in the STATUS register.

That seems to be an RTL bug, if I understand things correctly:

I believe we should have the invariant that we always go to the locked state after emitting a fatal error.

You can reproduce this by grabbing the otbn-sec-cm-debugging branch from my fork (https://github.com/rswarbrick/opentitan/tree/otbn-sec-cm-debugging) and running:

util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123

The debug logs contain this:

UVM_INFO @  21955791 ps: (prim_count_if.sv:38) [tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.u_prim_count_if] Forcing tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.cnt_q[0] from 0 to 1
UVM_INFO @  21966317 ps: (cip_base_vseq__sec_cm_fi.svh:32) uvm_test_top.env.virtual_sequencer [uvm_test_top.env.virtual_sequencer.otbn_common_vseq] expected fatal alert is triggered for SecCmPrimCount
UVM_FATAL @  64075580 ps: (otbn_scoreboard.sv:497) uvm_test_top.env.scoreboard [uvm_test_top.env.scoreboard] A fatal alert arrived 4000 cycles ago and we still don't think it should have done.
UVM_INFO @  64075580 ps: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER] 

and the waves support the analysis I wrote above.

rswarbrick commented 3 months ago

@GregAC: Would you mind taking a look at this? I'm rather hoping that I've misunderstood a bit of the OTBN security model.

nasahlpa commented 3 months ago

Thanks for raising this and for your effort debugging it!

I've only added the .Secure(1) option to the u_reqfifo and u_sramreqfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L526
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L550
Before my change, this option was only set for the u_rspfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L576

As the issue you are seeing also affects other modules (cf. #23567), I also tried debugging this but could not find a solution. I wonder why it seems to work for the u_rspfifo but not for the other FIFOs.

In the worst case we need to revert #23515, but from a security point of view it would be better to have the Secure option also enabled for these FIFOs. WDYT @vogelpi?

vogelpi commented 3 months ago

I am investigating this.

vogelpi commented 3 months ago

@nasahlpa will now attempt to fix this. It would be better to keep #23515.

vogelpi commented 3 months ago

I've now discussed this with Pascal and I believe there is a very easy way to fix this: inside tlul_adapter_sram.sv, where the issue originates, we use multiple prim_fifo_sync primitives with the OutputZeroIfEmpty parameter enabled (the default). What happens during these error tests is that a fault is injected which makes the FIFO think it is non-empty; as a result, it no longer outputs 0 (as per the parameter) but X.

My suggestion is to change the RTL of the primitive such as follows:

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end

to

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty || err_o ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end

meaning that in the error case we output 0 instead of X. This affects multiple primitives in the design, but since the error signal also factors into a fatal alert, outputting a known deterministic value does not add a security risk. This will solve DV issues in multiple blocks that were introduced by turning on the hardening of these FIFOs.

nasahlpa commented 3 months ago

Thanks for this suggestion. However, this does not quite work:

With this RTL change, we would output a 0 when err_o is raised by the u_fifo_cnt, but rvalid_o of the FIFO would still be set. This means the TL-UL adapter assumes that this data is valid and sends the 0 back over the bus, causing several issues (a TL-UL exception, as e.g. the size/user fields do not match the expected values).

Instead, we should change:

    assign rvalid_o = ~empty & ~under_rst;

to

    assign rvalid_o = ~empty & ~under_rst & ~err_o;

in addition to the 0 output on err_o as suggested above.
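
Putting the two edits together, a rough sketch of how this region of prim_fifo_sync.sv might then look (surrounding context is assumed, not copied from the actual file):

    // Sketch only: combines the two suggested edits, assuming err_o is the
    // error output of the hardened pointer/counter logic in this FIFO.
    assign rvalid_o = ~empty & ~under_rst & ~err_o;

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      // Also squash the read data when an error has been detected, so X data
      // from a faulted FIFO never propagates to the adapter.
      assign rdata_o = (empty || err_o) ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end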

GregAC commented 3 months ago

@rswarbrick @nasahlpa @vogelpi I've spent some time digging into this; here's what's happening:

So we see an initial escalate which causes the OTBN controller to move to a locked state, but otbn_start_stop_control is still doing its initial secure wipe as it's just come out of reset (the test injects a fault, then resets the DUT, then repeats), so the escalate doesn't affect anything immediately (other than setting should_lock_q). escalate_en_i going to X messes things up because of these lines:

https://github.com/lowRISC/opentitan/blob/aa66b4b46c0025229e100e6f79da487c4a9985e1/hw/ip/otbn/rtl/otbn_start_stop_control.sv#L360-L369

We have an unconditional move to the locked state when the escalate_en_i MUBI value is invalid. That is reasonable, but the problem is that when escalate_en_i is X we end up with an uncertain state!
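
Paraphrasing the linked lines (a sketch of the behaviour only, not the exact RTL; the enum value is the one mentioned later in this thread):

    // Paraphrased sketch: an invalid MUBI4 encoding on escalate_en_i sends the
    // FSM straight to the locked state. When escalate_en_i is X, this test
    // itself evaluates to X, so state_d (and hence the FSM state) goes to X.
    if (prim_mubi_pkg::mubi4_test_invalid(escalate_en_i)) begin
      state_d = OtbnStartStopStateLocked;
    end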

So we get an initial alert from the fault, which escalates to OTBN, which should lock, but it never actually signals lock because of the above: the otbn_start_stop_control state goes to X, which in particular means we never get the locking_o signal output.

The question now is what, if anything, should we fix in RTL (especially as we're into RTL freeze, though we could maybe slip in a small change)? There are a few things we could do:

  1. @vogelpi's suggested fix, which would ensure the data out of a FIFO isn't X under error conditions. This would prevent the whole chain of events occurring: we'd just get the alert and OTBN locking as expected. I think it's also low risk as it's not altering any control flows (the rvalid behaviour will not change). It would also be easy to remove via an ECO if we decided it has other negative impacts (and possibly easy to add as well, given we've already got logic in there that does the zeroing; we just need to add a new way to trigger it).
  2. Attempt to isolate the Xs within tlul_adapter_sram, e.g. ignoring data from the request FIFO if the request FIFO has an error, so we'd get 0 for dvalid in this case and no Xs going back into the request FIFO control.
  3. Add logic into the otbn_start_stop_control state machine that stops testing escalate_en_i for invalid values once we know we're locking anyway (e.g. because we've already seen a valid escalation request); see the sketch after this list.
  4. Not worry that Xs on escalate_en_i in this case cause our otbn_start_stop_control state to go to X. We know here that the escalation has been triggered and we will lock in any scenario with the real hardware (e.g. consider the possibilities if escalate_en_i was a different random, non-X value every cycle following the initial escalation where it's MUBI4True for a cycle).
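
To make option 3 concrete, a hypothetical sketch (not the actual otbn_start_stop_control RTL; should_lock_q is the signal mentioned above):

    // Hypothetical sketch for option 3: the same invalid-encoding check as in
    // the sketch above, but gated so it no longer fires once we already know
    // we're locking anyway, so an X on escalate_en_i can't smear the state.
    if (prim_mubi_pkg::mubi4_test_invalid(escalate_en_i) && !should_lock_q) begin
      state_d = OtbnStartStopStateLocked;
    end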

I think 2. is likely similar to @nasahlpa's initial suggested fix (https://github.com/lowRISC/opentitan/pull/23595), which was abandoned because it caused TLUL timeout issues under other scenarios (where a response that should have gone out gets killed), though perhaps by doing it in tlul_adapter_sram instead we could avoid those problems.

  1. should probably come with some extra testing to verify that once we've had at least one cycle where escalation is MUBI4True, we do eventually reach a fully locked state regardless of what escalate_en_i values we see.
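
The kind of property that such testing could target, as a hypothetical SVA sketch (as if bound into otbn_start_stop_control; state_q and the clock/reset names are assumptions, and this is not existing testbench code):

    // Hypothetical sketch: once escalate_en_i has been MuBi4True for at least
    // one cycle, the FSM must eventually reach the locked state, regardless of
    // what escalate_en_i does afterwards.
    EventuallyLocked_A: assert property (
      @(posedge clk_i) disable iff (!rst_ni)
      (escalate_en_i == prim_mubi_pkg::MuBi4True) |->
        s_eventually (state_q == OtbnStartStopStateLocked)
    );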

Ultimately I think it does come down to gaining confidence in the escalate_en_i behaviour. The problem is that if we get an actual invalid MUBI value we'll go to the locked state sooner, but we would have gone there anyway eventually. Due to the way the simulator handles Xs, though, we cannot confirm this (in effect we'd need the sim to fork and run one thread going down the state machine assuming we continue to have a valid escalate_en_i value and another assuming we continue to have an invalid escalate_en_i value, finding that both eventually end up with the state going to the same value, OtbnStartStopStateLocked, and thus resolving the state to that value).

My recommendation would be to consider @vogelpi's suggested fix for late inclusion, and to consider ways to test this escalate_en_i behaviour (perhaps we can do something hacky that allows us to force it to random values every cycle once we've seen it go to X, or similar?).
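
One hypothetical shape for that hack, as a sketch only (none of this exists in the testbench; the relative path and the forcing style are assumptions):

    // Hypothetical sketch: once escalate_en_i has been observed as X, keep
    // forcing it to a fresh random MUBI4 encoding every cycle, so the FSM has
    // to cope with arbitrary values after the initial escalation.
    bit forcing_active = 1'b0;
    always @(posedge clk_i) begin
      if ($isunknown(u_otbn_start_stop_control.escalate_en_i)) forcing_active <= 1'b1;
      if (forcing_active) begin
        force u_otbn_start_stop_control.escalate_en_i =
            prim_mubi_pkg::mubi4_t'($urandom_range(0, 15));
      end
    end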