lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[otbn,dv] The `otbn_sec_cm` test is broken for all seeds #23578

Closed: rswarbrick closed this issue 3 months ago

rswarbrick commented 3 months ago

Description

A bit of bisection shows that this error was caused by:

574ece386f * [flash_ctrl/rtl] Enable SecFifoPtr for TL-UL FIFOs

To run a test: `util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123`. This will fail with a message like:

* `xmsim: *E,ASRTST (/home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/default/src/lowrisc_prim_fifo_*/rtl/prim_fifo_sync.sv,151): Assertion DataKnown_A has failed` has 1 failures:
    * Test otbn_sec_cm has 1 failures.
        * 0.otbn_sec_cm.123\
          Line 150, in log /home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/0.otbn_sec_cm/latest/run.log

                xmsim: *E,ASRTST (/home/rjs/work/opentitan/scratch/tmp/otbn-sim-xcelium/default/src/lowrisc_prim_fifo_0/rtl/prim_fifo_sync.sv,151): (time 1014314 PS) Assertion tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.DataKnown_A has failed
                UVM_ERROR @   1014314 ps: (prim_fifo_sync.sv:151) [ASSERT FAILED] DataKnown_A
                UVM_INFO @   1014314 ps: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
                --- UVM Report catcher Summary ---

Taking a quick look at a log, this error appears just after:

UVM_INFO @   1009051 ps: (prim_count_if.sv:38) [tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.u_prim_count_if] Forcing tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.cnt_q[0] from 0 to 1

so presumably we're telling a FIFO that it has entries, but they actually turn out to be 'X.

My guess is that the assertion needs turning off in this situation with an $assertoff call.
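
For reference, the sort of thing I mean, as a sketch only (the hierarchical path is taken from the failure log above; exactly where to hook this into the testbench is an open question):

    // Sketch: temporarily disable assertions in the affected FIFO instance
    // (path from the failure log) while the fault is injected, then re-enable.
    $assertoff(0, tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo);
    // ... inject the fault and let the alert propagate ...
    $asserton(0, tb.dut.u_tlul_adapter_sram_imem.u_sramreqfifo);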

@nasahlpa: I think this change is yours, and I notice that you've tweaked the TLUL adapter. Does my analysis above sound plausible to you?

rswarbrick commented 3 months ago

I've put some effort into debugging this, but haven't got to the bottom of things yet. I've tried sprinkling "ignore this assertion" flags all over the place and I've got as far as... an alert coming out and no change in the STATUS register.

That seems to be an RTL bug, if I understand things correctly:

I believe we should have the invariant that we always go to the locked state after emitting a fatal error.

You can reproduce this by grabbing the otbn-sec-cm-debugging branch from my fork (https://github.com/rswarbrick/opentitan/tree/otbn-sec-cm-debugging) and running:

util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123

The debug logs contain this:

UVM_INFO @  21955791 ps: (prim_count_if.sv:38) [tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.u_prim_count_if] Forcing tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.cnt_q[0] from 0 to 1
UVM_INFO @  21966317 ps: (cip_base_vseq__sec_cm_fi.svh:32) uvm_test_top.env.virtual_sequencer [uvm_test_top.env.virtual_sequencer.otbn_common_vseq] expected fatal alert is triggered for SecCmPrimCount
UVM_FATAL @  64075580 ps: (otbn_scoreboard.sv:497) uvm_test_top.env.scoreboard [uvm_test_top.env.scoreboard] A fatal alert arrived 4000 cycles ago and we still don't think it should have done.
UVM_INFO @  64075580 ps: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER] 

and the waves support the analysis I wrote above.

rswarbrick commented 3 months ago

@GregAC: Would you mind taking a look at this? I'm rather hoping that I've misunderstood a bit of the OTBN security model.

nasahlpa commented 3 months ago

Thanks for raising this and for your effort debugging it!

I've only added the .Secure(1) option to the u_reqfifo and u_sramreqfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L526
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L550
Before my change, this option was only set for the u_rspfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L576

As the issue you are seeing also affects other modules (cf. #23567), I also tried debugging this but could not find a solution. I wonder why it seems to work for the u_rspfifo but not for the other FIFOs.

In the worst case we need to revert #23515, but from a security point of view it would be better to have the Secure option also enabled for these FIFOs. WDYT @vogelpi?

vogelpi commented 3 months ago

I am investigating this.

vogelpi commented 3 months ago

@nasahlpa will now attempt to fix this. It would be better to keep #23515.

vogelpi commented 3 months ago

I've now discussed this with Pascal and I believe there is a very easy way to fix this: inside tlul_adapter_sram.sv, where the issue originates, we use multiple prim_fifo_sync primitives with the OutputZeroIfEmpty parameter enabled (the default). What happens during these error tests is that a fault is injected which makes the FIFO think it is non-empty; as a result, it no longer outputs 0 (as per the parameter) but X.

My suggestion is to change the RTL of the primitive such as follows:

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end

to

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty || err_o ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end

meaning that in the error case we output 0 instead of X. This affects multiple primitives in the design, but since the error signal also factors into a fatal alert, outputting a known deterministic value does not add a security risk. This will solve DV issues in multiple blocks that were introduced by turning on the hardening of these FIFOs.

nasahlpa commented 3 months ago

Thanks for this suggestion. However, this does not quite work:

With this RTL change, we would output a 0 when err_o is raised by the u_fifo_cnt, but rvalid_o of the FIFO would still be set. This means the TL-UL adapter assumes that this data is valid and sends the 0 back over the bus, causing several issues (a TL-UL exception, as e.g. the size/user fields do not match the expected values).

Instead, we should change:

    assign rvalid_o = ~empty & ~under_rst;

to

    assign rvalid_o = ~empty & ~under_rst & ~err_o;

in addition to the 0 output on err_o as suggested above.
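
Putting the two edits together, a rough sketch of how this region of prim_fifo_sync.sv might then look (surrounding context is assumed, not copied from the actual file):

    // Sketch only: combines the two suggested edits, assuming err_o is the
    // error output of the hardened pointer/counter logic in this FIFO.
    assign rvalid_o = ~empty & ~under_rst & ~err_o;

    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      // Also squash the read data when an error has been detected, so X data
      // from a faulted FIFO never propagates to the adapter.
      assign rdata_o = (empty || err_o) ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end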

GregAC commented 3 months ago

@rswarbrick @nasahlpa @vogelpi I've spent some time digging into this; here's what's happening:

So we see an initial escalate which causes the OTBN controller to move to a locked state, but otbn_start_stop_control is still doing its initial secure wipe as it's just come out of reset (the test injects a fault, then resets the DUT, then repeats), so the escalate doesn't affect anything immediately (other than setting should_lock_q). escalate_en_i going to X messes things up because of these lines:

https://github.com/lowRISC/opentitan/blob/aa66b4b46c0025229e100e6f79da487c4a9985e1/hw/ip/otbn/rtl/otbn_start_stop_control.sv#L360-L369

We have an unconditional move to the locked state when the escalate_en_i MUBI value is invalid. That is reasonable, but the problem is that when escalate_en_i is X we end up with an uncertain state!
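
Paraphrasing the linked lines (a sketch of the behaviour only, not the exact RTL; the enum value is the one mentioned later in this thread):

    // Paraphrased sketch: an invalid MUBI4 encoding on escalate_en_i sends the
    // FSM straight to the locked state. When escalate_en_i is X, this test
    // itself evaluates to X, so state_d (and hence the FSM state) goes to X.
    if (prim_mubi_pkg::mubi4_test_invalid(escalate_en_i)) begin
      state_d = OtbnStartStopStateLocked;
    end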

So we get an initial alert from the fault, which escalates to OTBN, which should lock, but it never actually signals lock because of the above: the otbn_start_stop_control state goes to X, which in particular means we never get the locking_o signal output.

The question now is what, if anything, should we fix in RTL (especially as we're into RTL freeze, though we could maybe slip in a small change)? There are a few things we could do:

  1. @vogelpi's suggested fix, which would ensure the data out of a FIFO isn't X under error conditions. This would prevent the whole chain of events occurring: we'd just get the alert and OTBN locking as expected. I think it's also low risk as it's not altering any control flows (the rvalid behaviour will not change). It would also be easy to remove via an ECO if we decided it has other negative impacts (and possibly easy to add as well, given we've already got logic in there that does the zeroing; we just need to add a new way to trigger it).
  2. Attempt to isolate the Xs within tlul_adapter_sram, e.g. ignoring data from the request FIFO if the request FIFO has an error, so we'd get 0 for dvalid in this case and no Xs going back into the request FIFO control.
  3. Add logic into the otbn_start_stop_control state machine that stops testing escalate_en_i for invalid values once we know we're locking anyway (e.g. because we've already seen a valid escalation request); see the sketch after this list.
  4. Not worry that Xs on escalate_en_i in this case cause our otbn_start_stop_control state to go to X. We know here that the escalation has been triggered and we will lock in any scenario with the real hardware (e.g. consider the possibilities if escalate_en_i was a different random, non-X value every cycle following the initial escalation where it's MUBI4True for a cycle).
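
To make option 3 concrete, a hypothetical sketch (not the actual otbn_start_stop_control RTL; should_lock_q is the signal mentioned above):

    // Hypothetical sketch for option 3: the same invalid-encoding check as in
    // the sketch above, but gated so it no longer fires once we already know
    // we're locking anyway, so an X on escalate_en_i can't smear the state.
    if (prim_mubi_pkg::mubi4_test_invalid(escalate_en_i) && !should_lock_q) begin
      state_d = OtbnStartStopStateLocked;
    end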

I think 2. is likely similar to @nasahlpa's initial suggested fix (https://github.com/lowRISC/opentitan/pull/23595), which was abandoned because it caused TLUL timeout issues under other scenarios (where a response that should have gone out gets killed), though perhaps by doing it in tlul_adapter_sram instead we could avoid those problems.

  1. should probably come with some extra testing to verify that once we've had at least one cycle where escalation is MUBI4True, we do eventually reach a fully locked state regardless of what escalate_en_i values we see.
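
The kind of property that such testing could target, as a hypothetical SVA sketch (as if bound into otbn_start_stop_control; state_q and the clock/reset names are assumptions, and this is not existing testbench code):

    // Hypothetical sketch: once escalate_en_i has been MuBi4True for at least
    // one cycle, the FSM must eventually reach the locked state, regardless of
    // what escalate_en_i does afterwards.
    EventuallyLocked_A: assert property (
      @(posedge clk_i) disable iff (!rst_ni)
      (escalate_en_i == prim_mubi_pkg::MuBi4True) |->
        s_eventually (state_q == OtbnStartStopStateLocked)
    );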

Ultimately I think it does come down to gaining confidence in the escalate_en_i behaviour. The problem is that if we get an actual invalid MUBI value we'll go to the locked state sooner, but we would have gone there anyway eventually. Due to the way the simulator handles Xs, though, we cannot confirm this (in effect we'd need the sim to fork and run one thread going down the state machine assuming we continue to have a valid escalate_en_i value and another assuming we continue to have an invalid escalate_en_i value, finding that both eventually end up with the state going to the same value, OtbnStartStopStateLocked, and thus resolving the state to that value).

My recommendation would be to consider @vogelpi's suggested fix for late inclusion, and to consider ways to test this escalate_en_i behaviour (perhaps we can do something hacky that allows us to force it to random values every cycle once we've seen it go to X, or similar?).
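
One hypothetical shape for that hack, as a sketch only (none of this exists in the testbench; the relative path and the forcing style are assumptions):

    // Hypothetical sketch: once escalate_en_i has been observed as X, keep
    // forcing it to a fresh random MUBI4 encoding every cycle, so the FSM has
    // to cope with arbitrary values after the initial escalation.
    bit forcing_active = 1'b0;
    always @(posedge clk_i) begin
      if ($isunknown(u_otbn_start_stop_control.escalate_en_i)) forcing_active <= 1'b1;
      if (forcing_active) begin
        force u_otbn_start_stop_control.escalate_en_i =
            prim_mubi_pkg::mubi4_t'($urandom_range(0, 15));
      end
    end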