Closed rswarbrick closed 3 months ago
I've put some effort into debugging this, but haven't got to the bottom of things yet. I've tried sprinkling "ignore this assertion" flags all over the place and I've got as far as... an alert coming out and no change in the STATUS register.
That seems to be an RTL bug, if I understand things correctly:
tb.dut.u_tlul_adapter_sram_dmem
as dmem_bus_intg_violation
I think this is a bug: I believe we should have the invariant that we will always go to the locked state after emitting a fatal error.
You can reproduce this by grabbing the otbn-sec-cm-debugging branch from my fork (https://github.com/rswarbrick/opentitan/tree/otbn-sec-cm-debugging) and running:
util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123
The debug logs contain this:
UVM_INFO @ 21955791 ps: (prim_count_if.sv:38) [tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.u_prim_count_if] Forcing tb.dut.u_tlul_adapter_sram_dmem.u_reqfifo.gen_normal_fifo.u_fifo_cnt.gen_secure_ptrs.u_rptr.cnt_q[0] from 0 to 1
UVM_INFO @ 21966317 ps: (cip_base_vseq__sec_cm_fi.svh:32) uvm_test_top.env.virtual_sequencer [uvm_test_top.env.virtual_sequencer.otbn_common_vseq] expected fatal alert is triggered for SecCmPrimCount
UVM_FATAL @ 64075580 ps: (otbn_scoreboard.sv:497) uvm_test_top.env.scoreboard [uvm_test_top.env.scoreboard] A fatal alert arrived 4000 cycles ago and we still don't think it should have done.
UVM_INFO @ 64075580 ps: (uvm_report_catcher.svh:705) [UVM/REPORT/CATCHER]
and the waves support the analysis I wrote above.
@GregAC: Would you mind taking a look at this? I'm rather hoping that I've misunderstood a bit of the OTBN security model.
Thanks for raising this and your effort debugging this!
I've only added the .Secure(1) option to u_reqfifo and u_sramreqfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L526
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L550
Before my change, this option was set only for u_rspfifo:
https://github.com/lowRISC/opentitan/blob/574ece386f821fa4bba1b390b859c2c32043845b/hw/ip/tlul/rtl/tlul_adapter_sram.sv#L576
As the issue you are seeing also affects other modules (cf. #23567), I also tried debugging this, but I could not find a solution. I wonder why it seems to work for the u_rspfifo FIFO but not for the other FIFOs.
In the worst case, we need to revert #23515, but from a security point of view it would be better to have the Secure option enabled for these FIFOs as well. WDYT @vogelpi?
I am investigating this.
@nasahlpa will now attempt to fix this. It would be better to keep #23515.
I've now discussed this with Pascal and I believe there is a very easy way to fix this: inside tlul_adapter_sram.sv, where the issue originates, we use multiple prim_sync_fifo primitives with the OutputZeroIfEmpty parameter enabled (the default). During these error tests, a fault is injected that makes the FIFO think it is non-empty, so it no longer outputs 0 (as per the parameter) but X.
My suggestion is to change the RTL of the primitive as follows:
    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end
to
    if (OutputZeroIfEmpty == 1'b1) begin : gen_output_zero
      assign rdata_o = empty || err_o ? Width'(0) : rdata_int;
    end else begin : gen_no_output_zero
      assign rdata_o = rdata_int;
    end
meaning that in the error case we output 0 instead of X. This affects multiple primitives in the design, but since the error signal also factors into a fatal alert, outputting a known deterministic value adds no security risk. This will solve the DV issues in multiple blocks that were introduced by turning on the hardening of these FIFOs.
Thanks for this suggestion. However, this does not quite work:
With this RTL change, we would output 0 when err_o is raised by u_fifo_cnt while still setting the FIFO's rvalid_o. The TL-UL adapter therefore assumes that this data is valid and sends the 0 back over the bus, causing several issues (a TL-UL exception because, e.g., the size/user fields do not match the expected values).
Instead, we should change:
    assign rvalid_o = ~empty & ~under_rst;
to
    assign rvalid_o = ~empty & ~under_rst & ~err_o;
in addition to the 0 output on err_o as suggested.
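To make the difference between the variants discussed here concrete, here is a minimal Python sketch of the FIFO's read-side outputs using three-valued logic (0, 1, X). It is a toy model, not the primitive's RTL: the function fifo_read_side and its two flags zero_on_err and gate_valid_on_err are hypothetical names introduced for illustration, and the X handling only approximates what a 4-state simulator does.

```python
X = "x"  # unknown value, as shown by a 4-state simulator


def not3(a):
    """Logical NOT with X propagation."""
    return X if a == X else 1 - a


def and3(*vals):
    """Logical AND: any 0 dominates, otherwise any X makes the result X."""
    if 0 in vals:
        return 0
    return X if X in vals else 1


def or3(*vals):
    """Logical OR: any 1 dominates, otherwise any X makes the result X."""
    if 1 in vals:
        return 1
    return X if X in vals else 0


def mux3(sel, a, b):
    """sel ? a : b. With an X select the result is known only if a == b."""
    if sel == X:
        return a if a == b else X
    return a if sel == 1 else b


def fifo_read_side(empty, under_rst, err, rdata_int,
                   zero_on_err=False, gate_valid_on_err=False):
    """Return (rvalid_o, rdata_o) for a toy FIFO with OutputZeroIfEmpty set.

    zero_on_err models 'empty || err_o ? 0 : rdata_int' (hypothetical flag).
    gate_valid_on_err models '& ~err_o' on rvalid_o (hypothetical flag).
    """
    sel = or3(empty, err) if zero_on_err else empty
    rdata = mux3(sel, 0, rdata_int)
    valid_terms = [not3(empty), not3(under_rst)]
    if gate_valid_on_err:
        valid_terms.append(not3(err))
    return and3(*valid_terms), rdata


# Injected fault: the FIFO thinks it is non-empty, the counter error is
# flagged, and the storage entry being read was never written (X).
fault = dict(empty=0, under_rst=0, err=1, rdata_int=X)
original = fifo_read_side(**fault)
zero_only = fifo_read_side(**fault, zero_on_err=True)
both = fifo_read_side(**fault, zero_on_err=True, gate_valid_on_err=True)
```

In this model the unmodified logic presents X data as valid, zeroing alone presents the value 0 as valid (which is what upsets the TL-UL adapter), and the two changes together present nothing as valid.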
@rswarbrick @nasahlpa @vogelpi spent some time digging into this, here's what's happening:
- The injected fault causes rvalid_o to be raised, but crucially the data the FIFO outputs is X (as the read pointer counter points at a FIFO entry that hasn't had valid data written into it).
- The X data together with rvalid_o results in Xs going back to the request FIFO's rready_i input:
  - tlul_adapter_sram produces its dvalid for the TileLink response based upon the output of the request FIFO. Specifically, if the request FIFO output data has its error field set, we want an immediate response (here the error means some issue with the incoming TL request, e.g. it fails a TL-UL protocol check, or it's trying to do a byte write and byte writes aren't allowed).
  - dvalid factors into the request FIFO rready_i.
- An X on rready_i means the internal counters go to X (they don't know if they're changing or not).
- The Xs propagate to escalate_en_i in the OTBN core, which goes to X.
- escalate_en_i being X means the state of otbn_start_stop_control goes to X (the state machine in otbn_controller has already gone to locked due to the initial error).
So we see an initial escalation that causes the OTBN controller to move to a locked state, but otbn_start_stop_control is still doing its initial secure wipe as it has just come out of reset (the test injects a fault, then resets the DUT, then repeats), so the escalation doesn't affect anything immediately (other than setting should_lock_q). escalate_en_i going to X messes things up because of these lines:
We have an unconditional move to the locked state when the escalate_en_i MUBI value is invalid. This is reasonable, but the problem is that when escalate_en_i is X we end up with an uncertain state!
So we get an initial alert from the fault, which escalates to OTBN, which should lock. But OTBN never actually signals the lock because of the above: the otbn_start_stop_control state goes to X, which in particular means we never get the locking_o signal output.
The question now is what, if anything, we should fix in RTL (especially as we're into RTL freeze, but could maybe slip in a small change). There are a few things we could do:
1. Apply @vogelpi's suggested fix of zeroing the FIFO output on error (the rvalid behaviour will not change). It would also be easy to remove via an ECO if we decided it has other negative impacts (and possibly easy to add as well: we've already got logic in there that does the zeroing, so we just need to add a new way to trigger it).
2. Change tlul_adapter_sram, e.g. ignoring data from the request FIFO if the request FIFO has an error, so we'd get 0 for dvalid in this case and no Xs going back into the request FIFO control.
3. Change the otbn_start_stop_control state machine so that it stops testing escalate_en_i for invalid values once we know we're locking anyway (e.g. because we've already seen a valid escalation request).
4. Accept that an X on escalate_en_i in this case causes our otbn_start_stop_control state to go to X. We know here that the escalation has been triggered and that we will lock in any scenario with the real hardware (e.g. consider the possibilities if escalate_en_i were a different random, non-X value every cycle following the initial escalation, where it's MUBI4True for a cycle).
I think 2. is likely similar to @nasahlpa's initial suggested fix (https://github.com/lowRISC/opentitan/pull/23595), which was abandoned because it caused TLUL timeout issues in other scenarios (where a response that should have gone out gets killed), though perhaps by doing it in tlul_adapter_sram instead we could avoid those problems.
We need to be confident that, once we've seen MUBI4True, we do eventually reach a fully locked state regardless of what escalate_en_i values we see.
Ultimately I think it does come down to gaining confidence in the escalate_en_i behaviour. If we get an actual invalid MUBI value we'll go to the locked state sooner, but we would have gone there anyway eventually. Because of the way the simulator handles Xs, though, we cannot confirm this (in effect we'd need the sim to fork and run one thread going down the state machine assuming we continue to have a valid escalate_en_i value and the other assuming we continue to have an invalid escalate_en_i value, finding that they both eventually end up with the state going to the same value, OtbnStartStopStateLocked, and thus eventually resolving the state to that value).
My recommendation would be to consider @vogelpi's suggested fix for late inclusion, and to consider ways to test this escalate_en_i behaviour (perhaps we can do something hacky that allows us to force it to random values every cycle once we've seen it go X, or similar?).
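The "fork the simulation" idea can be sketched outside the simulator by tracking the set of states a state machine could be in, rather than collapsing everything to a single X. The FSM below is a deliberately simplified, hypothetical stand-in for otbn_start_stop_control: the state names, WIPE_CYCLES, step and possible_states are assumptions made for illustration and are not taken from the RTL, and the 4-bit encodings 0x6 and 0x9 are assumed to match OpenTitan's MuBi4True and MuBi4False.

```python
MUBI4_TRUE = 0x6    # assumed encoding of MuBi4True
MUBI4_FALSE = 0x9   # assumed encoding of MuBi4False
WIPE_CYCLES = 3     # hypothetical length of the initial secure wipe

# A state is (name, wipe cycles left, should_lock flag).
LOCKED = ("LOCKED", 0, True)


def step(state, esc):
    """One clock cycle of the toy FSM for a concrete escalate_en_i value."""
    name, wipe_left, should_lock = state
    if name == "LOCKED":
        return LOCKED
    if esc not in (MUBI4_TRUE, MUBI4_FALSE):
        return LOCKED  # invalid encoding: lock unconditionally
    should_lock = should_lock or esc == MUBI4_TRUE
    if name == "WIPE" and wipe_left > 1:
        return ("WIPE", wipe_left - 1, should_lock)  # wipe still running
    # Wipe finished (or idle): act on the remembered escalation.
    return LOCKED if should_lock else ("IDLE", 0, False)


def possible_states(start, cycles):
    """States reachable if escalate_en_i may take any 4-bit value each cycle."""
    states = {start}
    for _ in range(cycles):
        states = {step(s, esc) for s in states for esc in range(16)}
    return states


# Mid-wipe, with a valid escalation already recorded in should_lock.
start = ("WIPE", WIPE_CYCLES, True)
after_one = possible_states(start, 1)
after_wipe = possible_states(start, WIPE_CYCLES)
```

In this toy model the set of possible states has more than one member while the wipe is running, but it collapses to the locked state once the wipe completes, whatever values escalate_en_i takes. That only says something about the real design once the real transition function is substituted for step.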
Description
A bit of bisection shows that this error was caused by:
574ece386f * [flash_ctrl/rtl] Enable SecFifoPtr for TL-UL FIFOs
To run a test:
util/dvsim/dvsim.py hw/ip/otbn/dv/uvm/otbn_sim_cfg.hjson -i otbn_sec_cm --fixed-seed=123
which will fail with a message like:
Taking a quick look at a log, this error appears just after:
so presumably we're telling a FIFO that it has entries, but they actually turn out to be 'X.
My guess is that the assertion needs turning off in this situation with an $assertoff call.
@nasahlpa: I think this change is yours, and I notice that you've tweaked the TLUL adapter. Does my analysis above sound plausible to you?