Closed GregAC closed 1 year ago
The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that
prim_lc_sync
(https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of
prim_lc_sync
. The lc_rma_req signal switches from off to on. The randomized delays inprim_lc_sync
(due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.Is this assertion in the life cycle controller documentation sufficient here:
However, since the ON / OFF encoding above has been chosen such that all bits toggle exactly once for a transition from OFF to ON (and vice-versa), all that needs to be done is guard against metastability using a two-stage synchronizer, as illustrated below
Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).
Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).
Sounds like we've got some conflicting designs here. The synchronization scheme only works if the signaling portion of the doc is followed:
For all life cycle signals except the escalation signal ESCALATE_EN, all values different from ON must be interpreted as OFF in RTL. In case of ESCALATE_EN, all values different from OFF must be interpreted as ON in RTL.
It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.
Edit: Actually, if ESCALATE_EN triggers something when it goes to ON, that isn't going to be safe with just a prim_flop_2sync. Intermediate values will be bad.
It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.
Yes it does seem this way. So this is an OTBN RTL bug, potentially a critical one if we end up hitting the invalid signal case in normal power on.
We should review all uses of prim_lc_sync
to see if there are other potential issues (especially in light of @a-will's edit)
The saving grace here may be how lifecycle controller operates (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#power-up-sequence)
Once the broadcast is complete and signals stable, modules held under reset by “sys_rst_n” are then released from reset to begin mission mode operations (this includes the processor). Note this point is also when it is safe for DFT to commence operations, as DFT functions may be blocked until life cycle completes its broadcast.
The signals from lifecycle controller are effectively static as they're set at power on and nothing else will come out of reset til they're hopefully stable.
Also note (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#normal-operation)
Once the life cycle system is powered up and stable, its outputs remain static unless specifically requested to change or affected by security escalation.
So it may be we only get issues around life cycle transitions and we can reset to bring the system back to a usable state.
Do we have any scenarios where we change life cycle and we explicitly cannot reset?
Ah it seems on a transition a reboot is explicitly required (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#post-transition-handling)
After a transition request, whether it was unconditional or conditional, the life cycle controller always disables all of its decoded outputs and puts the system in an inert state. The device is then expected to reboot before returning to a functional state.
This may all be fine then, indeed you could even argue the synchronisers aren't needed at all!
Thanks @GregAC and @a-will for digging into this.
Let me add a bit of context: The simpler synchronization approach has been chosen since it basically avoids the addition of a multiplexer and additional flop stage to perform a consistency check for outputting a "rectified" value - which could introduce additional vulnerabilities, depending on how this is implemented. This scheme only works since the lc signals are quasi-static and only encode two states (ON, OFF). However, we need to follow the design guidance in the consuming blocks, since stagger is expected after CDCs.
I.e., we should only ever test for two sets of values, depending on the intent of the signal, as @a-will has pointed out. E.g. for enabling critical functionality, we would test for == On
and != On
, whereas for lc_escalate
we would test for == Off
and != Off
.
Now, we do have applications where you want to test for invalid multibit encodings between On
and Off
- but this should only be done in designs where all sources of the signals are completely synchronous and in the same clock domain. Because of that, it is typically something we should only do with MuBi*
signals (if at all), but not with lifecycle signals.
I realize that OTBN converts several life cycle signals to MuBi*
signals, which may be a source of confusion.
@GregAC is correct that this should not be an issue for most cases due to the reset sequencing of the system. Also, the behavior after performing a life cycle transition is largely irrelevant since we are going to reset the system anyways.
However, it looks like we still need to devise a fix for this due to the following reasons:
I think we should try evaluating fixes on multiple fronts:
lc_escalate_en
and lc_rma_req
. I think it is fine to not fix lc_escalate_en
, since that signal will only change it's value upon escalation. the reaction to an invalid encoding is similar to escalation itself. hence, can you try to assess how big an RTL fix for lc_rma_req
would be?rstmgr
FSM and assess whether a fix to the reset sequence is necessary.Update after further investigations:
rstmgr
is consistent with what has been proposed by @tjaychen on https://github.com/lowRISC/opentitan/issues/15317 (slides): the lc reset domain now encompasses all system logic, except for the rv_dm
and the pinmux
strap sampling logic, which are driven by the sys reset. With this simplified architecture, it is not possible to sequence the resets as they have been before (lc before sys) since most blocks get reset with the lc reset anyways. This is OK IFF the design guidance for consuming LC signals is followed. Hence, no change needed in the rstmgr
.lc_escalate_en
and lc_rma_req
signals have an invalid encoding check that could lead to spurious local escalation, see https://docs.google.com/spreadsheets/d/1I4HU5YdVuwhpiGj9_RzrORpEJSWkyzHhdRFYLfoeyKk/edit#gid=0lc_escalate_en
we concluded that this does not have to be fixed since that signal gets reset to OFF
, and it stays at OFF
unless an escalation event occurs. In that case, the local state wiping mechanism might not complete correctly in the presence of CDC stagger - but for ES we think this is acceptable.lc_rma_req
, we would like to ensure that RMA entry is still possible in a reliable way. The signal is reset to OFF
and it stays at OFF
unless RMA entry is requested. A potential ECO that avoids CDC stagger from preventing RMA entry would be a wiring change where the request input signal is connected to the acknowledgement output signal https://github.com/lowRISC/opentitan/pull/19047. However, after further investigations we determined that this is not needed, since the RMA request signal is driven synchronously to otbn
inside flash_ctrl
. I.e., the RMA request signal is daisy chained through these two blocks and both have the same clock root). Hence, the potential CDC issue found at the block-level is not problematic for the Earlgrey configuration.The action items are hence as follows:
flash_ctrl
to otbn
is actually synchronous.The CDC path between flash_ctrl
and otbn
has been checked in prime time and is confirmed to be a synchronous path.
The path is therefore not of concern for the ES tapeout.
Since the remaining action items have been spun out into seprate issues, this issue can be closed.
The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that
prim_lc_sync
(https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of
prim_lc_sync
. The lc_rma_req signal switches from off to on. The randomized delays inprim_lc_sync
(due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.Is this assertion in the life cycle controller documentation sufficient here:
Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).
Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).
P0 as this could be a major issue with OTBN (and maybe other blocks that do similar things), but it may be we're happy to guarantee we won't see invalid values out of a
prim_lc_sync
under these circumstances and there's no issue here.@msfschaffner @matutem @moidx any thoughts?