[rtl] Are design guidlines for use of `prim_lc_sync` correct?

The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that prim_lc_sync (https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.

With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of prim_lc_sync. The lc_rma_req signal switches from off to on. The randomized delays in prim_lc_sync (due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.

Is this assertion in the life cycle controller documentation sufficient here:

However, since the ON / OFF encoding above has been chosen such that all bits toggle exactly once for a transition from OFF to ON (and vice-versa), all that needs to be done is guard against metastability using a two-stage synchronizer, as illustrated below

Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).

Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).

P0 as this could be a major issue with OTBN (and maybe other blocks that do similar things), but it may be we're happy to guarantee we won't see invalid values out of a prim_lc_sync under these circumstances and there's no issue here.

@msfschaffner @matutem @moidx any thoughts?

The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that prim_lc_sync (https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.

With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of prim_lc_sync. The lc_rma_req signal switches from off to on. The randomized delays in prim_lc_sync (due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.

Is this assertion in the life cycle controller documentation sufficient here:

However, since the ON / OFF encoding above has been chosen such that all bits toggle exactly once for a transition from OFF to ON (and vice-versa), all that needs to be done is guard against metastability using a two-stage synchronizer, as illustrated below

Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).

Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).

Sounds like we've got some conflicting designs here. The synchronization scheme only works if the signaling portion of the doc is followed:

For all life cycle signals except the escalation signal ESCALATE_EN, all values different from ON must be interpreted as OFF in RTL. In case of ESCALATE_EN, all values different from OFF must be interpreted as ON in RTL.

It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.

Edit: Actually, if ESCALATE_EN triggers something when it goes to ON, that isn't going to be safe with just a prim_flop_2sync. Intermediate values will be bad.

It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.

Yes it does seem this way. So this is an OTBN RTL bug, potentially a critical one if we end up hitting the invalid signal case in normal power on.

We should review all uses of prim_lc_sync to see if there are other potential issues (especially in light of @a-will's edit)

The saving grace here may be how lifecycle controller operates (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#power-up-sequence)

Once the broadcast is complete and signals stable, modules held under reset by “sys_rst_n” are then released from reset to begin mission mode operations (this includes the processor). Note this point is also when it is safe for DFT to commence operations, as DFT functions may be blocked until life cycle completes its broadcast.

The signals from lifecycle controller are effectively static as they're set at power on and nothing else will come out of reset til they're hopefully stable.

Also note (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#normal-operation)

Once the life cycle system is powered up and stable, its outputs remain static unless specifically requested to change or affected by security escalation.

So it may be we only get issues around life cycle transitions and we can reset to bring the system back to a usable state.

Do we have any scenarios where we change life cycle and we explicitly cannot reset?

Ah it seems on a transition a reboot is explicitly required (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#post-transition-handling)

After a transition request, whether it was unconditional or conditional, the life cycle controller always disables all of its decoded outputs and puts the system in an inert state. The device is then expected to reboot before returning to a functional state.

This may all be fine then, indeed you could even argue the synchronisers aren't needed at all!

Thanks @GregAC and @a-will for digging into this.

Let me add a bit of context: The simpler synchronization approach has been chosen since it basically avoids the addition of a multiplexer and additional flop stage to perform a consistency check for outputting a "rectified" value - which could introduce additional vulnerabilities, depending on how this is implemented. This scheme only works since the lc signals are quasi-static and only encode two states (ON, OFF). However, we need to follow the design guidance in the consuming blocks, since stagger is expected after CDCs.

I.e., we should only ever test for two sets of values, depending on the intent of the signal, as @a-will has pointed out. E.g. for enabling critical functionality, we would test for == On and != On, whereas for lc_escalate we would test for == Off and != Off.

Now, we do have applications where you want to test for invalid multibit encodings between On and Off - but this should only be done in designs where all sources of the signals are completely synchronous and in the same clock domain. Because of that, it is typically something we should only do with MuBi* signals (if at all), but not with lifecycle signals.

I realize that OTBN converts several life cycle signals to MuBi* signals, which may be a source of confusion.

@GregAC is correct that this should not be an issue for most cases due to the reset sequencing of the system. Also, the behavior after performing a life cycle transition is largely irrelevant since we are going to reset the system anyways.

However, it looks like we still need to devise a fix for this due to the following reasons:

the RMA req signal is used to wipe the system before entering RMA. if OTBN gets stuck and never ack's back to the LC_CTRL, RMA entry may not be possible, because the life cycle controller does not perform the transition until the ack has been received. at the same time every transition attempt increments the OTP transition counter, so the amount of RMA entry attempts is limited.
volatile RAW unlock (for debug purposes) may cause issues, since that one does not follow the reset sequencing that @GregAC has pointed out above. this may not be a big issue, though, since volatile unlock is a debug feature for ES only, and it is mainly used to gain access to RV_DM to do some further system inspection if the real RAW unlock transition does not work due to an OTP defect (or similar).
the reset sequencing has unfortunately been changed as part of the reset domain reorganization before M2 last year in #15602. this is likely an oversight in the review. the system still works, but it violates the assumption RE life cycle signals that @GregAC cited above.

I think we should try evaluating fixes on multiple fronts:

@GregAC I see that OTBN only has two life cycle signal inputs: lc_escalate_en and lc_rma_req. I think it is fine to not fix lc_escalate_en, since that signal will only change it's value upon escalation. the reaction to an invalid encoding is similar to escalation itself. hence, can you try to assess how big an RTL fix for lc_rma_req would be?
in the meantime, I will look at the rstmgr FSM and assess whether a fix to the reset sequence is necessary.
I will also take a look at whether other blocks exhibit the same behavior. I don't think so - but better to check.

Update after further investigations:

The reset sequencing change in rstmgr is consistent with what has been proposed by @tjaychen on https://github.com/lowRISC/opentitan/issues/15317 (slides): the lc reset domain now encompasses all system logic, except for the rv_dm and the pinmux strap sampling logic, which are driven by the sys reset. With this simplified architecture, it is not possible to sequence the resets as they have been before (lc before sys) since most blocks get reset with the lc reset anyways. This is OK IFF the design guidance for consuming LC signals is followed. Hence, no change needed in the rstmgr.
@moidx and I audited the use of LC signals in the system and determined that only the lc_escalate_en and lc_rma_req signals have an invalid encoding check that could lead to spurious local escalation, see https://docs.google.com/spreadsheets/d/1I4HU5YdVuwhpiGj9_RzrORpEJSWkyzHhdRFYLfoeyKk/edit#gid=0
Regarding the potential CDC issues on lc_escalate_en we concluded that this does not have to be fixed since that signal gets reset to OFF, and it stays at OFF unless an escalation event occurs. In that case, the local state wiping mechanism might not complete correctly in the presence of CDC stagger - but for ES we think this is acceptable.
Regarding the potential CDC issues on lc_rma_req, we would like to ensure that RMA entry is still possible in a reliable way. The signal is reset to OFF and it stays at OFF unless RMA entry is requested. A potential ECO that avoids CDC stagger from preventing RMA entry would be a wiring change where the request input signal is connected to the acknowledgement output signal https://github.com/lowRISC/opentitan/pull/19047. However, after further investigations we determined that this is not needed, since the RMA request signal is driven synchronously to otbn inside flash_ctrl. I.e., the RMA request signal is daisy chained through these two blocks and both have the same clock root). Hence, the potential CDC issue found at the block-level is not problematic for the Earlgrey configuration.

https://github.com/lowRISC/opentitan/blob/081f7a2b0b1304d5b00e72a0f8406b8094819a8a/hw/top_earlgrey/data/top_earlgrey.hjson#L669-L675

https://github.com/lowRISC/opentitan/blob/081f7a2b0b1304d5b00e72a0f8406b8094819a8a/hw/top_earlgrey/data/top_earlgrey.hjson#L596-L598

The action items are hence as follows:

[x] File issues to fix the usage of LC signals in OTBN https://github.com/lowRISC/opentitan/issues/19050
[x] Update the LC documentation to account for the reset sequencing change #19052
[x] Double check with @arnonsha that the RMA request signal path from flash_ctrl to otbn is actually synchronous.

The CDC path between flash_ctrl and otbn has been checked in prime time and is confirmed to be a synchronous path. The path is therefore not of concern for the ES tapeout.

Since the remaining action items have been spun out into seprate issues, this issue can be closed.

lowRISC / opentitan

[rtl] Are design guidlines for use of `prim_lc_sync` correct? #19044