lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.51k stars 746 forks source link

[rtl] Are design guidlines for use of `prim_lc_sync` correct? #19044

Closed GregAC closed 1 year ago

GregAC commented 1 year ago

The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that prim_lc_sync (https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.

With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of prim_lc_sync. The lc_rma_req signal switches from off to on. The randomized delays in prim_lc_sync (due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.

Is this assertion in the life cycle controller documentation sufficient here:

However, since the ON / OFF encoding above has been chosen such that all bits toggle exactly once for a transition from OFF to ON (and vice-versa), all that needs to be done is guard against metastability using a two-stage synchronizer, as illustrated below

Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).

Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).

P0 as this could be a major issue with OTBN (and maybe other blocks that do similar things), but it may be we're happy to guarantee we won't see invalid values out of a prim_lc_sync under these circumstances and there's no issue here.

@msfschaffner @matutem @moidx any thoughts?

a-will commented 1 year ago

The life cycle controller documentation states that for the multibit lifecycle signals only a simple 2 stage synchronizer is required and that prim_lc_sync (https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#control-signal-propagation) must be used.

With CDC instrumentation enabled a failure is seen in the OTBN regression due to the use of prim_lc_sync. The lc_rma_req signal switches from off to on. The randomized delays in prim_lc_sync (due to the CDC instrumentation) result in OTBN observing an invalid MUBI value between the off and on transition (the bits flips don't all come through together). This results in OTBN raising a fatal alert and going into a locked state.

Is this assertion in the life cycle controller documentation sufficient here:

However, since the ON / OFF encoding above has been chosen such that all bits toggle exactly once for a transition from OFF to ON (and vice-versa), all that needs to be done is guard against metastability using a two-stage synchronizer, as illustrated below

Whilst a synchronizer will prevent metastability it doesn't guarantee all bit transitions will make it through at once. Perhaps we can guarantee this from the life cycle controller design itself (e.g. these signals come direct from flops which are always physically grouped together).

Alternatively OTBN has used these incorrectly and there's no guarantee invalid values won't be seen and receiving blocks should just compare to on and off and ignore invalid values (rather than triggering an alert as OTBN does).

Sounds like we've got some conflicting designs here. The synchronization scheme only works if the signaling portion of the doc is followed:

For all life cycle signals except the escalation signal ESCALATE_EN, all values different from ON must be interpreted as OFF in RTL. In case of ESCALATE_EN, all values different from OFF must be interpreted as ON in RTL.

It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.

Edit: Actually, if ESCALATE_EN triggers something when it goes to ON, that isn't going to be safe with just a prim_flop_2sync. Intermediate values will be bad.

GregAC commented 1 year ago

It sounds like we decided that there is a third space that is now marked as "invalid", and that invalidates the synchronization methodology, hehe.

Yes it does seem this way. So this is an OTBN RTL bug, potentially a critical one if we end up hitting the invalid signal case in normal power on.

We should review all uses of prim_lc_sync to see if there are other potential issues (especially in light of @a-will's edit)

GregAC commented 1 year ago

The saving grace here may be how lifecycle controller operates (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#power-up-sequence)

Once the broadcast is complete and signals stable, modules held under reset by “sys_rst_n” are then released from reset to begin mission mode operations (this includes the processor). Note this point is also when it is safe for DFT to commence operations, as DFT functions may be blocked until life cycle completes its broadcast.

The signals from lifecycle controller are effectively static as they're set at power on and nothing else will come out of reset til they're hopefully stable.

Also note (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#normal-operation)

Once the life cycle system is powered up and stable, its outputs remain static unless specifically requested to change or affected by security escalation.

So it may be we only get issues around life cycle transitions and we can reset to bring the system back to a usable state.

Do we have any scenarios where we change life cycle and we explicitly cannot reset?

GregAC commented 1 year ago

Ah it seems on a transition a reboot is explicitly required (from https://opentitan.org/book/hw/ip/lc_ctrl/doc/theory_of_operation.html#post-transition-handling)

After a transition request, whether it was unconditional or conditional, the life cycle controller always disables all of its decoded outputs and puts the system in an inert state. The device is then expected to reboot before returning to a functional state.

This may all be fine then, indeed you could even argue the synchronisers aren't needed at all!

msfschaffner commented 1 year ago

Thanks @GregAC and @a-will for digging into this.

Let me add a bit of context: The simpler synchronization approach has been chosen since it basically avoids the addition of a multiplexer and additional flop stage to perform a consistency check for outputting a "rectified" value - which could introduce additional vulnerabilities, depending on how this is implemented. This scheme only works since the lc signals are quasi-static and only encode two states (ON, OFF). However, we need to follow the design guidance in the consuming blocks, since stagger is expected after CDCs.

I.e., we should only ever test for two sets of values, depending on the intent of the signal, as @a-will has pointed out. E.g. for enabling critical functionality, we would test for == On and != On, whereas for lc_escalate we would test for == Off and != Off.

Now, we do have applications where you want to test for invalid multibit encodings between On and Off - but this should only be done in designs where all sources of the signals are completely synchronous and in the same clock domain. Because of that, it is typically something we should only do with MuBi* signals (if at all), but not with lifecycle signals.

I realize that OTBN converts several life cycle signals to MuBi* signals, which may be a source of confusion.


@GregAC is correct that this should not be an issue for most cases due to the reset sequencing of the system. Also, the behavior after performing a life cycle transition is largely irrelevant since we are going to reset the system anyways.

However, it looks like we still need to devise a fix for this due to the following reasons:

I think we should try evaluating fixes on multiple fronts:

msfschaffner commented 1 year ago

Update after further investigations:

https://github.com/lowRISC/opentitan/blob/081f7a2b0b1304d5b00e72a0f8406b8094819a8a/hw/top_earlgrey/data/top_earlgrey.hjson#L669-L675

https://github.com/lowRISC/opentitan/blob/081f7a2b0b1304d5b00e72a0f8406b8094819a8a/hw/top_earlgrey/data/top_earlgrey.hjson#L596-L598

The action items are hence as follows:

msfschaffner commented 1 year ago

The CDC path between flash_ctrl and otbn has been checked in prime time and is confirmed to be a synchronous path. The path is therefore not of concern for the ES tapeout.

Since the remaining action items have been spun out into seprate issues, this issue can be closed.