lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0
2.57k stars 764 forks source link

[alert_handler_escalation] Avoid missing escalation requests #24100

Open matutem opened 3 months ago

matutem commented 3 months ago

Description

There is a similar issue in lc_ctrl, and perhaps in rv_cor_ibex (for NMI requests): escalation requests detected with clk_esc_i and processed with clk_i. However, at least for lc_ctrl we most likely program the threshold to a value enough for any side-effects to complete before the next phase.

The clk_esc_i and clk_i run at the same frequency but clk_esc_i can be gated while clk_i is running. I think this is not a CDC problem for the chip if the alert handler is programmed correctly as explained below.

Once an escalation request is received by pwrmgr it is flopped and will stay active until reset. So the key is whether we can miss a pulse.

The logic driving the escalation requests in alert_handler will move to progressively higher phases whenever a counter reaches a per phase threshold, and will stay in any phase threshold + 1 cycles. When it moves through all states it reaches a terminal state.

If we want to avoid skipping escalation phases it is safer to program the phase cycles to a positive value for each enabled phase so the escalation request exceeds 2 cycles. For the phase that triggers the escalation hooked up to pwrmgr it could safely be programmed with a really large number, since the resulting reset will erase everything except the crashdump.

Notice even if we add synchronizers we could miss escalation requests if the threshold is zero since a phase could last a single cycle.

This means this is probably a documentation note.

vogelpi commented 3 months ago

Thanks @matutem for creating the issue. There are a couple of different points here that we should address differently.

In pwrmgr the escalation receiver is clocked by clk_esc_i and outputs esc_rst_req_d, https://github.com/lowRISC/opentitan/blob/master/hw/ip_templates/pwrmgr/rtl/pwrmgr.sv#L139. Its output is clocked by clk_lc_i and is used to trigger a reset

We've also come across this yesterday:

What we need to ensure here is that the tools correctly time this path

top_earlgrey/u_pwrmgr/u_esc_rx/u_prim_buf_esc_req/*
->
top_earlgrey/u_pwrmgr/esc_rst_req_q

And that it's not considered false. @meisnere from the PD team has just confirmed it's not considered false. So we are good here.

There is a similar issue in lc_ctrl, and perhaps in rv_core_ibex (for NMI requests): escalation requests detected with clk_esc_i and processed with clk_i.

Yes, the ones in lc_ctrl are fine. The one in rv_core_ibex (NMI) had a serious problem: There was a 2-flop synchronizer but the signal going into the synchronizer wasn't glitch free. This triggered a series of ECOs yesterday. It's fixed now.

However, at least for lc_ctrl we most likely program the threshold to a value enough for any side-effects to complete before the next phase.

The clk_esc_i and clk_i run at the same frequency but clk_esc_i can be gated while clk_i is running. I think this is not a CDC problem for the chip if the alert handler is programmed correctly as explained below.

Once an escalation request is received by pwrmgr it is flopped and will stay active until reset. So the key is whether we can miss a pulse.

The logic driving the escalation requests in alert_handler will move to progressively higher phases whenever a counter reaches a per phase threshold, and will stay in any phase threshold + 1 cycles. When it moves through all states it reaches a terminal state.

If we want to avoid skipping escalation phases it is safer to program the phase cycles to a positive value for each enabled phase so the escalation request exceeds 2 cycles. For the phase that triggers the escalation hooked up to pwrmgr it could safely be programmed with a really large number, since the resulting reset will erase everything except the crashdump.

Notice even if we add synchronizers we could miss escalation requests if the threshold is zero since a phase could last a single cycle.

This means this is probably a documentation note.

Yes I agree that this point can be covered by a documentation note. I'll tag the issue accordingly and move it to M7 for now.