Fatal Glitching in Internal Asynchronous Reset Generation

meggiman commented 2 years ago

Besides the additional Xilinx BSCANE2 related features, commit b5d1610ec869f17 introduced some new logic to allow the dmi_jtag module to be reset with TMS cycling by entering the "TestLogicReset" state. However, the way this is implemented in dmi_jtag_tap.sv line 205 is extremely dangerous; the generated trst_no signal is prone to glitching and actually drives asynchronous resets within dmi_jtag.sv and dmi_cdc.sv

meggiman commented 2 years ago

@msfschaffner

msfschaffner commented 2 years ago

Thanks for reporting this to us, @meggiman

CC @vogelpi @GregAC @rswarbrick @tjaychen

I believe we are not yet vendoring in this bug, since we are using an older version.

But given that several other bugs have been fixes in the mean time, we should re-vendor the module AFTER this glitching issue has been resolved.

bluewww commented 2 years ago

The lastest release I did was v0.4.1 which doesn't have this bug.

tjaychen commented 2 years ago

from the perspective of opentitan, do we care about trst_no being potentially vulnerable to glitches? Assuming we did our hardening right, if we are not in the right life cycle state, the output for these things should be guarded or made ineffective (for example here).

Let me know what you think.

msfschaffner commented 2 years ago

I believe it is less of a security concern, but more a reset signal integrity issue in this case. I.e., the reset signal is not cleanly asserted / deasserted, but may glitch around as it is driven by combo FSM logic.

I agree that security-wise, this should not be a concern in PROD states in OpenTitan since the debug module reset request is gated in those states: https://github.com/lowRISC/opentitan/blob/b82502587e02790dec9f9301180f9d50e489bd5e/hw/ip/rv_dm/rtl/rv_dm.sv#L196-L197

vogelpi commented 2 years ago

We currently vendor in 4befe83 which is the commit directly before v.0.4.1. The only difference between the two is the version tag. The critical commit is just the one after v.0.4.1.

I agree with @tjaychen that from a hardening perspective we should be able to guard against this. So at least for OT it's more a question of good design practice and avoiding tooling/backend issues.

Looking at the latest version of dmi_jtag_tap.sv: https://github.com/pulp-platform/riscv-dbg/blob/b5d1610ec869f174755688e9d21f5af9b3357468/src/dmi_jtag_tap.sv#L199-L220

I am not sure if its really a big issue as trst_no involves only little combinational logic. The logic is controlled directly by a set of flip flops (tap_state_q). But one still has to be aware of this from a tooling point of view. Also, the it looks like trst_no is only used locally inside dmi_cdc. Wouldn't it be possible to put a proper reset synchronizer on trst_no? Anyway, thanks for flagging this @meggiman

msfschaffner commented 2 years ago

I believe this actually caused a real issue on one of the PULP FPGA prototypes (@meggiman can probably say more about this, since he found the bug).

Agreed, I would also just use a standard reset synchronizer for this signal.

meggiman commented 2 years ago

Indeed @msfschaffner! The problem is, that it is not a single flip-flop that controls the reset but several of them (tap_state_q). The signal is thus driven by combinational logic using those FFs as inputs. The problem is not just theoretical or tooling related. I only discovered this because OpenOCD failed to communicate with my design post implementation (in an FPGA) for exactly that reason. Transitions between JTAG states caused spurious resets even though we did not enter the TestLogicReset state because the state transitions caused glitching on the reset line.

Furthermore, there is a fundamental problem in resetting the CDC (within dmi_cdc.sv) one-sidedly; depending on the current state of the CDC, resetting only the sending or the receiving side of it will most likely cause spurious transaction of invalid data to be issued (e.g. if you reset the read pointer of a gray-counting CDC FIFO without also simultaneously resetting the write pointer). However, requiring support for both asynchronous reset (jtag_trst_ni) and functional reset (TestLogicReset and dmactive bit) that reliably resets also the CDC boundary to flush outstanding DMI requests does not seem like a trivial problem to me. Just driving both async resets from the JTAG side is for sure gonna introduce reset-domain-crossing issues in the SoC domain (after all the JTAG resets are not only expected to happen during power-on). Just synchronizing the reset (deassertion and assertion due to RDC) into the other domain is not trivial either: After all, we are breaking CDC protocol assumptions (only one bit changes per cycle) when we simultaneously reset all FFs. So if we synchronize the reset with the same latency into the destination clock domain as the data signals, we cannot be sure that the reset signal is sampled in the same cycle that all the other domain crossing signals change value to the reset state. Also, are we still sure the CDC is still working correctly when one side deasserts the reset before the other?

I'm currently working on an enhanced version of the CDC used in the dmi_cdc.sv that supports clearing from either side to use it for fixing this issue. It uses a closed-loop synchronization of the clear signal into the other domain and ensures that the clear signal arrives in the other domain before the cleared domain crossing signals are sampled in the destination domain. If you know any alternative or proven to work solutions to flush a CDC without clock gating one of the domains I would be very open to suggestions.

vogelpi commented 2 years ago

Hi @meggiman , thanks for giving more details on why this isn't a good idea. This sounds really bad indeed!

About the CDC: I agree that getting this right is not trivial. I would recommend Hubert Kaeslin's book on this, see Chapter 7 in the first edition. Maybe using a 4-phase handshake synchronizer instead of gray-counting CDC FIFO can simplify things a bit. If I remember correctly, 4-phase version always knows the state of the other side (e.g. if dst has accepted/consumed the data already). It has higher latency and lower throughput though. For details, see Section 7.2.4 in the book.

tjaychen commented 2 years ago

hey all, sorry i completely read the initial bug wrong. I thought the issue was about a flop being vulnerable to attack glitching, i didnt realize the question was about combo state glitching for an async reset.

Yes then i definitely agree this needs to be fixed in some way.

msfschaffner commented 2 years ago

CC @rswarbrick @gregAC

pulp-platform / riscv-dbg

Fatal Glitching in Internal Asynchronous Reset Generation #120