[usbdev] Increase robustness of packet reception when SYNC signal is corrupted.

moidx commented 3 months ago

Description (copied over from issue #23719)

The packet reception within USBDEV does not cope well with a corrupted/invalid SYNC signal at the start of the packet, leading to a failure of the OUT side state machine. Invalid/corrupted SYNC signals do not occur in normal operation but there is a risk of them being received through interference, signal integrity issues or substantial host-device frequency mismatch leading to sampling issues.

The observed behavior in block level DV is that the 'pkt_start_o' signal is asserted by 'usb_fs_rx' leading to the OUT side state machine (usb_fs_nb_out_pe) advancing into StRecvdDataStart in anticipation of receiving a complete, valid data packet. The signal 'valid_packet_o' from the _rx module is not asserted for that packet and with the present logic this means that 'pkt_end_o' is also not asserted. The state machine relies critically upon 'pkt_end_o' in order to leave the StRecvdDataStart at the end of the packet. It thus wait until the end of a subsequent packet, if any, at which point communications failure is likely to occur through - for example - a Control Transfer having been retired unsuccessfully.

An additional problem is that the packet reception logic will start up again if the KJKJKK sequence is spotted within the packet - where it may quite legitimately occur - following detection of a bit stuffing violation within a packet.

PR #23717 proposes a change to the RTL to improve resilience/recovery and permits the block level DV tests (usbdev_invalid_sync and usbdev_stress_usb_traffic) to run to completion.

Description (copied over from PR #23717)

Commit 1:

Modify end of packet detection so that it does not depend upon the validity of the SYNC signal. End of packet must always be signaled to the packet state machines even if the packet has been declared invalid, because otherwise the OUT side state machine will wait indefinitely for the end of a DATA packet.

This does not occur in normal use, but may occur in the presence of interference, signal integrity issues and/or improper sampling.

Background: fault injection testing in block level DV transmitted a valid OUT packet which was then followed by a DATA packet with a corrupted SYNC signal. The result is that 'packet_valid_q' does not become set for the DATA packet, although 'pkt_start_o' is asserted, and there is thus no 'pkt_end_o' assertion for the DATA packet. The OUT side state machine remains in the StRcvdDataStart state, from which there is no exit until the end of a subsequent packet.

The RTL change proposed in this draft PR permits the block level test to run to completion. (The test in question is generating invalid traffic whilst a streaming + checking process is running and keeping the DUT busy; the invalid traffic is expected to produce no response from the DUT but because of the above issue caused interference with the streaming operation.)

Update: There is a further deficiency in the handling of 'packet_valid_q' because of the way that bit stuffing errors and preamble cause the signal to be deasserted. After a bit stuffing error the logic can wake up again mid-packet in response to a bit sequence that looks like a SYNC signal. There is nothing to stop a KJKJKK sequence occurring within the ensuing bits. This PR now remembers the fact that the packet is being ignored and prevents 'packet_valid_d|q' becoming reasserted prematurely.

DV sequence usbdev_bitstuff_err has been modified in PR #24234 to sidestep the problems with detection of bit stuffing violations, pending inclusion of RTL refinements in a later revision.

moidx commented 3 months ago

Hi @a-will can we get your assessment on this issue?

a-will commented 3 months ago

@alees24 found a robustness / fault tolerance issue. The OUT packet engine can freeze if there are errors in the SYNC signal during the data phase of the transaction. At first glance, the severity doesn't seem particularly high, though, as the likelihood of unrecoverable failures seems to be very low.

The FSM should unwedge on the next token packet (heading back to Idle and transmitting no handshake packet for potentially both transactions) or on the reception of a SYNC-like data pattern, where a highly unlikely sequence could lead to bad data being accepted. For the latter case, the data stream would need to have the equivalent of a SYNC + correct PID and still pass the CRC-16 for bad data to be accepted.

There's some badness in the interface between usb_fs_rx and other modules that could be cleared away by not exposing them to the SYNC at all. If response timeouts were increased to accommodate time for a SYNC, then usb_fs_rx.pkt_start_o could be properly tied to usb_fs_rx.packet_start, and the interface would become the boundaries that mark the regions where we receive actual bits. In that case, the PID check would be guaranteed to occur.

After a failed SYNC, usb_fs_rx probably should also be rejecting inputs until it sees at least the [SE0, J] sequence, but the packet engines shouldn't need it.

Overall, if the RTL fix is small enough and can be taken easily, I think it'd be good to make the ECO. However, the bug doesn't seem critical enough to warrant any more major efforts (e.g. resynthesis). The biggest consequence of not having the fix is probably the nuisance the bug creates for sign-off and DV.

Please correct me, @alees24, if I have made an error in the reasoning.

vogelpi commented 3 months ago

We've discussed this issue in the tape-out sync yesterday (2024-06-24) and reached consensus to:

Not merge this / integrate this as ECO now. The reasons are mostly that i) the issue isn't super critical, ii) normal operation shouldn't be affected (the fix improves things in the presence of interference and signal integrity issues), and iii) to increase confidence proper FPGA soak time would be needed which we don't have.
Integrate the change post M5 after the TO branch has been taken to allow this USBDEV improvement for future releases / revisions of the design.

Integrating this improvement after taking the TO branch should be easy as it doesn't involve any software changes which could break the compatibility with Earlgrey-PROD A1.

andreaskurth commented 3 months ago

We just discussed this in triage and agreed to put it into Backlog, because it won't be implemented with an ECO or metal fix, thus not part of Earlgrey-PROD.

lowRISC / opentitan

[usbdev] Increase robustness of packet reception when SYNC signal is corrupted. #23770

Description (copied over from issue #23719)

Description (copied over from PR #23717)