lowRISC / opentitan

OpenTitan: Open source silicon root of trust
https://www.opentitan.org
Apache License 2.0

[test-triage] flaky UART connection between SAM3X and OT #22152

Closed timothytrippel closed 8 months ago

timothytrippel commented 8 months ago

Hierarchy of regression failure

Chip Level

Failure Description

There appears to be a flakiness issue with some tests that make use of UART communication between the host and device (e.g., ones that use ujson). Specifically, in some cases, the host seems to inject 0xff bytes in the UART stream. @pamaury and @jwnrt have more details on the exact issue at hand here.

The issue goes away on the hyper310 platform, suggesting it may be an issue with the SAM3X firmware. Additionally, older versions of the SAM3X firmware (i.e., version 0.40.1) do not seem to cause the issue, while version 1.2.0 does.

The current OpenTitan code base uses a pinned version of the chipwhisperer-minimal Python package that contains version 1.2.0 of the SAM3X firmware. However, the SAM3X firmware on the CI FPGAs is not updated between CI runs, so simply updating the pinned chipwhisperer-minimal Python package will not have an effect. Moreover, the tip of tree of the chipwhisperer-minimal Python package repo does not seem to have any newer SAM3X firmware anyway.

There are a few current suggestions for mitigating this issue:

  1. move all SAM3X tests (i.e., FPGA tests that use the cw310 interface) to use the hyper310 interface instead; @a-will I thought you tried this once but ran into issues?
  2. downgrade the SAM3X firmware to version 0.40.1; the chipwhisperer-minimal Python package repo commit with this version is here.
  3. debug the current SAM3X firmware; maybe @colinoflynn can point us in the right direction

In the interim, I filed #22153 to convert a test that was exhibiting this issue to solely use the hyper310 interface.

Steps to Reproduce

@pamaury and @jwnrt can explain how they were able to route signals out to the PMOD connector to tap them, and the challenges they are currently encountering in doing so.

Tests with similar or related failures

Maybe others (@pamaury @jwnrt)?

pamaury commented 8 months ago

Just some background on how we narrowed down the issue and a small caveat. Since the path connecting the FPGA UART pin to the SAM3X is on the PCB with no header to connect to, we were not able to directly analyze the signal with a logic analyzer. However, we first confirmed, using Wireshark and tracing the USB packets, that the extra 0xff was never sent by the host.
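For anyone wanting to repeat that comparison offline, here is a minimal sketch of one way to locate injected bytes, assuming you already have the bytes the host sent (e.g. extracted from the USB capture) and the bytes the device reported receiving. The function name `find_injected_bytes` and the sample payload are hypothetical; this is not the tooling we actually used.

```python
from difflib import SequenceMatcher


def find_injected_bytes(sent: bytes, received: bytes) -> list[tuple[int, bytes]]:
    """Return (offset_in_received, extra_bytes) for every run of bytes
    that appears in `received` but has no counterpart in `sent`."""
    matcher = SequenceMatcher(a=sent, b=received, autojunk=False)
    injected = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # bytes present only on the receive side
            injected.append((j1, received[j1:j2]))
    return injected


# Hypothetical example payload, not taken from a real capture.
sent = b'{"a":1}'
received = b'{"a"\xff:1}'
extra = find_injected_bytes(sent, received)
print(extra)
```

An empty result means the two streams agree; any entry points at the offset where an extra byte showed up on the receive side.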

After that, Adrian from our team had the following clever suggestion: set up UART1 in line loopback mode and, using the pinmux, connect the UART1 RX to the same pin as UART0 RX (i.e., the SAM3x TXD). Then connect the UART1 TX to a pin on the PMOD header. Now we have a "copy" of the SAM3x TXD on the PMOD header and we can attach a logic analyzer. We were able to confirm that the extra 0xff was indeed on the wire.

Now there is a small caveat to that: technically we are not seeing exactly the SAM3x TXD, we are seeing what the UART thinks it is seeing, so any problem in the pinmux or in the sync flops in the UART could cause an issue. In particular, note that 0xff is literally just a high-low-high signal (a low start bit followed by all-high data and stop bits), so a glitch would be identified as 0xff. Nevertheless, this seems unlikely: on the analyzer, we saw back-to-back characters with perfect timing, including the 0xff, so it would be incredibly unlikely that a glitch was inserted with exactly the length of a character: uart_ff
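To illustrate why a glitch and a real 0xff look alike, here is a minimal sketch of 8N1 framing, assuming one sample per bit period and an idle-high line. The helper names `uart_frame_bits` and `decode_8n1` are made up for this example and are not part of any OpenTitan or ChipWhisperer code.

```python
def uart_frame_bits(byte: int) -> list[int]:
    """Line levels for one 8N1 frame: start bit (0), 8 data bits LSB first, stop bit (1)."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [1]


def decode_8n1(levels: list[int]) -> list[int]:
    """Decode per-bit-period line levels (idle = 1) into bytes."""
    out = []
    i = 0
    while i < len(levels):
        if levels[i] == 1:  # idle, keep looking for a start bit
            i += 1
            continue
        frame = levels[i + 1 : i + 10]
        frame = frame + [1] * (9 - len(frame))  # treat end of capture as idle
        data, stop = frame[:8], frame[8]
        if stop == 1:  # only accept frames with a valid stop bit
            out.append(sum(bit << n for n, bit in enumerate(data)))
        i += 10
    return out


# A real 0xff frame is a single low bit period followed by highs.
print(uart_frame_bits(0xFF))

# A one-bit-wide low glitch on an otherwise idle line decodes as 0xff too.
glitch_line = [1, 1, 1, 0] + [1] * 12
print(decode_8n1(glitch_line))

# Back-to-back characters, as on the analyzer capture: the 0xff sits exactly
# on a frame boundary between its neighbours.
line = uart_frame_bits(0x41) + uart_frame_bits(0xFF) + uart_frame_bits(0x42)
print(decode_8n1(line))
```

In this model the glitch only decodes as 0xff when it is about one bit period wide, and it only lines up with neighbouring characters if it lands exactly on a frame boundary, which is why the back-to-back timing on the capture points to a genuinely transmitted byte.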

jwnrt commented 8 months ago

We found this flakiness to be reliably reproducible on version 1.2.0 of the CW310's SAM3X firmware, but not versions 0.40.1 or 1.3.0. I have updated the firmware versions of our boards in CI to 1.3.0 and we will work on making the update process automatic for future consistency.

jwnrt commented 8 months ago

FYI I've had to downgrade CI back to 1.2.0 because of a USB error that was introduced, so we might see this problem in CI again. Leaving the issue closed since we know the fix is to update the firmware.