Closed timothytrippel closed 8 months ago
Just some background on how we narrowed down the issue, plus a small caveat. Since the path connecting the FPGA UART pin to the SAM3X is on the PCB with no header to connect to, we were not able to analyze the signal directly with a logic analyzer. However, we first confirmed, by tracing the USB packets with Wireshark, that the extra `0xff` was never sent by the host.
After that, Adrian from our team had the following clever suggestion: set up UART1 in line loopback mode and, using the pinmux, connect the UART1 RX to the same pin as UART0 RX (i.e. the SAM3X TXD). Then connect the UART1 TX to a pin on the PMOD header. Now we have a "copy" of the SAM3X TXD on the PMOD header and we can attach a logic analyzer. We were able to confirm that the extra `0xff` was indeed on the wire.
Now there is a small caveat to that: technically we are not seeing exactly the SAM3X TXD, we are seeing what the UART thinks it is seeing, so any problem in the pinmux or in the sync flops in the UART could cause an issue. In particular, note that `0xff` is literally just a high-low-high signal, so a glitch would be identified as `0xff`. Nevertheless, this seems unlikely because on the analyzer we saw back-to-back characters with perfect timing, including the `0xff`, so it would be incredibly unlikely that a glitch was inserted with exactly the length of a character.
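To illustrate why `0xff` looks like a plain high-low-high signal, here is a minimal Python sketch of UART framing. It assumes 8N1 framing (one start bit, eight data bits sent LSB first, one stop bit, no parity), which the thread does not state explicitly, and it samples the line exactly once per bit period. The function names are hypothetical and not part of any OpenTitan tooling.

```python
def uart_8n1_levels(byte):
    """Return line levels (1 = high, 0 = low), one per bit period, for one frame.

    Assumes 8N1 framing: a low start bit, 8 data bits LSB first, a high stop bit.
    """
    data = [(byte >> i) & 1 for i in range(8)]  # LSB first
    return [0] + data + [1]


def decode_8n1(levels):
    """Decode bytes from a list of per-bit-period samples (idle line is high)."""
    out = []
    i = 0
    while i < len(levels):
        if levels[i] == 1:          # idle or stop level, keep scanning
            i += 1
            continue
        if i + 10 > len(levels):    # incomplete frame at the end of the capture
            break
        frame = levels[i + 1:i + 9]  # the 8 data bits after the start bit
        out.append(sum(bit << k for k, bit in enumerate(frame)))
        i += 10                      # skip start + data + stop
    return out


# A frame carrying 0xff is a single low bit period followed by nine high ones.
ff_frame = uart_8n1_levels(0xFF)

# A lone low pulse of about one bit period on an idle line decodes the same way.
glitch = [1, 1, 0] + [1] * 10
decoded_glitch = decode_8n1(glitch)
```

In this model a receiver cannot tell a genuine `0xff` from a one-bit-period low glitch, which is why the back-to-back timing seen on the analyzer is the more convincing evidence.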
We found this flakiness to be reliably reproducible on version 1.2.0 of the CW310's SAM3X firmware, but not versions 0.40.1 or 1.3.0. I have updated the firmware versions of our boards in CI to 1.3.0 and we will work on making the update process automatic for future consistency.
FYI I've had to downgrade CI back to 1.2.0 because of a USB error that was introduced, so we might see this problem in CI again. Leaving the issue closed since we know the fix is to update the firmware.
Hierarchy of regression failure
Chip Level
Failure Description
There appears to be a flakiness issue with some tests that make use of UART communication between the host and device (e.g., ones that use ujson). Specifically, in some cases, the host seems to inject 0xff bytes in the UART stream. @pamaury and @jwnrt have more details on the exact issue at hand here.
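When triaging a failing run, it can help to locate exactly where extra bytes appear in the received stream. The following is a minimal Python sketch using the standard library's `difflib`. The helper name and the sample payload are hypothetical, and it assumes you already have both the bytes that were expected and the bytes that were actually received.

```python
from difflib import SequenceMatcher


def find_injected_bytes(expected, received):
    """Return (offset_in_received, extra_bytes) for runs present only in `received`.

    Only pure insertions are reported; corrupted (replaced) bytes are ignored.
    """
    matcher = SequenceMatcher(None, expected, received, autojunk=False)
    extras = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            extras.append((j1, received[j1:j2]))
    return extras


# Hypothetical example: one stray 0xff in the middle of a JSON-like message.
expected = b'{"ok":true}\n'
received = b'{"ok"\xff:true}\n'
injected = find_injected_bytes(expected, received)
```

Here `injected` holds a single entry pointing at offset 5 of the received data with the payload `b'\xff'`.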
The issue goes away on the hyper310 platform, suggesting it may be an issue with the SAM3X firmware. Additionally, it seems that older versions of the SAM3X firmware (i.e., version `0.40.1`) do not cause the issue, while version `1.2.0` does.
The current OpenTitan code base uses a pinned version of the `chipwhisperer-minimal` python package that contains version `1.2.0` of the SAM3X firmware. However, the SAM3X firmware on the CI FPGAs is not updated between CI runs, so simply updating the pinned `chipwhisperer-minimal` python package will not have an effect. Moreover, it does not seem that the tip of tree of the `chipwhisperer-minimal` python package repo has any newer SAM3X firmware anyway.
There are a few current suggestions for mitigating this issue:
- [...] `cw310` interface to use the `hyper310` interface instead); @a-will I thought you tried this once but ran into issues?
- [...] `0.40.1`; the `chipwhisperer-minimal` python package repo commit with this version is here.
In the interim, I filed #22153 to convert a test that was exhibiting this issue to solely use the `hyper310` interface.
Steps to Reproduce
1. [...] `1.2.0`; this is easy to do with `bazel run //sw/host/opentitantool -- --interface=cw310 fpga get-sam3x-fw-version`, once #22150 merges.
2. `bazel test --test_strategy=exclusive --test_output=errors --runs_per_test=50 -t- //sw/device/tests:example_sival_fpga_cw310_rom_with_fake_keys`
@pamaury and @jwnrt can explain how they were able to pinout signals on the PMOD connector to tap, and the challenges they are currently encountering with doing so.
Tests with similar or related failures
Maybe others (@pamaury @jwnrt)?