oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0

QSGMII link to front IO PHY sometimes doesn't come up #1410

Closed: mkeeter closed this issue 1 year ago

mkeeter commented 1 year ago

When running a hard reboot loop to debug #1399, I found the system in a state where the link from the VSC7448 to the technician port PHY was down.

PORT | MODE    SPEED  DEV     SERDES  LINK |   PHY    MAC LINK  MEDIA LINK
-----|-------------------------------------|-------------------------------
 44  | QSGMII  1G     1G_20   6G_15   err  | VSC8562  err       down
 45  | QSGMII  1G     1G_21   6G_15   down | VSC8562  err       down

This is distinct from #1399, where the QSGMII link is fine, but the VSC7448 is dropping packets in its queue system.

We see various bits indicating that the QSGMII link can't sync up:

matt@lurch ~ (sidecar-sp) $ h monorail read HW_QSGMII_STAT[11]
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
humility: Reading HSIO:HW_CFGSTAT:HW_QSGMII_STAT[11] from 0x714601a0
HSIO:HW_CFGSTAT:HW_QSGMII_STAT[11] => 0x20
  bits |    value   | field
   6:1 | 0x10       | DELAY_VAR_X200PS
     0 | 0x0        | SYNC
matt@lurch ~ (sidecar-sp) $ h monorail phy read -p44 MAC_SERDES_PCS_STATUS
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
Reading from port 44 PHY, register EXTENDED_3:MAC_SERDES_PCS_STATUS
Got result 0xc405
  bits |    value   | field
    15 | 0x1        | MAC_SYNC_FAIL
    14 | 0x1        | MAC_CGBAD
    12 | 0x0        | SGMII_ALIGN_ERROR
    11 | 0x0        | MAC_LP_ANEG_RESTART
     5 | 0x0        | MAC_FDX_ADV
     4 | 0x0        | MAC_HDX_ADV
     3 | 0x0        | MAC_LP_ANEG_CAPABLE
     2 | 0x1        | MAC_LINK_STATUS
     1 | 0x0        | MAC_ANEG_COMPLETE
     0 | 0x1        | MAC_PCS_SIG_DETECT
matt@lurch ~ (sidecar-sp) $ h monorail phy read -p44 MAC_SERDES_STATUS
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
Reading from port 44 PHY, register EXTENDED_3:MAC_SERDES_STATUS
Got result 0xd000

(the latter has bits set for "Comma realigned", "SerDes signal detect", and "MAC comma detect", but not "QSGMII sync status")
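The field tables above can be reproduced by hand from the raw register values. Below is a minimal sketch of that decoding, assuming only the bit positions printed by humility in this issue; the helper names (`field`, `qsgmii_synced`, `phy_reports_sync_error`) are hypothetical and are not part of Hubris or humility.

```rust
/// Extract bits `hi..=lo` (inclusive, bit 0 is the LSB) from a register value.
/// Hypothetical helper for illustration only.
fn field(value: u32, hi: u32, lo: u32) -> u32 {
    assert!(hi >= lo && hi < 32);
    let width = hi - lo + 1;
    // Avoid shifting by 32, which would overflow for a full-width field.
    let mask = if width == 32 { u32::MAX } else { (1u32 << width) - 1 };
    (value >> lo) & mask
}

/// True when the SYNC bit (bit 0) of HW_QSGMII_STAT is set.
/// Bit position taken from the humility output in this issue.
fn qsgmii_synced(hw_qsgmii_stat: u32) -> bool {
    field(hw_qsgmii_stat, 0, 0) == 1
}

/// True when MAC_SYNC_FAIL (bit 15) or MAC_CGBAD (bit 14) is set in the
/// PHY's MAC_SERDES_PCS_STATUS register, again per the output above.
fn phy_reports_sync_error(mac_serdes_pcs_status: u32) -> bool {
    field(mac_serdes_pcs_status, 15, 14) != 0
}
```

With the values read above, `field(0x20, 6, 1)` gives `0x10` (DELAY_VAR_X200PS), `qsgmii_synced(0x20)` is false, and `phy_reports_sync_error(0xc405)` is true, which matches the decoded tables.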

The failure was not resolved by reinitializing the PHY (using Monorail.reinit); it was only resolved by power-cycling the entire Sidecar. This is confusing, because reinitialization should also power-cycle the PHY. It's possible that the wait time of 10 ms isn't sufficient to fully discharge the rail.

Next steps are:

cc @Aaron-Hartwig @refugeesus

mkeeter commented 1 year ago

Testing PHY reinitialization with a long wait time (10s) still leaves the system stuck. This leads me to suspect the VSC7448, since it's not being power cycled, but getting ground-truth readings is going to be essential for debugging.

refugeesus commented 1 year ago

:( ok. Someone will need to probe the board in the office asap

refugeesus commented 1 year ago

OK so sad news, our osc3 does a thing that makes it output the incorrect frequency sometimes. See this note from microshit: https://ww1.microchip.com/downloads/en/DeviceDoc/DSC11xx-Family-Silicon-Errata-DS80000982A.pdf

Unfortunately the parts which tri-state are unobtainable or have very long lead times.

refugeesus commented 1 year ago

This is a happy clock: image

This is a sad clock: image

Unfortunately sometimes we see a sad clock...

refugeesus commented 1 year ago

The off-frequency signal, which is a symptom of the above problems: image

This is the VSC7448 side of our link.

nathanaelhuffman commented 1 year ago

Unfortunately, I don't think it's wise to plan any more rework on this pass of sidecar. As discussed on the hardware tactical today, given the ~2% boot failure rate, we're proposing that the sidecar power-cycle the qsfp board (software workaround) in the cases where this issue is detected. @arjenroodselaar is signed up to scope out that work.

This isn't awesome and we should consider using a different part in the future.

refugeesus commented 1 year ago

First, we are going to try and power cycle the Front IO board from Sidecar.

Alternatively, per our huddle, I plan to sever the FPGA's connection to the enable of our current osc (in a reparable way), and we will attempt to work with the VSC when we violate its sequencing instructions (it typically wants power before refclk).

arjenroodselaar commented 1 year ago

An update on this issue: https://github.com/oxidecomputer/hubris/tree/front_io_bad_osc contains changes across the sequencer task, monorail task and the controller bitstreams to work around this issue. This is currently running in a loop where the system is power cycled and the links are checked afterwards. So far the monorail task has detected two instances where the QSGMII link did not come up and the front IO board needed to be power cycled and the PHY reinitialized to work around the problem. Afterwards the QSGMII link and technician ports worked as intended.
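For readers who want the shape of the workaround without reading the branch, here is a rough sketch of the detect, power-cycle, reinitialize loop described above. It is not the code from the branch: the `FrontIo` trait, `bring_up_qsgmii`, the attempt limit, and the `FlakyBoard` test double are all assumptions made up for illustration, and the real sequencer and monorail task interfaces in Hubris will differ.

```rust
/// Hypothetical hardware interface, invented for this sketch.
trait FrontIo {
    /// Returns true when the QSGMII link reports sync.
    fn qsgmii_link_up(&mut self) -> bool;
    /// Power cycles the front IO board (which restarts its oscillator).
    fn power_cycle_front_io(&mut self);
    /// Reinitializes the PHY after the board comes back.
    fn reinit_phy(&mut self);
}

/// Checks the link and, while it is down, power cycles the front IO board
/// and reinitializes the PHY, at most `max_cycles` times.
/// Returns Ok(power cycles used) or Err(power cycles used) if still down.
fn bring_up_qsgmii<H: FrontIo>(hw: &mut H, max_cycles: u32) -> Result<u32, u32> {
    for cycles in 0..=max_cycles {
        if hw.qsgmii_link_up() {
            return Ok(cycles);
        }
        if cycles < max_cycles {
            hw.power_cycle_front_io();
            hw.reinit_phy();
        }
    }
    Err(max_cycles)
}

/// Test double: the link only comes up after a set number of power cycles.
struct FlakyBoard {
    bad_boots_left: u32,
    power_cycles: u32,
    reinits: u32,
}

impl FrontIo for FlakyBoard {
    fn qsgmii_link_up(&mut self) -> bool {
        self.bad_boots_left == 0
    }
    fn power_cycle_front_io(&mut self) {
        self.power_cycles += 1;
        self.bad_boots_left = self.bad_boots_left.saturating_sub(1);
    }
    fn reinit_phy(&mut self) {
        self.reinits += 1;
    }
}
```

Bounding the number of attempts keeps a permanently broken board from looping forever; the limit itself is a placeholder, not a value taken from the branch.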

This will take a few days to get through review, but so far a software workaround seems adequate.

arjenroodselaar commented 1 year ago

This ran overnight and 1464 power cycles of Sidecar were done. During 57 of those cycles, monorail-server determined that the QSGMII link was not functional and requested one or more power cycles of the front IO board from the sequencer. Once the QSGMII link came up, ping tests using both technician ports succeeded in all 1464 cycles.
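As a quick check on those numbers, the fraction of Sidecar power cycles that needed the workaround can be computed directly. The function name below is hypothetical; the counts are the ones reported in this comment.

```rust
/// Percentage of runs in which the workaround was triggered.
/// Hypothetical helper for illustration only.
fn failure_rate_percent(failed: u32, total: u32) -> f64 {
    assert!(total > 0, "total must be non-zero");
    100.0 * f64::from(failed) / f64::from(total)
}
```

`failure_rate_percent(57, 1464)` comes out to roughly 3.9%. Note that this counts Sidecar power cycles with at least one front IO power cycle, not the total number of front IO power cycles requested.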

mkeeter commented 1 year ago

Done in #1449