Closed mkeeter closed 1 year ago
Testing PHY reinitialization with a long wait time (10s) still leaves the system stuck. This leads me to suspect the VSC7448, since it's not being power cycled, but getting ground-truth readings is going to be essential for debugging.
:( ok. Someone will need to probe the board in the office asap
OK so sad news, our osc3 does a thing that makes it output the incorrect frequency sometimes. See this note from microshit: https://ww1.microchip.com/downloads/en/DeviceDoc/DSC11xx-Family-Silicon-Errata-DS80000982A.pdf
Unfortunately the parts which tri-state are unobtainable or have very long lead times.
This is a happy clock:
This is a sad clock:
Unfortunately sometimes we see a sad clock...
The off-frequency signal, which is a symptom of the above problems:
This is the VSC7448 side of our link:
Unfortunately, I don't think it's wise to plan any more rework on this pass of sidecar. As discussed on the hardware tactical today, given the ~2% boot failure rate, we're proposing that the sidecar power-cycle the qsfp board (software workaround) in the cases where this issue is detected. @arjenroodselaar is signed up to scope out that work.
This isn't awesome and we should consider using a different part in the future.
First, we are going to try and power cycle the Front IO board from Sidecar.
Alternatively, per our huddle, I plan to sever the FPGA's connection to the enable of our current osc (in a repairable way), and we will attempt to work with the VSC when we violate its sequencing instructions (it typically wants power before refclk).
An update on this issue: https://github.com/oxidecomputer/hubris/tree/front_io_bad_osc contains changes across the sequencer task, monorail task and the controller bitstreams to work around this issue. This is currently running in a loop where the system is power cycled and the links are checked afterwards. So far the monorail task has detected two instances where the QSGMII link did not come up and the front IO board needed to be power cycled and the PHY reinitialized to work around the problem. Afterwards the QSGMII link and technician ports worked as intended.
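For readers following along, the control flow of the workaround can be sketched roughly as below. This is not the code from the branch: the `FrontIo` trait, its method names, `bring_up_link`, and the retry limit are all hypothetical stand-ins for the sequencer and monorail interactions described above.

```rust
/// Hypothetical abstraction over the hardware actions described above.
/// None of these names come from the actual Hubris tasks.
trait FrontIo {
    /// Ask the sequencer to power cycle the front IO board.
    fn power_cycle(&mut self);
    /// Reinitialize the technician port PHY.
    fn reinit_phy(&mut self);
    /// Report whether the QSGMII link has synced.
    fn qsgmii_link_up(&mut self) -> bool;
}

#[derive(Debug, PartialEq, Eq)]
enum BringupResult {
    LinkUp { power_cycles: u32 },
    GaveUp { power_cycles: u32 },
}

/// Initialize the PHY, check the link, and power cycle the front IO board
/// (then try again) while the link stays down, up to `max_power_cycles`.
fn bring_up_link<F: FrontIo>(hw: &mut F, max_power_cycles: u32) -> BringupResult {
    let mut power_cycles = 0;
    loop {
        hw.reinit_phy();
        if hw.qsgmii_link_up() {
            return BringupResult::LinkUp { power_cycles };
        }
        if power_cycles == max_power_cycles {
            return BringupResult::GaveUp { power_cycles };
        }
        hw.power_cycle();
        power_cycles += 1;
    }
}

/// Test double: the oscillator starts "sad" and only a power cycle
/// gives it another chance, mirroring the behavior reported in this issue.
struct FakeFrontIo {
    bad_starts_left: u32,
}

impl FrontIo for FakeFrontIo {
    fn power_cycle(&mut self) {
        self.bad_starts_left = self.bad_starts_left.saturating_sub(1);
    }
    fn reinit_phy(&mut self) {
        // Reinitializing the PHY alone does not fix a bad clock.
    }
    fn qsgmii_link_up(&mut self) -> bool {
        self.bad_starts_left == 0
    }
}
```

The bounded retry count is an assumption of this sketch; it just keeps a permanently broken board from looping forever.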
This will take a few days to get through review, but so far a software workaround seems adequate.
This ran overnight and 1464 power cycles of Sidecar were done. During 57 of those cycles, monorail-server determined that the QSGMII link was not functional and requested one or more power cycles of the front IO board from the sequencer. Once the QSGMII link came up, ping tests using both technician ports succeeded in all 1464 cycles.
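As a quick sanity check on those counts, the share of boots that needed the workaround can be computed directly. The helper name below is made up for illustration; only the two counts come from the overnight run.

```rust
/// Percentage of boots in which the workaround had to kick in.
/// Returns `None` when no boots were recorded.
fn failure_rate_percent(failures: u32, total: u32) -> Option<f64> {
    if total == 0 {
        return None;
    }
    Some(f64::from(failures) * 100.0 / f64::from(total))
}
```

With 57 affected cycles out of 1464, this comes out to roughly 3.9%, somewhat above the ~2% estimate quoted earlier in the thread.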
Done in #1449
When running a hard reboot loop to debug #1399, I found the system in a state where the link from the VSC7448 to the technician port PHY was down.
This is distinct from #1399, where the QSGMII link is fine, but the VSC7448 is dropping packets in its queue system.
We see various bits indicating that the QSGMII link can't sync up:
(the latter has bits set for "Comma realigned", "SerDes signal detect", and "MAC comma detect", but not "QSGMII sync status")
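To make the symptom concrete, a check for "signal present but no sync" might look like the sketch below. The bit positions are placeholders chosen for illustration, not values from the PHY or VSC7448 datasheets, and the function name is hypothetical.

```rust
// Placeholder bit positions: NOT taken from any datasheet.
const COMMA_REALIGNED: u16 = 1 << 0;
const SERDES_SIGNAL_DETECT: u16 = 1 << 1;
const MAC_COMMA_DETECT: u16 = 1 << 2;
const QSGMII_SYNC_STATUS: u16 = 1 << 3;

/// True when the SerDes sees a signal and commas are detected,
/// but the QSGMII link has not reported sync.
fn sees_signal_but_no_sync(status: u16) -> bool {
    let signal = SERDES_SIGNAL_DETECT | MAC_COMMA_DETECT;
    (status & signal) == signal && (status & QSGMII_SYNC_STATUS) == 0
}
```

A register value with the three bits named above set and the sync bit clear matches the failure seen here; the real masks would have to come from the datasheet.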
The failure was not resolved by reinitializing the PHY (using Monorail.reinit); it was only resolved by power-cycling the entire Sidecar. This is confusing, because reinitialization should also power-cycle the PHY. It's possible that the wait time of 10 ms isn't sufficient to fully discharge the rail.
Next steps are:
cc @Aaron-Hartwig @refugeesus