rmustacc commented 1 year ago

While doing testing with rack 1 (and older RoTs) we've seen this while doing a loop across the fleet of the form: for f in BRM42220060 BRM42220041 BRM42220020 BRM42220035 BRM42220052 BRM42220087 BRM42220077 BRM42220019 BRM42220029 BRM42220075 BRM42220003 BRM44220002 BRM42220089 BRM42220053 BRM42220012 BRM42220068 BRM42220064 BRM42220048 BRM42220056 BRM42220054 BRM42220007 BRM42220042 BRM42220062 BRM42220076 BRM42220044 BRM42220010 BRM42220005 BRM42220024 BRM42220030 BRM42220061; do pfexec pilot sp exec -e 'update sp 0 /data/local/rack1/gimlet/c/sp.10ms/build-gimlet-c.zip' $f && pfexec pilot sp exec -e 'reset' $f; done

I'm grabbing dumps and will upload them to catacomb, but recording this so we don't lose track. Assigning to MVP milestone right now. Feel free to move at your discretion.

cbiffle commented 1 year ago

1171 doesn't look like a fix for this, it changes the I2C driver to be more tolerant of errors selecting muxes. To avoid coming up with a self-assigned address, the netstack would need to block startup until VPD is available, which could be forever if the bus is locked.

Which in turn kind of suggests to me that a bus lock is happening.

rmustacc commented 1 year ago

Dumps are on catacomb from 3 different occurrences in /data/staff/core/rack1/ as hubris.core.<sn>.badvpd. Two spot checked had a similarly suspicious ringbuf entry:

humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   1  162        1        1 Reset(I2C2, PortIndex(0x1))
   2  536        1        2 SegmentFailed(BusResetMux)
   3  162        1        1 Reset(I2C2, PortIndex(0x1))
   4  536        1        1 SegmentFailed(BusResetMux)
   5  536        1        1 SegmentFailed(BadMuxAddress)
   6  334        1        1 Error
   7  162        1        1 Reset(I2C3, PortIndex(0x0))
   0  334        2      952 Error

bcantrill commented 1 year ago

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd, but these all seem to be empty:

$ for f in *vpd ; do humility -d $f tasks net ; humility -d $f ringbuf drv_local_vpd; done
humility: attached to dump
system time = 1248384
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117) wake-timer(T+413)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 1744761
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 742075
ID TASK                       GEN PRI STATE    
 1 net                          0   5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:

Whatever is causing the VPD to be reverting to the default must be happening elsewhere -- and I am wondering if the mux issue is causing us to see the wrong EEPROM? One definite step to take: we should expand the ring buffer in the drv_local_vpd driver to capture every error path.

cbiffle commented 1 year ago

(Edit: this was in response to @rmustacc above -- Github decided not to show me Bryan's 11 hour old comment before I posted for some reason.)

That ringbuf Error entry occurs when the underlying write_read operation returns an error, but frustratingly it does not indicate what error. Its comments suggest that the line number (334 here) should indicate it, but that line is recording an Error entry in response to any error code from the next layer down, so it doesn't tell us.

That's probably worth fixing.

cbiffle commented 1 year ago

If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd

@bcantrill - This appears to be true for only a narrow subset of I2C failures. There are a lot of code paths in drv/local-vpd that can return Err without recording anything -- the ringbuf is only updated from the TlvcRead implementation.

For best assurance, we probably want to record any errors that hit this point in the netstack -- currently they're being discarded by the .ok().

bcantrill commented 1 year ago

This issue is still with us! We saw this in the rack on BRM42220070 (dump):

$ humility -d ./hubris.core.BRM42220070.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   50        1        1 LocalVpdError(NoRootChunk)
   1   67        1        1 LocalVpdError(NoRootChunk)
   2   50        1        1 LocalVpdError(NoRootChunk)
   3   67        1        1 LocalVpdError(NoRootChunk)
   4   50        1        1 LocalVpdError(NoRootChunk)
   5   67        1        1 LocalVpdError(NoRootChunk)
   6   50        1        1 LocalVpdError(NoRootChunk)
   7   67        1        1 LocalVpdError(NoRootChunk)
   8   50        1        1 LocalVpdError(NoRootChunk)
   9   67        1        1 LocalVpdError(NoRootChunk)

This is surprising, as it's not a device error, but rather seemingly getting the wrong data from the EEPROM. This eliminates one hypothesis: if we somehow had the wrong mux/segment selected, we would expect to still find a FRU0 tag, and get a LocalVpdError(NoSuchChunk). To explore this a bit better, I added slightly better error recording in #1277. I then put BRM42220067 in a reboot loop whereby we reset the SP when we hit A0PlusHP; after about three hours it hit this condition ( dump):

$ humility -d ./hubris.core.BRM42220067.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
 NDX LINE      GEN    COUNT PAYLOAD
   0   51        1        1 MacLocalVpdError(BadRootChunk)
   1   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   2   51        1        1 MacLocalVpdError(BadRootChunk)
   3   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   4   51        1        1 MacLocalVpdError(BadRootChunk)
   5   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   6   51        1        1 MacLocalVpdError(BadRootChunk)
   7   68        1        1 BarcodeLocalVpdError(BadRootChunk)
   8   51        1        1 MacLocalVpdError(BadRootChunk)
   9   68        1        1 BarcodeLocalVpdError(BadRootChunk)

This is surprising: it means we are getting the wrong data from the device -- and seemingly repeatedly. humility vpd yielded:

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
humility vpd failed: bad header: HeaderCorrupt { stored_checksum: db3bc619, computed_checksum: db3bc749 }

Checking a Sharkfin VPD, it seem intact:

$ humility vpd --read --id 0
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000021:001:BRM39220519",
        ]),
    ]),
]

After running this, the Gimlet VPD data was retrieved correctly!

$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
    ("FRU0", [
        ("BARC", [
            "0XV1:9130000019:006:BRM42220067",
        ]),
        ("MAC0", [
            [
                168,
                64,
                37,
                4,
                2,
                5,
                8,
                0,
                8,
            ],
        ]),
    ]),
]

From this point, the condition was seemingly chased away. Restarting gimlet_seq resulted in a correct read of the VPD.

Something very odd is going on; reopening this issue, and getting a Saleae on BRM42220067 while I attempt to reproduce again.

bcantrill commented 1 year ago

At this point, we believe this issue to be resolved -- and if/as we see this again, a new issue is likely merited.

mkeeter commented 1 year ago

It's back!

See https://github.com/oxidecomputer/hubris/issues/1413; posting here to cross-link the issues.

oxidecomputer / hubris

SP still comes up with self-assigned MAC address after #1171 #1196

1171 doesn't look like a fix for this, it changes the I2C driver to be more tolerant of errors selecting muxes. To avoid coming up with a self-assigned address, the netstack would need to block startup until VPD is available, which could be forever if the bus is locked.