Closed rmustacc closed 1 year ago
Which in turn kind of suggests to me that a bus lock is happening.
Dumps are on catacomb from 3 different occurrences in /data/staff/core/rack1/
as hubris.core.<sn>.badvpd
. Two spot checked had a similarly suspicious ringbuf entry:
humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
NDX LINE GEN COUNT PAYLOAD
1 162 1 1 Reset(I2C2, PortIndex(0x1))
2 536 1 2 SegmentFailed(BusResetMux)
3 162 1 1 Reset(I2C2, PortIndex(0x1))
4 536 1 1 SegmentFailed(BusResetMux)
5 536 1 1 SegmentFailed(BadMuxAddress)
6 334 1 1 Error
7 162 1 1 Reset(I2C3, PortIndex(0x0))
0 334 2 952 Error
If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in drv_local_vpd
, but these all seem to be empty:
$ for f in *vpd ; do humility -d $f tasks net ; humility -d $f ringbuf drv_local_vpd; done
humility: attached to dump
system time = 1248384
ID TASK GEN PRI STATE
1 net 0 5 recv, notif: eth-irq(irq117) wake-timer(T+413)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 1744761
ID TASK GEN PRI STATE
1 net 0 5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
humility: attached to dump
system time = 742075
ID TASK GEN PRI STATE
1 net 0 5 recv, notif: eth-irq(irq117)
humility: attached to dump
humility: ring buffer drv_local_vpd::__RINGBUF in net:
humility: ring buffer drv_local_vpd::__RINGBUF in control_plane_agent:
Whatever is causing the VPD to be reverting to the default must be happening elsewhere -- and I am wondering if the mux issue is causing us to see the wrong EEPROM? One definite step to take: we should expand the ring buffer in the drv_local_vpd
driver to capture every error path.
(Edit: this was in response to @rmustacc above -- Github decided not to show me Bryan's 11 hour old comment before I posted for some reason.)
That ringbuf Error
entry occurs when the underlying write_read
operation returns an error, but frustratingly it does not indicate what error. Its comments suggest that the line number (334 here) should indicate it, but that line is recording an Error
entry in response to any error code from the next layer down, so it doesn't tell us.
That's probably worth fixing.
If we had seen an I2C-level failure to read the EEPROM, we would have seen a ring buffer entry in
drv_local_vpd
@bcantrill - This appears to be true for only a narrow subset of I2C failures. There are a lot of code paths in drv/local-vpd
that can return Err
without recording anything -- the ringbuf is only updated from the TlvcRead
implementation.
For best assurance, we probably want to record any errors that hit this point in the netstack -- currently they're being discarded by the .ok()
.
This issue is still with us! We saw this in the rack on BRM42220070 (dump):
$ humility -d ./hubris.core.BRM42220070.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
NDX LINE GEN COUNT PAYLOAD
0 50 1 1 LocalVpdError(NoRootChunk)
1 67 1 1 LocalVpdError(NoRootChunk)
2 50 1 1 LocalVpdError(NoRootChunk)
3 67 1 1 LocalVpdError(NoRootChunk)
4 50 1 1 LocalVpdError(NoRootChunk)
5 67 1 1 LocalVpdError(NoRootChunk)
6 50 1 1 LocalVpdError(NoRootChunk)
7 67 1 1 LocalVpdError(NoRootChunk)
8 50 1 1 LocalVpdError(NoRootChunk)
9 67 1 1 LocalVpdError(NoRootChunk)
This is surprising, as it's not a device error, but rather seemingly getting the wrong data from the EEPROM. This eliminates one hypothesis: if we somehow had the wrong mux/segment selected, we would expect to still find a FRU0
tag, and get a LocalVpdError(NoSuchChunk)
. To explore this a bit better, I added slightly better error recording in #1277. I then put BRM42220067 in a reboot loop whereby we reset the SP when we hit A0PlusHP
; after about three hours it hit this condition (
dump):
$ humility -d ./hubris.core.BRM42220067.vpd-again.0 ringbuf drv_packrat_vpd_loader
humility: attached to dump
humility: ring buffer drv_packrat_vpd_loader::__RINGBUF in gimlet_seq:
NDX LINE GEN COUNT PAYLOAD
0 51 1 1 MacLocalVpdError(BadRootChunk)
1 68 1 1 BarcodeLocalVpdError(BadRootChunk)
2 51 1 1 MacLocalVpdError(BadRootChunk)
3 68 1 1 BarcodeLocalVpdError(BadRootChunk)
4 51 1 1 MacLocalVpdError(BadRootChunk)
5 68 1 1 BarcodeLocalVpdError(BadRootChunk)
6 51 1 1 MacLocalVpdError(BadRootChunk)
7 68 1 1 BarcodeLocalVpdError(BadRootChunk)
8 51 1 1 MacLocalVpdError(BadRootChunk)
9 68 1 1 BarcodeLocalVpdError(BadRootChunk)
This is surprising: it means we are getting the wrong data from the device -- and seemingly repeatedly. humility vpd
yielded:
$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
humility vpd failed: bad header: HeaderCorrupt { stored_checksum: db3bc619, computed_checksum: db3bc749 }
Checking a Sharkfin VPD, it seem intact:
$ humility vpd --read --id 0
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
("FRU0", [
("BARC", [
"0XV1:9130000021:001:BRM39220519",
]),
]),
]
After running this, the Gimlet VPD data was retrieved correctly!
$ humility vpd --read --id 10
humility: attached to 0483:374e:003400224741500820383733 via ST-Link V3
[
("FRU0", [
("BARC", [
"0XV1:9130000019:006:BRM42220067",
]),
("MAC0", [
[
168,
64,
37,
4,
2,
5,
8,
0,
8,
],
]),
]),
]
From this point, the condition was seemingly chased away. Restarting gimlet_seq
resulted in a correct read of the VPD.
Something very odd is going on; reopening this issue, and getting a Saleae on BRM42220067 while I attempt to reproduce again.
At this point, we believe this issue to be resolved -- and if/as we see this again, a new issue is likely merited.
It's back!
See https://github.com/oxidecomputer/hubris/issues/1413; posting here to cross-link the issues.
While doing testing with rack 1 (and older RoTs) we've seen this while doing a loop across the fleet of the form:
for f in BRM42220060 BRM42220041 BRM42220020 BRM42220035 BRM42220052 BRM42220087 BRM42220077 BRM42220019 BRM42220029 BRM42220075 BRM42220003 BRM44220002 BRM42220089 BRM42220053 BRM42220012 BRM42220068 BRM42220064 BRM42220048 BRM42220056 BRM42220054 BRM42220007 BRM42220042 BRM42220062 BRM42220076 BRM42220044 BRM42220010 BRM42220005 BRM42220024 BRM42220030 BRM42220061; do pfexec pilot sp exec -e 'update sp 0 /data/local/rack1/gimlet/c/sp.10ms/build-gimlet-c.zip' $f && pfexec pilot sp exec -e 'reset' $f; done
I'm grabbing dumps and will upload them to catacomb, but recording this so we don't lose track. Assigning to MVP milestone right now. Feel free to move at your discretion.