Open mkeeter opened 1 year ago
It's odd that the device is talking on I2C enough to tell us it's an
SFF8636
module, but then has a handful of temperature read errors; the thermal loop is presumably spinning up because we have a new device with an unknown temperature, so it's in the BOOTING / "set fan speed to 100%" state.
Given we're going to always have customers plugging in stange and unexpected devices that we may not be using, is there a way to make sure that until we care about the device it doesn't accidentally cause us to go to 100%? In particular with plugging in a device that doesn't actually have temperature monitoring or just has a different interface for some reason.
The ImplError can be decoded from the front IO register guide: 0x04 = I2cAddressnack, 0x05 = I2cByteNack
This is a bit out of date, I owe some work here to better tie Quartz enums into Hubris which I think is now supported but I have not taken the time to figure out how to use. But the sentiment holds: The FPGA is failing to communicate with the device. 0x04
actually means that we tried to communicate with an uninitialized module, which means ResetL
was either still asserted, or hadn't been de-asserted for t_init
(2 seconds). From the given context of the fans spinning back down, it would seem that we were running into the latter case as the issue resolved itself once the FPGA attempted to communicate with the module. I will take the action to cleanup the enums + docs here to be consistent.
Given we're going to always have customers plugging in stange and unexpected devices that we may not be using, is there a way to make sure that until we care about the device it doesn't accidentally cause us to go to 100%?
The intention here is that there will be two paths when a module is inserted and ResetL
has been deasserted for t_init
:
I agree with the sentiment that we should not spin fans to 100% so easily (if at all?). I'd like to reach some sort of consensus on what to do here. My preference would be to "ignore" the module for the purpose of fan speed (no longer spin up if we can't control it), which would mean we wouldn't start revving up the fans for the 2 second t_init
interval just to let them slow back down again.
12 294 5 1 UnpluggedModule(0x4)
13 589 5 1 ModulePresenceUpdate(0xe017e017)
14 277 5 1 GetInterfaceError(0x4, ImplError(0x5))
15 238 5 1 GotInterface(0x4, Sff8636)
0 345 6 4 TemperatureReadError(0x4, ImplError(0x4))
ImplError(0x4)
= NotInitialized
ImplError(0x5)
= I2cAddressNack
It is strange here that we see GetInterfaceError(0x4, ImplError(0x5))
before a TemperatureReadError(0x4, ImplError(0x4))
as I don't expect us to get I2C errors prior to a module being initialized since we do not attempt I2C on uninitialized modules.
Additionally, seeing TemperatureReadError
after we successfully GotInterface
is unexpected.
The work @Aaron-Hartwig mentioned to turn off unsupported modules is currently being done in https://github.com/oxidecomputer/dendrite/pull/224. See this code for the current implementation.
I have a FPGA mitigation here which fixes a bug around "module initialization". The FPGA considers a module initialized after it has had ModResetL
deasserted for t_init
(2 seconds). Until a module has been initialized, the FPGA will block attempts to initiate I2C transactions and place a NotInitialized
(0x4
) error into the PORT_STATUS::Error
field. However, it was not taking ModPrsL
into account here, so removing a module deasserts ModPrsL
but ModResetL
remains asserted, resulting in the FPGA not enforcing the initialization period. That bug explains why we could see the ImplError(0x5)
(I2cAddressNack
) above. What I do not yet understand is how we then see subsequent ImplError(0x4)
(NotInitialized
) as that would imply the module had then be uninitialized. That could only have happened if ModResetL
was asserted again, but that would've resulted in another (perhaps misleadingly named) UnpluggedModule(0x4)
message in the ringbuf.
I don't have a complete answer here yet, but I do have a fix for an obvious bug, so I'm going to deploy that. I did do a small and perhaps not entirely representative test with the fix of having Alan pull module 4 again and reinsert it, and I didn't see any ImplError(0x5)
.
aaron@lurch ~ $ pfexec humility -t sidecar-sp -a /home/aaron/build-sidecar-b-image-default.zip ringbuf transceivers
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:0028001E4741500720383733 via ST-Link V3
humility: ring buffer drv_transceivers_server::__RINGBUF in transceivers:
NDX LINE GEN COUNT PAYLOAD
0 626 1 1 FrontIOReady(true)
1 666 1 1 LEDInit
2 118 1 1 LEDInitComplete
3 604 1 1 ModulePresenceUpdate(LogicalPortMask(0xe077e077))
4 293 1 4 GetInterfaceError(0x4, ImplError(0x4))
5 254 1 1 GotInterface(0x4, Sff8636)
6 310 1 1 UnpluggedModule(0x4)
7 293 1 4 GetInterfaceError(0x4, ImplError(0x4))
8 254 1 1 GotInterface(0x4, Sff8636)
9 604 1 1 ModulePresenceUpdate(LogicalPortMask(0xe077e067))
10 310 1 1 UnpluggedModule(0x4)
11 604 1 1 ModulePresenceUpdate(LogicalPortMask(0xe077e077))
12 293 1 4 GetInterfaceError(0x4, ImplError(0x4))
13 254 1 1 GotInterface(0x4, Sff8636)
Given that this is a recoverable issue I'm voting to leave this open for now until I can devote the time and equipment to figuring out how to reproduce it. I have more pressing things I think I should work on in the meantime.
I still would like the fans to not ramp aggressively in the event of GetInterfaceError
and UnknownInterface
.
On the thermal
side on the SP, we may want to introduce some notion of the time an entity comes online, and how long we'll tolerate it being flaky before we treat it as a potential thermal excursion. I can definitely imagine ways you could implement a transciever such that it could respond to ident requests over I2C immediately, but take a bit to read a temperature -- as a crude example, if you were using any of the TI temperature sensors internally, they do internal averaging and may not read as valid for a short period after poweron.
We'd need to make sure we can reliably detect a thing being plugged/unplugged (for FRUs anyway) so that we don't treat flake as an unplug event and tolerate arbitrary weirdness.
DataNotReady
bit it mentions not having implemented.
Quoth @rmustacc,
He unplugged + replugged transceiver 4, which we see in the ringbuf:
The
ImplError
can be decoded from the front IO register guide:It's odd that the device is talking on I2C enough to tell us it's an
SFF8636
module, but then has a handful of temperature read errors; the thermal loop is presumably spinning up because we have a new device with an unknown temperature, so it's in the BOOTING / "set fan speed to 100%" state.