oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0
2.96k stars 167 forks source link

Possible `i2c_driver` hang on BRM42220026 #1631

Open mkeeter opened 7 months ago

mkeeter commented 7 months ago

Dumps are in /staff/core/rack2/BRM42220026/20240226/hubris.core.{0, 1}

@leftwo noticed the fans spinning up on this Gimlet. It looks like many tasks are hung waiting for the i2c_driver.

system time = 1207776588
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+12)
 1 net                          0   5 recv, notif: eth-irq(irq61) wake-timer(T+62)
 2 sys                          0   1 recv
 3 spi2_driver                  0   3 recv
 4 i2c_driver                   0   3 notif: i2c2-irq(irq33/irq34)
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                      0   5 wait: send to i2c_driver/gen0
 8 power                        0   6 wait: send to gimlet_seq/gen0
 9 hiffy                        0   5 notif: bit31(T+171)
10 gimlet_seq                   0   4 wait: send to i2c_driver/gen0
11 hash_driver                  0   2 recv
12 hf                           0   3 recv
13 update_server                0   3 recv
14 sensor                       0   4 recv, notif: timer(T+419)
15 host_sp_comms                0   7 wait: reply from gimlet_seq/gen0
16 udpecho                      0   6 notif: socket
17 udpbroadcast                 0   6 notif: bit31(T+57)
18 control_plane_agent          0   6 wait: reply from validate/gen0
19 sprot                        0   4 notif: bit31(T+1)
20 validate                     0   5 wait: reply from i2c_driver/gen0
21 vpd                          0   4 recv
22 user_leds                    0   2 recv, notif: timer
23 dump_agent                   0   6 wait: reply from sprot/gen0
24 sbrmi                        0   4 recv
25 idle                         0   8 RUNNING

The i2c_driver is just chilling out, waiting for a notification:

matt@jeeves ~ () $ h -d /staff/core/rack2/BRM42220026/20240226/hubris.core.0 tasks -sl i2c_driver
humility: attached to dump
system time = 1207776588
ID TASK                       GEN PRI STATE
 4 i2c_driver                   0   3 notif: i2c2-irq(irq33/irq34)
   |
   +--->  0x2403b200 0x080aa9c0 userlib::sys_recv_stub
                     @ /hubris/sys/userlib/src/lib.rs:332
          0x2403b220 0x080a809c core::ops::function::FnOnce::call_once
                     @ /rustc/95a3a7277b44bbd2dd3485703d9a05f64652b60e/library/core/src/ops/function.rs:251
          0x2403b380 0x080a8e1e drv_stm32xx_i2c::I2cController::write_read
                     @ /hubris/drv/stm32xx-i2c/src/lib.rs:534
          0x2403b380 0x080a8a2a drv_stm32xx_i2c_server::main::{{closure}}
                     @ /hubris/drv/stm32xx-i2c-server/src/main.rs:358
          0x2403b380 0x080a8a2a userlib::hl::recv_without_notification::{{closure}}
                     @ /hubris/sys/userlib/src/hl.rs:130
          0x2403b380 0x080a89cc userlib::hl::recv
                     @ /hubris/sys/userlib/src/hl.rs:83
          0x2403b380 0x080a89cc userlib::hl::recv_without_notification
                     @ /hubris/sys/userlib/src/hl.rs:123
          0x2403b380 0x080a8ef4 main
                     @ /hubris/drv/stm32xx-i2c-server/src/main.rs:322

Taking two dumps, there's no change to i2c_driver's ringbufs, so it's seemingly not making forward progress.

The ringbufs indicate that it's Very Mad:

 NDX LINE      GEN    COUNT PAYLOAD
  ... a few dozen lines elided...
 144  262      836        1 ResetMux(0x73)
 145  392      836        1 MuxError(BusLockedMux)
 146  252      836        1 Reset((I2C2, PortIndex(0x0)))
 147  262      836        1 ResetMux(0x73)
 148  392      836        1 MuxError(BusLockedMux)
 149  252      836        1 Reset((I2C2, PortIndex(0x0)))
 150  262      836        1 ResetMux(0x73)
 151  190      836        1 MuxUnknownRecover((I2C2, PortIndex(0x0)))
 152  392      836        1 MuxError(BusLockedMux)
 153  252      836        1 Reset((I2C2, PortIndex(0x0)))
 154  262      836        1 ResetMux(0x73)
 155  392      836        1 MuxError(BusLockedMux)
 156  252      836        1 Reset((I2C2, PortIndex(0x0)))
 157  262      836        1 ResetMux(0x73)
 158  392      836        1 MuxError(BusLockedMux)
 159  252      836        1 Reset((I2C2, PortIndex(0x0)))
 160  262      836        1 ResetMux(0x73)
 161  190      836        1 MuxUnknownRecover((I2C2, PortIndex(0x0)))

As far as I can tell, the only mux at address 0x73 is the M.2 mux:

 [[config.i2c.devices]]
bus = "m2"
address = 0x73
device = "pca9545"
description = "M.2 mux"
refdes = "U422"

I2C2, PortIndex(0x0) matches the M.2 bus in the manifest:

#
# I2C2: Front/M.2 bus
#
[[config.i2c.controllers]]
controller = 2

#
# SMBUS_SP_TO_M2_SMCLK_A2_V3P3
# SMBUS_SP_TO_M2_SMDAT_A2_V3P3
#
[config.i2c.controllers.ports.B]
name = "m2"
description = "M.2 bus"
scl.pin = 10
sda.pin = 11
af = 4
cbiffle commented 6 months ago

If you catch a machine in this state, the first thing I'd probably do is to dump the I2C controller state. (This is not recorded in the core file.) The I2C controller is sensitive to when exactly interrupt sources are / are not enabled. The first 7 32-bit registers would suffice.

The humility subcommand would be, I believe, readmem 0x40005800 -w 28

That should produce registers in the order: CR1, CR2, OAR1, OAR2, TIMINGR, TIMEOUTR, ISR, ICR, PECR. The contents of ISR (which events are pending) and CR1 (which interrupts are enabled) will be the interesting bit, I included the others just in case.

Because Humility still doesn't interpret the DWARF line number tables correctly, the line number in the i2c driver could be one of four potential wait points. addr2line on the binary would get you an exact line number, specifically addr2line -e path/to/the/i2c/elf/file -i 0x0x080a8e1e. Depending on which wait point the driver is parked at, we should see a different set of ISR/CR1 bits set. We can probably also distinguish the wait point by the other ringbuf, the one from drv_stm32xx_i2c instead of ..._server: based on whether the final entry is WriteISR, WriteWaitISR, ReadWaitISR, none of these or something else (line 664), or (unlikely) KonamiISR. So, whichever of those methods is easiest should disambiguate it.

cbiffle commented 6 months ago

There's another thing that I've been slightly concerned about, which is that the routine where that driver is parked consistently does this:

                (ctrl.wfi)(notification);
                (ctrl.enable)(notification);

...with the implication that, if control somehow gets there without enabling the interrupt source in the kernel, it'll just die. So far all code paths into this routine in practice have crossed a call to ctrl.enable, so it hasn't been a bug so far -- but that's the sort of thing that's hard to establish through local reasoning, and might not hold in all cases, or could rot under maintenance.

The source of truth here is the NVIC's interrupt enable bits, in ISER0-15, which you can extract from a live machine (not a dump!) by running readmem 0xE000E100 -w 16. Or for I2C2's interrupts specifically, readmem 0xE000E104 -w 4; bits 1 and 2 (in the sense of 2**1 and 2**2) in that word are the enable bits. They should both be set. If they are not set, we've made it to the wfi point without crossing enable.

cbiffle commented 6 months ago

While we're collecting interrupt controller registers, if the bits in ISER1 are clear, we should also capture ISPR1 at 0xE000E204. That will show whether the hardware is trying to produce an interrupt that we're not recognizing.

cbiffle commented 6 months ago

To summarize,

humility readmem -w 0x40005800 28
humility readmem 0xE000E104 -w 4
humility readmem 0xE000E204 -w 4
bcantrill commented 6 months ago

I have addressed this in #1657. If/when we see this again, we will get an i2c_driver dump that we can then extract. I induced this manually by forcing the I2C timeout functionality off (very much not the default!), trimming the timeout a bit, and then initiating a transaction that is known to result in a target misbehaving (thank you, BMR491!). The resulting dump looks like this:

$ humility -d ./hubris.core.i2c_driver.0 ringbuf
humility: attached to dump
humility: ring buffer drv_stm32xx_i2c::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
  38  686     1058        2 WriteWait(ISR, 0x8021)
  39  686     1058        1 WriteWait(ISR, 0x8061)
  40  751     1058        2 Read(ISR, 0x8021)
  41  751     1058        1 Read(ISR, 0x8025)
  42  751     1058        2 Read(ISR, 0x8021)
  43  751     1058        1 Read(ISR, 0x8025)
  44  794     1058        1 ReadWait(ISR, 0x8061)
  45  570     1058        1 Wait(ISR, 0x21)
  46  656     1058        1 Write(ISR, 0x21)
  47  656     1058        2 Write(ISR, 0x8021)
   0  656     1059        1 Write(ISR, 0x8023)
   1  686     1059        1 WriteWait(ISR, 0x8020)
   2  686     1059        2 WriteWait(ISR, 0x8021)
   3  686     1059        1 WriteWait(ISR, 0x8061)
   4  751     1059        2 Read(ISR, 0x8021)
   5  751     1059        1 Read(ISR, 0x8025)
   6  751     1059        2 Read(ISR, 0x8021)
   7  751     1059        1 Read(ISR, 0x8025)
   8  794     1059        1 ReadWait(ISR, 0x8061)
   9  570     1059        1 Wait(ISR, 0x21)
  10  656     1059        1 Write(ISR, 0x21)
  11  656     1059        2 Write(ISR, 0x8021)
  12  656     1059        1 Write(ISR, 0x8023)
  13  686     1059        1 WriteWait(ISR, 0x8020)
  14  686     1059        2 WriteWait(ISR, 0x8021)
  15  686     1059        1 WriteWait(ISR, 0x8061)
  16  751     1059        2 Read(ISR, 0x8021)
  17  751     1059        1 Read(ISR, 0x8025)
  18  751     1059        2 Read(ISR, 0x8021)
  19  751     1059        1 Read(ISR, 0x8025)
  20  794     1059        1 ReadWait(ISR, 0x8061)
  21  570     1059        1 Wait(ISR, 0x21)
  22  656     1059        1 Write(ISR, 0x21)
  23  656     1059        2 Write(ISR, 0x8021)
  24  656     1059        1 Write(ISR, 0x8023)
  25  686     1059        1 WriteWait(ISR, 0x8020)
  26  686     1059        2 WriteWait(ISR, 0x8021)
  27  686     1059        1 WriteWait(ISR, 0x8061)
  28  751     1059        2 Read(ISR, 0x8021)
  29  545     1059        1 LostInterrupt
  30  508     1059        1 Panic(CR1, 0x1000d7)
  31  509     1059        1 Panic(CR2, 0x124ce)
  32  510     1059        1 Panic(OAR1, 0x0)
  33  511     1059        1 Panic(OAR2, 0x0)
  34  512     1059        1 Panic(TIMINGR, 0x3060767f)
  35  513     1059        1 Panic(TIMEOUTR, 0x0)
  36  514     1059        1 Panic(ISR, 0x8021)
  37  515     1059        1 Panic(PECR, 0x0)

This doesn't capture the NVIC state that @cbiffle suggested (that's privileged state and there isn't a kernel interface to pull it), but this will give us much more information if/when we see this again. And, it should be said: because it will restart the i2c_driver, it will very likely allow the system to recover (and if it does not, that will be another interesting data point!).

cbiffle commented 6 months ago

We could graft on a kernel interface for collecting it, I suppose. I think the state is important to pull, because if we really do have a code path that hits wfi without enable -- for example -- we're going to have a hard time discovering it without it.

hawkw commented 6 months ago

PR #1659 adds the syscall for tasks to read their IRQ states, and #1666 ( :metal: ) adds it to the state we dump on panics.