Open mkeeter opened 7 months ago
If you catch a machine in this state, the first thing I'd probably do is to dump the I2C controller state. (This is not recorded in the core file.) The I2C controller is sensitive to when exactly interrupt sources are / are not enabled. The first 7 32-bit registers would suffice.
The humility subcommand would be, I believe, `readmem 0x40005800 -w 28`.
That should produce registers in the order: CR1, CR2, OAR1, OAR2, TIMINGR, TIMEOUTR, ISR. (ICR and PECR come next in the register map, but a 28-byte read stops after ISR.) The contents of ISR (which events are pending) and CR1 (which interrupts are enabled) will be the interesting bit; I included the others just in case.
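To make the seven words easier to read, here is a minimal Python sketch that maps each register name to its address and names the set bits in ISR and CR1. The bit tables are my assumption from the STM32H7 I2C register layout and should be checked against the reference manual; `register_addresses` and `decode` are hypothetical helpers, not part of Humility. The sample values are the CR1 and ISR words that appear in the panic ringbuf entries later in this thread.

```python
# Register names in address order, starting at the I2C2 base used above.
I2C_BASE = 0x40005800
REGISTERS = ["CR1", "CR2", "OAR1", "OAR2", "TIMINGR", "TIMEOUTR", "ISR"]

# Bit positions assumed from the STM32H7 I2C register layout; verify them
# against the reference manual before trusting the decode.
CR1_BITS = {0: "PE", 1: "TXIE", 2: "RXIE", 3: "ADDRIE", 4: "NACKIE",
            5: "STOPIE", 6: "TCIE", 7: "ERRIE"}
ISR_BITS = {0: "TXE", 1: "TXIS", 2: "RXNE", 3: "ADDR", 4: "NACKF",
            5: "STOPF", 6: "TC", 7: "TCR", 8: "BERR", 9: "ARLO",
            10: "OVR", 12: "TIMEOUT", 15: "BUSY"}


def register_addresses():
    """Map each register name to its address (registers are 4 bytes apart)."""
    return {name: I2C_BASE + 4 * i for i, name in enumerate(REGISTERS)}


def decode(value, bits):
    """Return the names of the bits listed in `bits` that are set in `value`."""
    return [name for pos, name in sorted(bits.items()) if value & (1 << pos)]


print(hex(register_addresses()["ISR"]))
print("ISR pending:", decode(0x8021, ISR_BITS))
print("CR1 enabled:", decode(0x1000d7, CR1_BITS))
```

Bits that are not in a table (for example the upper configuration bits of CR1) are simply ignored, so the output only covers the event flags and interrupt enables that matter here.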
Because Humility still doesn't interpret the DWARF line number tables correctly, the line number in the i2c driver could be one of four potential wait points. `addr2line` on the binary would get you an exact line number, specifically `addr2line -e path/to/the/i2c/elf/file -i 0x080a8e1e`. Depending on which wait point the driver is parked at, we should see a different set of ISR/CR1 bits set. We can probably also distinguish the wait point by the other ringbuf, the one from `drv_stm32xx_i2c` instead of `..._server`, based on whether the final entry is `WriteISR`, `WriteWaitISR`, `ReadWaitISR`, none of these or something else (line 664), or (unlikely) `KonamiISR`. So, whichever of those methods is easiest should disambiguate it.
There's another thing that I've been slightly concerned about, which is that the routine where that driver is parked consistently does this:
(ctrl.wfi)(notification);
(ctrl.enable)(notification);
...with the implication that, if control somehow gets there without enabling the interrupt source in the kernel, it'll just die. In practice, all code paths into this routine have crossed a call to `ctrl.enable`, so it hasn't been a bug so far -- but that's the sort of thing that's hard to establish through local reasoning, and might not hold in all cases, or could rot under maintenance.
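To illustrate the concern, here is a toy Python model of a single interrupt source. It is only a sketch under one stated assumption: delivering an interrupt posts a notification to the task and masks the source until the task enables it again. `ToyIrq` and `wait_then_enable` are hypothetical names and do not correspond to the real driver or kernel interfaces.

```python
class ToyIrq:
    """Toy model of one interrupt source as seen by a driver task.

    Assumption: delivery posts a notification and masks the source until
    the task enables it again.
    """

    def __init__(self):
        self.enabled = False    # stands in for the interrupt enable bit
        self.pending = False    # hardware is asserting the interrupt
        self.notified = False   # a notification is waiting for the task

    def enable(self):
        self.enabled = True
        self._deliver()

    def fire(self):
        self.pending = True
        self._deliver()

    def _deliver(self):
        # Only an enabled source reaches the task; delivery masks it again.
        if self.enabled and self.pending:
            self.pending = False
            self.enabled = False
            self.notified = True

    def wfi(self):
        """Return True if the task would wake, False if it would block forever."""
        woke = self.notified
        self.notified = False
        return woke


def wait_then_enable(irq):
    """Wait first, re-enable afterwards (the order quoted above)."""
    if not irq.wfi():
        return False  # a real task would still be blocked; enable is never reached
    irq.enable()
    return True


good = ToyIrq()
good.enable()                    # an earlier code path enabled the source
good.fire()
print(wait_then_enable(good))    # the task wakes

bad = ToyIrq()                   # this path never enabled the source
bad.fire()
print(wait_then_enable(bad))     # the task would block
print(bad.pending, bad.enabled)  # interrupt still pending, still masked
```

In the second case the model ends with the interrupt pending but not enabled, which is the combination the NVIC registers on a live machine would let us confirm or rule out.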
The source of truth here is the NVIC's interrupt enable bits, in ISER0-15, which you can extract from a live machine (not a dump!) by running `readmem 0xE000E100 -w 16`. Or for I2C2's interrupts specifically, `readmem 0xE000E104 -w 4`; bits 1 and 2 (in the sense of `2**1` and `2**2`) in that word are the enable bits. They should both be set. If they are not set, we've made it to the `wfi` point without crossing `enable`.
While we're collecting interrupt controller registers, if the bits in ISER1 are clear, we should also capture ISPR1 at `0xE000E204`. That will show whether the hardware is trying to produce an interrupt that we're not recognizing.
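The addresses and bit positions above follow from the usual NVIC layout of 32 interrupts per register. A minimal Python sketch of that arithmetic follows; the IRQ numbers 33 and 34 for the I2C2 event and error interrupts are my assumption (they are consistent with bits 1 and 2 of ISER1), and the helper names are hypothetical.

```python
NVIC_ISER_BASE = 0xE000E100  # interrupt set-enable registers
NVIC_ISPR_BASE = 0xE000E200  # interrupt set-pending registers

# Assumed IRQ numbers for I2C2; check them against the part's vector table.
I2C2_EV_IRQ = 33
I2C2_ER_IRQ = 34


def nvic_location(irq, base):
    """Return (register address, bit index) for `irq` in a 32-bit-per-register bank."""
    return base + 4 * (irq // 32), irq % 32


def irq_bit_set(word, irq):
    """True if the bit for `irq` is set in the 32-bit word that covers it."""
    return bool(word & (1 << (irq % 32)))


def i2c2_enabled(iser1_word):
    """True only if both I2C2 interrupts are enabled in the ISER1 word."""
    return (irq_bit_set(iser1_word, I2C2_EV_IRQ)
            and irq_bit_set(iser1_word, I2C2_ER_IRQ))


addr, bit = nvic_location(I2C2_EV_IRQ, NVIC_ISER_BASE)
print(hex(addr), bit)
addr, bit = nvic_location(I2C2_ER_IRQ, NVIC_ISPR_BASE)
print(hex(addr), bit)
print(i2c2_enabled(0x6), i2c2_enabled(0x2))
```

The word passed to `i2c2_enabled` would be whatever the 4-byte read of ISER1 returns; the same `irq_bit_set` check applies to the ISPR1 word.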
To summarize,
humility readmem -w 0x40005800 28
humility readmem 0xE000E104 -w 4
humility readmem 0xE000E204 -w 4
I have addressed this in #1657. If/when we see this again, we will get an `i2c_driver` dump that we can then extract. I induced this manually by forcing the I2C timeout functionality off (very much not the default!), trimming the timeout a bit, and then initiating a transaction that is known to result in a target misbehaving (thank you, BMR491!). The resulting dump looks like this:
$ humility -d ./hubris.core.i2c_driver.0 ringbuf
humility: attached to dump
humility: ring buffer drv_stm32xx_i2c::__RINGBUF in i2c_driver:
NDX LINE GEN COUNT PAYLOAD
38 686 1058 2 WriteWait(ISR, 0x8021)
39 686 1058 1 WriteWait(ISR, 0x8061)
40 751 1058 2 Read(ISR, 0x8021)
41 751 1058 1 Read(ISR, 0x8025)
42 751 1058 2 Read(ISR, 0x8021)
43 751 1058 1 Read(ISR, 0x8025)
44 794 1058 1 ReadWait(ISR, 0x8061)
45 570 1058 1 Wait(ISR, 0x21)
46 656 1058 1 Write(ISR, 0x21)
47 656 1058 2 Write(ISR, 0x8021)
0 656 1059 1 Write(ISR, 0x8023)
1 686 1059 1 WriteWait(ISR, 0x8020)
2 686 1059 2 WriteWait(ISR, 0x8021)
3 686 1059 1 WriteWait(ISR, 0x8061)
4 751 1059 2 Read(ISR, 0x8021)
5 751 1059 1 Read(ISR, 0x8025)
6 751 1059 2 Read(ISR, 0x8021)
7 751 1059 1 Read(ISR, 0x8025)
8 794 1059 1 ReadWait(ISR, 0x8061)
9 570 1059 1 Wait(ISR, 0x21)
10 656 1059 1 Write(ISR, 0x21)
11 656 1059 2 Write(ISR, 0x8021)
12 656 1059 1 Write(ISR, 0x8023)
13 686 1059 1 WriteWait(ISR, 0x8020)
14 686 1059 2 WriteWait(ISR, 0x8021)
15 686 1059 1 WriteWait(ISR, 0x8061)
16 751 1059 2 Read(ISR, 0x8021)
17 751 1059 1 Read(ISR, 0x8025)
18 751 1059 2 Read(ISR, 0x8021)
19 751 1059 1 Read(ISR, 0x8025)
20 794 1059 1 ReadWait(ISR, 0x8061)
21 570 1059 1 Wait(ISR, 0x21)
22 656 1059 1 Write(ISR, 0x21)
23 656 1059 2 Write(ISR, 0x8021)
24 656 1059 1 Write(ISR, 0x8023)
25 686 1059 1 WriteWait(ISR, 0x8020)
26 686 1059 2 WriteWait(ISR, 0x8021)
27 686 1059 1 WriteWait(ISR, 0x8061)
28 751 1059 2 Read(ISR, 0x8021)
29 545 1059 1 LostInterrupt
30 508 1059 1 Panic(CR1, 0x1000d7)
31 509 1059 1 Panic(CR2, 0x124ce)
32 510 1059 1 Panic(OAR1, 0x0)
33 511 1059 1 Panic(OAR2, 0x0)
34 512 1059 1 Panic(TIMINGR, 0x3060767f)
35 513 1059 1 Panic(TIMEOUTR, 0x0)
36 514 1059 1 Panic(ISR, 0x8021)
37 515 1059 1 Panic(PECR, 0x0)
This doesn't capture the NVIC state that @cbiffle suggested (that's privileged state and there isn't a kernel interface to pull it), but this will give us much more information if/when we see this again. And, it should be said: because it will restart the `i2c_driver`, it will very likely allow the system to recover (and if it does not, that will be another interesting data point!).
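If we end up with several of these dumps, it may help to pull the interesting entries out of the ringbuf text mechanically. Here is a minimal Python sketch that parses lines in the `NDX LINE GEN COUNT PAYLOAD` layout shown above, collects the `Panic(REG, value)` entries into a dictionary, and reports the entry printed just before `LostInterrupt`. It assumes the text is in the order Humility printed it; the function names are hypothetical, and the sample is a few lines copied from the output above.

```python
import re

# Matches the columns of the ringbuf listing: NDX LINE GEN COUNT PAYLOAD.
ENTRY = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")
PANIC = re.compile(r"^Panic\((\w+), (0x[0-9a-fA-F]+)\)$")


def parse_ringbuf(text):
    """Return entries as dicts, skipping the header and status lines."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            ndx, src_line, gen, count, payload = m.groups()
            entries.append({"ndx": int(ndx), "line": int(src_line),
                            "gen": int(gen), "count": int(count),
                            "payload": payload})
    return entries


def panic_registers(entries):
    """Collect Panic(REG, value) payloads into a name -> int mapping."""
    regs = {}
    for entry in entries:
        m = PANIC.match(entry["payload"])
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs


def last_before(entries, marker="LostInterrupt"):
    """Payload of the entry printed just before `marker`, or None."""
    for prev, cur in zip(entries, entries[1:]):
        if cur["payload"] == marker:
            return prev["payload"]
    return None


sample = """\
NDX LINE GEN COUNT PAYLOAD
28 751 1059 2 Read(ISR, 0x8021)
29 545 1059 1 LostInterrupt
30 508 1059 1 Panic(CR1, 0x1000d7)
36 514 1059 1 Panic(ISR, 0x8021)
"""

entries = parse_ringbuf(sample)
print(last_before(entries))
print({name: hex(value) for name, value in panic_registers(entries).items()})
```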
We could graft on a kernel interface for collecting it, I suppose. I think the state is important to pull, because if we really do have a code path that hits `wfi` without `enable` -- for example -- we're going to have a hard time discovering it without that state.
PR #1659 adds the syscall for tasks to read their IRQ states, and #1666 ( :metal: ) adds it to the state we dump on panics.
Dumps are in `/staff/core/rack2/BRM42220026/20240226/hubris.core.{0, 1}`
@leftwo noticed the fans spinning up on this Gimlet. It looks like many tasks are hung waiting for the `i2c_driver`.
The `i2c_driver` is just chilling out, waiting for a notification:
Taking two dumps, there's no change to `i2c_driver`'s ringbufs, so it's seemingly not making forward progress.
The ringbufs indicate that it's Very Mad:
As far as I can tell, the only mux at address 0x73 is the M.2 mux:
`I2C2, PortIndex(0x0)` matches the M.2 bus in the manifest: