oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0
2.94k stars 167 forks source link

dump-agent gets stuck in send to net #1797

Open hawkw opened 3 months ago

hawkw commented 3 months ago

On a system with a whole lot of dumps (e.g. the gimlet from #1779, which has a thermal task that's crashed 7483 times), running humility dump -l can time out. When this occurs, I've seen the hiffy task left in a bad state where no other commands that use hiffy can succeed.

For instance:

# running a humility i2c command, which uses hiffy, works fine
eliza@niles ~ $ pfexec humility -t gimlet-c i2c -b front --scan
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3

Device scan on controller I2C2:

    R = Reserved   - = No device   \o/ = Device found   X = Timed out

ADDR     0x0 0x1 0x2 0x3 0x4 0x5 0x6 0x7 0x8 0x9 0xa 0xb 0xc 0xd 0xe 0xf
0x00       R   R   R   R   R   R   R   R   -   -   -   -   -   -   -   -
0x10       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
0x20       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
0x30       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
0x40       -   -   -   -   -   -   -   - \o/   -   -   -   -   -   -   -
0x50     \o/ \o/ \o/ \o/   -   -   -   - \o/ \o/ \o/ \o/   -   -   -   -
0x60       -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -
0x70       -   -   -   -   -   -   -   -   -   -   -   -   R   R   R   R

# now, i try listing dumps, which times out
eliza@niles ~ $ pfexec humility -t gimlet-c dump -l
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: using hiffy dump agent
AREA TASK                  TIME       SIZE
humility dump failed: operation timed out

# running the same humility i2c command now fails
eliza@niles ~ $ pfexec humility -t gimlet-c i2c -b front --scan
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility i2c failed: HIF execution facility unavailable

# note that hiffy is stuck in recv from dump-agent, which is in send to net
eliza@niles ~ $ pfexec humility -t gimlet-c tasks
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
system time = 4215901
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+101)
 1 net                          0   5 notif: jefe-state-change
 2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 3 spi2_driver                  0   3 recv(gimlet_seq/gen0 only)
 4 i2c_driver                   0   3 recv
 5 spd                          0   2 notif: jefe-state-change
 6 packrat                      0   1 recv
 7 thermal                   7483   5 FAULT: panicked at task/thermal/src/bsp/gimlet_bcdef.rs:162:28:
explicit panic (was: ready)
 8 power                        0   6 wait: send to gimlet_seq/gen0
 9 hiffy                        1   5 wait: send to dump_agent/gen0
10 gimlet_seq                   0   4 notif: bit31(T+3)
11 gimlet_inspector             0   6 wait: send to net/gen0
12 hash_driver                  0   2 recv
13 hf                           0   3 notif: bit31(T+330)
14 update_server                0   3 recv
15 sensor                       0   4 recv
16 host_sp_comms                0   7 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
17 udpecho                      0   6 wait: send to net/gen0
18 udpbroadcast                 0   6 wait: send to net/gen0
19 control_plane_agent          0   6 wait: send to net/gen0
20 sprot                        0   4 recv
21 validate                     0   5 recv
22 vpd                          0   4 recv
23 user_leds                    0   2 recv, notif: timer
24 dump_agent                   0   6 wait: send to net/gen0
25 sbrmi                        0   4 recv
26 idle                         0   8 RUNNING
27 udprpc                       0   6 wait: send to net/gen0

manually restarting the hiffy task using humility jefe -fault hiffy and humility jefe --release hiffy, seems to unstick it, until the next time that you run humility dump -l.

hawkw commented 3 months ago

not just dump -l, looks like any attempt to use the dump-agent on this system times out, since it's stuck talking to net.

eliza@niles ~ $ pfexec humility -t gimlet-c jefe -f hiffy
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: successfully changed disposition for hiffy
eliza@niles ~ $ pfexec humility -t gimlet-c jefe -r hiffy
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: successfully changed disposition for hiffy
eliza@niles ~ $ pfexec humility -t gimlet-c dump --all
humility dump failed: Probe could not be created

Caused by:
    Probe was not found.
eliza@niles ~ $ pfexec humility -t gimlet-c dump --all
humility: attached to 0483:3754:000D00344741500820383733 via ST-Link V3
humility: using hiffy dump agent
humility dump failed: operation timed out
eliza@niles ~ $
cbiffle commented 3 months ago

Here's the core, for posterity.

hubris.core.niles-sad.0.gz

The power sequencer is hung attempting to program the FPGA, which isn't responding by pulling CDONE low. This is blocking the net task from making progress, because it can't interact with its devices until power sequencing is complete. The dump-agent in turn is patiently waiting for that, for better or worse.

Here's the sequencer log:

humility: ring buffer drv_gimlet_seq_server::__RINGBUF in gimlet_seq:
   TOTAL VARIANT
  992167 Programming
       1 Ice40PowerGoodV3P3(true)
       1 IdentValid(false)
       1 ChecksumValid(false)
       1 Reprogram(true)
       1 Ice40Rails
       1 Ice40PowerGoodV1P2(true)
 NDX LINE      GEN    COUNT PAYLOAD
   0  264        1        1 Ice40Rails(true, true)
   1  293        1        1 Ice40PowerGoodV1P2(true)
   2  314        1        1 Ice40PowerGoodV3P3(true)
   3  356        1        1 IdentValid(false)
   4  359        1        1 ChecksumValid(false)
   5  362        1        1 Reprogram(true)
   6  376        1    65535 Programming
   7  376        1    65535 Programming
   8  376        1    65535 Programming
   9  376        1    65535 Programming
  10  376        1    65535 Programming
  11  376        1    65535 Programming
  12  376        1    65535 Programming
  13  376        1    65535 Programming
  14  376        1    65535 Programming
  15  376        1    65535 Programming
  16  376        1    65535 Programming
  17  376        1    65535 Programming
  18  376        1    65535 Programming
  19  376        1    65535 Programming
  20  376        1    65535 Programming
  21  376        1     9142 Programming

And based on its interactions with the SPI driver, it's locking and unlocking without sending data:

 fn Spi::lock() .................................................. 992167 calls
    clients:
    task gimlet_seq (0 restarts) ....................... = 0 ...... = 992167 ok

 fn Spi::release() ............................................... 992166 calls
    clients:
    task gimlet_seq (0 restarts) ....................... = 0 ...... = 992166 ok

 fn Spi::exchange() ................................................... 2 calls
    clients:
    task gimlet_seq (0 restarts) ....................... = 0 ........... = 2 ok

...which means, on review of the code, that it's failing this check:

    // At this point, the iCE40 is _supposed_ to be chilling in programming mode
    // listening for a bitstream. If this is the case it will be asserting
    // (holding low) CDONE. Let's check!
    if sys.gpio_read(config.cdone) != 0 {
        // Welp, that sure didn't work.
        return Err(Ice40Error::ChipNotListening);
    }
cbiffle commented 3 months ago

...so it turns out somebody flashed a Gimlet SP image onto a PSC SP board, and that's what we're observing here. The PSC doesn't have a sequencer FPGA, so this is getting stuck waiting for an acknowledge from a missing chip. That part is working roughly as expected.

We don't currently have a consistent "board ID pins" scheme between boards that could catch this problem.

cbiffle commented 3 months ago

On reflection, this machine will have been asserting the SP boot failure ignition fault pin, which is working basically as designed. FWIW.