oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0
3.01k stars 174 forks source link

Panic while dumping #1558

Closed mkeeter closed 1 year ago

mkeeter commented 1 year ago

Sometimes, when we ask the RoT to dump the SP, it doesn't do it.

andrew@jeeves ~ $ humility -i fe80::aa40:25ff:fe04:386%madrid_tp0 -a ~/build-gimlet-c-image-default-v1.0.2.zip dump
humility: connecting to fe80::aa40:25ff:fe04:386%madrid_tp0
humility: using UDP dump agent
humility: initializing dump agent state
humility: initializing segments
humility: taking dump; target will be stopped for ~20 seconds
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/matt/humility/humility-dump-agent/src/lib.rs:536:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/matt/humility/humility-dump-agent/src/lib.rs:536:35
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
   2: core::panicking::panic_bounds_check
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:162:5
   3: humility_dump_agent::DumpAgentExt::read_dump
             at /home/matt/humility/humility-dump-agent/src/lib.rs:536:35
   4: humility_cmd_dump::dump_via_agent
             at /home/matt/humility/cmd/dump/src/lib.rs:563:16
   5: humility_cmd_dump::dumpcmd
             at /home/matt/humility/cmd/dump/src/lib.rs:764:9
   6: humility::cmd::subcommand::{{closure}}
             at /home/matt/humility/src/cmd.rs:100:17
   7: humility_cmd::attach
             at /home/matt/humility/humility-cmd/src/lib.rs:173:5
   8: humility::cmd::subcommand
             at /home/matt/humility/src/cmd.rs:99:13
   9: humility::main
             at /home/matt/humility/src/main.rs:46:23
  10: core::ops::function::FnOnce::call_once
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:250:5

It's mad because there are no dumps (according to humility dump --list):

humility: using UDP dump agent
AREA TASK                  TIME       SIZE

SP-only dumps work fine, so it's the RoT portion of the system that's unhappy. However, it's not returning any errors in the sprot channel (or at least, nothing is being reported upstack).

This was seen on both a Madrid and customer Gimlet.

leftwo commented 1 year ago

On a 1.0.3 system, via the jumphost:

support@oxz_switch:/tmp$ ./humility --archive gimlet_sp-gimlet-e-1.0.3.tar.gz  -i fe80::aa40:25ff:fe04:5c1%gimlet10 dump --task=sprot --initialize-dump-agent
humility: connecting to fe80::aa40:25ff:fe04:5c1%gimlet10
humility: using UDP dump agent
humility dump failed: invalid response: Err(DumpAreaInUse)
augustuswm commented 1 year ago

Seeing the same thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/matt/humility/humility-dump-agent/src/lib.rs:536:35 error on Rack 3. Attempting to dump a single task (as Alan did above) also fails. Single task dumps do work though by running humility -a <archive> --ip <sled> dump --initialize-dump-agent on its own first, followed by the task request.

bcantrill commented 1 year ago

As with many things, there is a lot going on here.

First, I don't think this is what is happening on Rack 3, but the attest task in the RoT is panic looping:

$ humility tasks
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
system time = 36470
ID TASK                       GEN PRI STATE    
 0 jefe                         0   0 recv, notif: fault timer(T+30)
 1 hiffy                        0   6 ready
 2 idle                         0   7 ready
 3 update_server                0   3 recv
 4 syscon_driver                0   2 recv
 5 gpio_driver                  0   3 recv
 6 sprot                        0   6 ready
 7 swd                          0   4 recv
 8 dumper                       0   5 recv
 9 sp_measure                   0   6 not started
10 attest                   14621   5 RUNNING

This is due to a stack overflow:

$ humility jefe -H attest
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
humility: successfully changed disposition for attest
$ humility tasks attest
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
system time = 205098
ID TASK                       GEN PRI STATE    
10 attest                   79417   5 FAULT: stack overflow; sp=0x20008b18 (was: ready)

Making the stack size much larger and then running humility stackmargin reveals that the current size (9304) is pretty far off the maximum depth:

$ humility stackmargin
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
ID TASK                STACKBASE  STACKSIZE   MAXDEPTH     MARGIN
 0 jefe               0x20006680       1536        296       1240
 1 hiffy              0x20008dc0       2048        616       1432
 2 idle               0x20005c80        256        256          0
 3 update_server      0x20007700       2560       1664        896
 4 syscon_driver      0x20005d80       1024        192        832
 5 gpio_driver        0x20006200       1024        256        768
 6 sprot              0x2000b680      16384       6704       9680
 7 swd                0x20005000       1000        328        672
 8 dumper             0x20008180       2600       1072       1528
 9 sp_measure         0x20006ce0       2048        104       1944
10 attest             0x20010420      19304       9984       9320

Alleviating this problem and then taking the dump reveals the reason for the failure:

$ humility ringbuf dumper
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
humility: ring buffer task_dumper::__RINGBUF in dumper:
 NDX LINE      GEN    COUNT PAYLOAD
   0  125       60        1 Writing(0x30020fc0, 0x8, 0xea8)
   1   99       60        1 ReadingRegister(0x0)
   2  125       60        1 Writing(0x30020fc8, 0x8, 0xeb0)
   3   99       60        1 ReadingRegister(0x0)
   4  125       60        1 Writing(0x30020fd0, 0x8, 0xeb8)
   5   99       60        1 ReadingRegister(0x0)
   6  125       60        1 Writing(0x30020fd8, 0x8, 0xec0)
   7   99       60        1 ReadingRegister(0x0)
   8  125       60        1 Writing(0x30020fe0, 0x8, 0xec8)
   9   99       60        1 ReadingRegister(0x0)
  10  125       60        1 Writing(0x30020fe8, 0x8, 0xed0)
  11   99       60        1 ReadingRegister(0x0)
  12  125       60        1 Writing(0x30020ff0, 0x8, 0xed8)
  13   99       60        1 ReadingRegister(0x0)
  14  138       60        1 Done(Err(OutOfRegisterSpace))
  15  166       60        1 Resumed

Note that it's attempting to dump register 0 over and over again. This is because the fix for #1532 changed the semantics of the code (which were admittedly a little odd) to no longer increment the register value. That's the root problem but there's another annoyance here: the RoT knew that the dump failed but in fact returned success -- which brought us on the collision course to the issue in Humility that is the actual panic (which should also be modified to bail a tad more cleanly).

So we need to fix the stack size for attest, restore the correct register dumping semantics, propagate an error to denote a dump failure, and fix humility dump to bail a little more cleanly. PRs coming!