Closed mkeeter closed 1 year ago
On a 1.0.3 system, via the jumphost:
support@oxz_switch:/tmp$ ./humility --archive gimlet_sp-gimlet-e-1.0.3.tar.gz -i fe80::aa40:25ff:fe04:5c1%gimlet10 dump --task=sprot --initialize-dump-agent
humility: connecting to fe80::aa40:25ff:fe04:5c1%gimlet10
humility: using UDP dump agent
humility dump failed: invalid response: Err(DumpAreaInUse)
Seeing the same thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/matt/humility/humility-dump-agent/src/lib.rs:536:35
error on Rack 3. Attempting to dump a single task (as Alan did above) also fails. Single task dumps do work though by running humility -a <archive> --ip <sled> dump --initialize-dump-agent
on its own first, followed by the task request.
As with many things, there is a lot going on here.
First, I don't think this is what is happening on Rack 3, but the attest
task in the RoT is panic looping:
$ humility tasks
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
system time = 36470
ID TASK GEN PRI STATE
0 jefe 0 0 recv, notif: fault timer(T+30)
1 hiffy 0 6 ready
2 idle 0 7 ready
3 update_server 0 3 recv
4 syscon_driver 0 2 recv
5 gpio_driver 0 3 recv
6 sprot 0 6 ready
7 swd 0 4 recv
8 dumper 0 5 recv
9 sp_measure 0 6 not started
10 attest 14621 5 RUNNING
This is due to a stack overflow:
$ humility jefe -H attest
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
humility: successfully changed disposition for attest
$ humility tasks attest
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
system time = 205098
ID TASK GEN PRI STATE
10 attest 79417 5 FAULT: stack overflow; sp=0x20008b18 (was: ready)
Making the stack size much larger and then running humility stackmargin
reveals that the current size (9304) is pretty far off the maximum depth:
$ humility stackmargin
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
ID TASK STACKBASE STACKSIZE MAXDEPTH MARGIN
0 jefe 0x20006680 1536 296 1240
1 hiffy 0x20008dc0 2048 616 1432
2 idle 0x20005c80 256 256 0
3 update_server 0x20007700 2560 1664 896
4 syscon_driver 0x20005d80 1024 192 832
5 gpio_driver 0x20006200 1024 256 768
6 sprot 0x2000b680 16384 6704 9680
7 swd 0x20005000 1000 328 672
8 dumper 0x20008180 2600 1072 1528
9 sp_measure 0x20006ce0 2048 104 1944
10 attest 0x20010420 19304 9984 9320
Alleviating this problem and then taking the dump reveals the reason for the failure:
$ humility ringbuf dumper
humility: attached to 1fc9:0143:PVQ3SGS11OJT3 via CMSIS-DAP
humility: ring buffer task_dumper::__RINGBUF in dumper:
NDX LINE GEN COUNT PAYLOAD
0 125 60 1 Writing(0x30020fc0, 0x8, 0xea8)
1 99 60 1 ReadingRegister(0x0)
2 125 60 1 Writing(0x30020fc8, 0x8, 0xeb0)
3 99 60 1 ReadingRegister(0x0)
4 125 60 1 Writing(0x30020fd0, 0x8, 0xeb8)
5 99 60 1 ReadingRegister(0x0)
6 125 60 1 Writing(0x30020fd8, 0x8, 0xec0)
7 99 60 1 ReadingRegister(0x0)
8 125 60 1 Writing(0x30020fe0, 0x8, 0xec8)
9 99 60 1 ReadingRegister(0x0)
10 125 60 1 Writing(0x30020fe8, 0x8, 0xed0)
11 99 60 1 ReadingRegister(0x0)
12 125 60 1 Writing(0x30020ff0, 0x8, 0xed8)
13 99 60 1 ReadingRegister(0x0)
14 138 60 1 Done(Err(OutOfRegisterSpace))
15 166 60 1 Resumed
Note that it's attempting to dump register 0 over and over again. This is because the fix for #1532 changed the semantics of the code (which were admittedly a little odd) to no longer increment the register value. That's the root problem but there's another annoyance here: the RoT knew that the dump failed but in fact returned success -- which brought us on the collision course to the issue in Humility that is the actual panic (which should also be modified to bail a tad more cleanly).
So we need to fix the stack size for attest
, restore the correct register dumping semantics, propagate an error to denote a dump failure, and fix humility dump
to bail a little more cleanly. PRs coming!
Sometimes, when we ask the RoT to dump the SP, it doesn't do it.
It's mad because there are no dumps (according to
humility dump --list
):SP-only dumps work fine, so it's the RoT portion of the system that's unhappy. However, it's not returning any errors in the
sprot
channel (or at least, nothing is being reported upstack).This was seen on both a Madrid and customer Gimlet.