oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.

RoT reports timeout if not power-cycled after update #1451

Closed mkeeter closed 1 year ago

mkeeter commented 1 year ago

After using wicket to update an RoT (from slot A to slot B), we observe that the RoT reports a timeout when attempting to read its status. After using ignition to power-cycle the whole board, it properly comes up as running from slot B.

mkeeter commented 1 year ago

I can't reproduce this using faux-mgs on the benchtop.

faux-mgs update rot 1 /Users/mjk/oxide/hubris/target/rot-carrier/dist/b/build-rot-carrier-image-b.zip
faux-mgs state # works fine
faux-mgs component-active-slot rot --set 1 --persist
faux-mgs state # still works fine, shows B as pending persistent boot preference
faux-mgs reset-component rot
faux-mgs state # shows B as active and persistent boot preference

I'm going to look into what wicket is actually doing here and see if it's different.

mkeeter commented 1 year ago

Wicket is doing basically the same thing here. However, it may be doing it faster.

After resetting the RoT, we need to sleep for ~3 seconds before it will reply; we'll see the Timeout error message until then (a polling sketch follows the log below):

./target/debug/faux-mgs --interface en0 reset-component rot && ./target/debug/faux-mgs --interface en0 state
Jul 10 14:01:10.959 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.048 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.051 INFO SP is prepared to reset component rot, component: faux-mgs
Jul 10 14:01:12.065 INFO SP reset component rot complete, component: faux-mgs
reset complete
Jul 10 14:01:12.081 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.086 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.132 INFO V2(SpStateV2 { hubris_archive_id: [75, 16, 227, 121, 150, 189, 78, 158], serial_number: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], model: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], revision: 0, base_mac_address: [14, 29, 221, 248, 251, 105], power_state: A2, rot: Err(Sprot(Timeout)) }), component: faux-mgs
hubris archive: 4b10e37996bd4e9e
serial number:
model:
revision: 0
base MAC address: 0e:1d:dd:f8:fb:69
power state: A2
RoT state: Err(Sprot(Timeout))
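
A minimal sketch of polling with a deadline rather than a fixed sleep, assuming a hypothetical query_rot_state helper that stands in for the actual MGS state request (this is not faux-mgs code):

use std::{thread::sleep, time::{Duration, Instant}};

#[derive(Debug)]
enum RotStateError {
    Timeout,
}

// Stand-in for whatever actually issues the MGS state request; it always
// times out here so the example is self-contained.
fn query_rot_state() -> Result<String, RotStateError> {
    Err(RotStateError::Timeout)
}

// Poll until the RoT answers or the deadline expires, instead of sleeping
// for a fixed ~3 seconds and hoping that was long enough.
fn wait_for_rot(deadline: Duration) -> Result<String, RotStateError> {
    let start = Instant::now();
    loop {
        match query_rot_state() {
            Ok(state) => return Ok(state),
            Err(RotStateError::Timeout) if start.elapsed() < deadline => {
                sleep(Duration::from_millis(500));
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // With the stand-in above this reports Err(Timeout) after ~5 seconds.
    println!("{:?}", wait_for_rot(Duration::from_secs(5)));
}
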
andrewjstone commented 1 year ago

I was able to reproduce this by putting ping back in the rot-carrier image for some chaos monkey behavior while performing an update. I get an Err(Sprot(Timeout)) every time from faux-mgs. Interestingly, looking at the ringbuf, the RoT side of sprot was not getting any messages!

I then decided to run tasks on the rot-carrier and lo and behold:

➜  hubris git:(master) ✗ rota tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"
➜  hubris git:(master) ✗ rotb tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"

rota and rotb are just humility aliases pointing to the correct images.

I then ran rota registers and saw that the kernel had indeed panicked.

➜  hubris git:(master) ✗ rota registers
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
   R0 = 0x00000000
   R1 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R2 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R3 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
   R4 = 0x00050000
   R5 = 0x00050130
   R6 = 0x00000000
   R7 = 0x20004278 <- kernel: 0x20004000+0x278
   R8 = 0x00000000
   R9 = 0x00050000
  R10 = 0x00000000
  R11 = 0x00000000
  R12 = 0x20004458 <- kernel: KERNEL_EPITAPH+0x4f
   SP = 0x20004248 <- kernel: 0x20004000+0x248
   LR = 0x00012451 <- kernel: write_str+0x35
   PC = 0x0001276c <- kernel: die_impl+0x44
  PSR = 0x69000000 <- 0110_1001_0000_0000_0000_0000_0000_0000
                      |||| | ||         |       |           |
                      |||| | ||         |       |           + Exception = 0x0
                      |||| | ||         |       +------------ IC/IT = 0x0
                      |||| | ||         +-------------------- GE = 0x0
                      |||| | |+------------------------------ T = 1
                      |||| | +------------------------------- IC/IT = 0x0
                      |||| +--------------------------------- Q = 1
                      |||+----------------------------------- V = 0
                      ||+------------------------------------ C = 1
                      |+------------------------------------- Z = 1
                      +-------------------------------------- N = 0

  MSP = 0x20004248 <- kernel: 0x20004000+0x248
  PSP = 0x00000000
  SPR = 0x00000000 <- 0000_0000_0000_0000_0000_0000_0000_0000
                            |||         |         |         |
                            |||         |         |         + PRIMASK = 0
                            |||         |         +---------- BASEPRI = 0x0
                            |||         +-------------------- FAULTMASK = 0
                            ||+------------------------------ CONTROL.nPRIV = 0
                            |+------------------------------- CONTROL.SPSEL = 0
                            +-------------------------------- CONTROL.FPCA = 0

FPSCR = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
andrewjstone commented 1 year ago

Well, this is clearly not the same issue, because rebooting the RoT doesn't fix it. I have no idea how I corrupted the image.

Even stranger, when I try to reflash via humility I see this:

➜  hubris git:(master) ✗ rota flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents
➜  hubris git:(master) ✗ rotb flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents
andrewjstone commented 1 year ago

AHA! What happened was that I ran the following:

 target/release/faux-mgs --interface igb0 state && target/release/faux-mgs --interface igb0 update rot 1 ~/build-rot-carrier-image-b.zip && target/release/faux-mgs --interface igb0 component-active-slot -s 1 -p rot && target/release/faux-mgs --interface igb0 reset-component rot

The RoT update to slot B failed with a FlowError, but faux-mgs doesn't report that in the shell as an error, so I then forcefully switched to slot B even though it was corrupted. The weird thing is that I reflashed slot A from humility, but it didn't appear to get booted into directly, even though B was corrupt. Maybe that's because it was just stuck after the kernel panic.

andrewjstone commented 1 year ago

Welp, a new wrinkle has arisen. I have removed the PEBKAC by not switching to the image I am purposefully corrupting. So I ran an install of slot A via faux-mgs, which failed. I then reset and ended up back in a kernel panic, same as above. However, slot B is valid and is the persistent boot preference. Even power-cycling by removing the power cable did not fix the issue. So I reflashed slot A via humility, and when the system came back it was in slot B. You can see this after I flashed slot A and did nothing else:

➜  hubris git:(master) ✗ gimletlet hiffy -c SpRot.rot_boot_info
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
SpRot.rot_boot_info() => RotBootInfo { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([ 0x9b, 0x9b, 0xa0, 0x2a, 0x94, 0x49, 0xf9, 0x49, 0x5e, 0xd8, 0x34, 0x3f, 0xe7, 0x94, 0xa, 0x7e, 0x0, 0x99, 0x6c, 0x91, 0xe3, 0x3e, 0xaf, 0xcf, 0xa, 0xfd, 0x8b, 0xb9, 0x9d, 0x4, 0x36, 0x84 ]), slot_b_sha3_256_digest: Some([ 0x3, 0x98, 0x12, 0xd0, 0x5c, 0x8d, 0xc4, 0xfb, 0xd3, 0x13, 0x32, 0x85, 0x1, 0xa7, 0xbc, 0x9d, 0x4a, 0xb7, 0x50, 0x3c, 0x78, 0x1d, 0x69, 0xe5, 0xbd, 0x2f, 0x49, 0x25, 0x12, 0x88, 0x6a, 0xbc ]) }

My best guess is that we are crashing because the romapi is not doing its job in lpc55-rot-startup and is returning a header as programmed when it isn't.

labbott commented 1 year ago

I do have #1504, which uses our flash driver to check whether things are unprogrammed, if you want to rule out the ROM API.

labbott commented 1 year ago
        let valid = lpc55_romapi::validate_programmed(
            img_start,
            (header.total_image_len + (PAGE_SIZE - 1)) & !(PAGE_SIZE - 1),
        );

Oooh yeah, this could definitely overflow if we're using a corrupt image. I think the correct fix is to check the magic before trying to access the length, and also to switch to the checked APIs for doing math.
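
To make the failure concrete, here is a minimal sketch (not the actual lpc55-rot-startup code; ImageHeader, HEADER_MAGIC, and the helper below are illustrative stand-ins) of how an erased header trips the length rounding when overflow checks are enabled (as in the panicking build above), and how a magic check plus checked math avoids the panic:

const PAGE_SIZE: u32 = 512; // LPC55 flash page size
const HEADER_MAGIC: u32 = 0x1535_6637; // placeholder value, not the real magic

struct ImageHeader {
    magic: u32,
    total_image_len: u32,
}

// Round the image length up to a page boundary, but only after confirming
// the header looks real, and without letting the addition overflow.
fn rounded_image_len(header: &ImageHeader) -> Option<u32> {
    // Erased flash reads back as all 0xFF, so a missing or corrupt image
    // shows up with magic == 0xFFFF_FFFF; reject it before trusting
    // total_image_len.
    if header.magic != HEADER_MAGIC {
        return None;
    }
    // checked_add returns None instead of panicking with "attempt to add
    // with overflow" the way the bare `+` above does.
    header
        .total_image_len
        .checked_add(PAGE_SIZE - 1)
        .map(|len| len & !(PAGE_SIZE - 1))
}

fn main() {
    // A corrupt/erased slot: every field reads back as 0xFFFF_FFFF.
    let erased = ImageHeader { magic: 0xFFFF_FFFF, total_image_len: 0xFFFF_FFFF };
    assert_eq!(rounded_image_len(&erased), None);

    // A plausible real header rounds up to the next 512-byte page.
    let good = ImageHeader { magic: HEADER_MAGIC, total_image_len: 1000 };
    assert_eq!(rounded_image_len(&good), Some(1024));
}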

nathanaelhuffman commented 1 year ago

Just commenting here that we see this in the rack after doing an RoT-only reset via pilot: pilot sp exec -e 'reset-component rot' BRM27230001

After doing this we get the timeouts:

nathanael@drteeth ~ $ pilot sp info  BRM27230001
hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Err(Sprot(Timeout))

and after a power cycle it's back to working:

nathanael@drteeth ~ $ pilot sp info  BRM27230001
hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Ok(RotStateV2 { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([225, 90, 3, 166, 133, 26, 105, 129, 123, 135, 6, 55, 26, 48, 101, 113, 101, 162, 203, 199, 162, 151, 250, 188, 164, 159, 205, 201, 142, 227, 76, 46]), slot_b_sha3_256_digest: Some([246, 2, 155, 45, 220, 12, 152, 100, 177, 189, 42, 86, 113, 96, 65, 26, 74, 86, 216, 206, 179, 138, 198, 145, 18, 213, 189, 132, 235, 7, 140, 213]) })
jgallagher commented 1 year ago

Confirming @nathanaelhuffman's observation. During the mupdate of the PVT2 rack, I updated the RoTs of all 16 sleds, 1 psc, and 2 sidecars, and all 19 went into Sprot(Timeout) following the reset-after-update. Ignition power cycling each component brought it back (and the RoT booted into the slot we expected following the update).

mkeeter commented 1 year ago

Testing on animal and a spare locked Gimlet, I can get this to happen 100% of the time by resetting a locked RoT (SpRot.reset) with a CFPA update staged (i.e. we have written a new CFPA to the scratch page and are waiting for the ROM to copy it over, using SpRot.switch_default_image -aslot=A,duration=Forever).

Simply resetting a locked RoT does not reproduce the issue, and resetting an unlocked RoT with a CFPA update staged doesn't either (see the very top of this thread).

Once the RoT is in this bad state, sending a CS pulse (with SpRot.pulse_cs -adelay=10) does not recover it. This lends credence to the theory that it's not making it into Hubris at all.

Details in Matrix starting about here

labbott commented 1 year ago

Fixed by #1518. The cause was that the standard system reset did not clear the flash lock registers, so we have to use a different reset. This was mentioned in one line of the 1229-page manual (I'm not bitter, not at all).