I can't reproduce this using faux-mgs on the benchtop.
faux-mgs update rot 1 /Users/mjk/oxide/hubris/target/rot-carrier/dist/b/build-rot-carrier-image-b.zip
faux-mgs state # works fine
faux-mgs component-active-slot rot --set 1 --persist
faux-mgs state # still works fine, shows B as pending persistent boot preference
faux-mgs reset-component rot
faux-mgs state # shows B as active and persistent boot preference
I'm going to look into what wicket is actually doing here and see if it's different.
Wicket is doing basically the same thing here. However, it may be doing it faster. After resetting the RoT, we need to sleep for ~3 seconds before it will reply; we'll see the Timeout error message until then:
./target/debug/faux-mgs --interface en0 reset-component rot && ./target/debug/faux-mgs --interface en0 state
Jul 10 14:01:10.959 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.048 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.051 INFO SP is prepared to reset component rot, component: faux-mgs
Jul 10 14:01:12.065 INFO SP reset component rot complete, component: faux-mgs
reset complete
Jul 10 14:01:12.081 INFO creating SP handle on interface en0, component: faux-mgs
Jul 10 14:01:12.086 INFO initial discovery complete, addr: [fe80::c1d:ddff:fef8:fb69%15]:11111, interface: en0, component: faux-mgs
Jul 10 14:01:12.132 INFO V2(SpStateV2 { hubris_archive_id: [75, 16, 227, 121, 150, 189, 78, 158], serial_number: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], model: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], revision: 0, base_mac_address: [14, 29, 221, 248, 251, 105], power_state: A2, rot: Err(Sprot(Timeout)) }), component: faux-mgs
hubris archive: 4b10e37996bd4e9e
serial number:
model:
revision: 0
base MAC address: 0e:1d:dd:f8:fb:69
power state: A2
RoT state: Err(Sprot(Timeout))
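The ~3 second dead window after reset means anything scripting a reset-then-query sequence should poll for the RoT to come back rather than fail on the first Timeout. A minimal sketch of that retry loop, where `get_rot_state` is a hypothetical stand-in for whatever actually queries the SP (this is not faux-mgs code):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the query that returns Err(Sprot(Timeout))
// while the RoT is still coming back up after a reset.
fn get_rot_state() -> Result<String, String> {
    Err("Sprot(Timeout)".to_string())
}

/// Poll until the RoT answers or the deadline passes; the RoT needs
/// roughly 3 seconds after reset before it will reply over sprot.
fn wait_for_rot(max_wait: Duration) -> Result<String, String> {
    let start = Instant::now();
    loop {
        match get_rot_state() {
            Ok(state) => return Ok(state),
            // Keep retrying until the deadline.
            Err(_) if start.elapsed() < max_wait => {
                sleep(Duration::from_millis(250));
            }
            Err(e) => return Err(e),
        }
    }
}
```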
I was able to reproduce this by putting ping back in the rot-carrier image for some chaos monkey behavior while performing an update. I get an Err(Sprot(Timeout)) every time from faux-mgs. Interestingly, looking at the ringbuf, the RoT side of sprot was not getting any messages!
I then decided to run tasks on the rot-carrier and lo and behold:
➜ hubris git:(master) ✗ rota tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"
➜ hubris git:(master) ✗ rotb tasks
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility tasks failed: kernel has panicked on boot: "panicked at 'attempt to add with overflow', lib/lpc55-rot-startup/src/images.rs:108:13"
rota and rotb are just humility aliases pointing to the correct images.
I then ran rota registers and saw that the kernel had indeed panicked.
➜ hubris git:(master) ✗ rota registers
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
R0 = 0x00000000
R1 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
R2 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
R3 = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
R4 = 0x00050000
R5 = 0x00050130
R6 = 0x00000000
R7 = 0x20004278 <- kernel: 0x20004000+0x278
R8 = 0x00000000
R9 = 0x00050000
R10 = 0x00000000
R11 = 0x00000000
R12 = 0x20004458 <- kernel: KERNEL_EPITAPH+0x4f
SP = 0x20004248 <- kernel: 0x20004000+0x248
LR = 0x00012451 <- kernel: write_str+0x35
PC = 0x0001276c <- kernel: die_impl+0x44
PSR = 0x69000000 <- 0110_1001_0000_0000_0000_0000_0000_0000
|||| | || | | |
|||| | || | | + Exception = 0x0
|||| | || | +------------ IC/IT = 0x0
|||| | || +-------------------- GE = 0x0
|||| | |+------------------------------ T = 1
|||| | +------------------------------- IC/IT = 0x0
|||| +--------------------------------- Q = 1
|||+----------------------------------- V = 0
||+------------------------------------ C = 1
|+------------------------------------- Z = 1
+-------------------------------------- N = 0
MSP = 0x20004248 <- kernel: 0x20004000+0x248
PSP = 0x00000000
SPR = 0x00000000 <- 0000_0000_0000_0000_0000_0000_0000_0000
||| | | |
||| | | + PRIMASK = 0
||| | +---------- BASEPRI = 0x0
||| +-------------------- FAULTMASK = 0
||+------------------------------ CONTROL.nPRIV = 0
|+------------------------------- CONTROL.SPSEL = 0
+-------------------------------- CONTROL.FPCA = 0
FPSCR = 0x2000445f <- kernel: KERNEL_EPITAPH+0x56
Well, this is clearly not the same issue, because rebooting the RoT doesn't fix it. I have no idea how I corrupted the image.
Even stranger, when I try to reflash via humility, I see this:
➜ hubris git:(master) ✗ rota flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents
➜ hubris git:(master) ✗ rotb flash
humility: attaching with chip set to "LPC55S69JBD100"
humility: attached to 1fc9:0143:CW4N3IFK22UUX via CMSIS-DAP
humility flash failed: archive appears to be already flashed on attached device; use -F ("--force") to force re-flash or -V ("--verify") to verify contents
AHA! What happened was that I ran the following:
target/release/faux-mgs --interface igb0 state && target/release/faux-mgs --interface igb0 update rot 1 ~/build-rot-carrier-image-b.zip && target/release/faux-mgs --interface igb0 component-active-slot -s 1 -p rot && target/release/faux-mgs --interface igb0 reset-component rot
The RoT update to slot B failed with a FlowError, but faux-mgs doesn't report that in the shell as an error, so I then forcefully switched to slot B even though it was corrupted. The weird thing is that I reflashed slot A from humility, but the RoT didn't appear to boot directly into it, even though B was corrupt. Maybe that's because it was just stuck after the kernel panic.
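As an aside, the reason the `&&` chain above sailed past the failed update is that the process still exited 0 even though the update returned a FlowError. A minimal sketch of the kind of exit-status plumbing that would stop the chain (`run_update` and `UpdateError` are hypothetical stand-ins, not the actual faux-mgs code):

```rust
use std::process::ExitCode;

// Hypothetical error type standing in for whatever the update path returns.
#[derive(Debug)]
enum UpdateError {
    FlowError,
}

// Hypothetical stand-in for the real update logic.
fn run_update() -> Result<(), UpdateError> {
    Err(UpdateError::FlowError)
}

fn main() -> ExitCode {
    match run_update() {
        Ok(()) => ExitCode::SUCCESS,
        Err(e) => {
            // Report on stderr and exit nonzero so an `&&` chain stops here.
            eprintln!("update failed: {e:?}");
            ExitCode::FAILURE
        }
    }
}
```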
Welp, a new wrinkle has arisen. I have removed the PEBKAC by not switching to the image I am purposefully corrupting. So I ran an install of slot A via faux-mgs, which failed. I then reset and ended up back in a kernel panic, same as above. However, slot B is valid and is the persistent boot preference. Even power-cycling by pulling the power cable did not fix the issue. So I reflashed slot A via humility, and when the system came back it was in slot B. You can see this after I flashed slot A and did nothing else:
➜ hubris git:(master) ✗ gimletlet hiffy -c SpRot.rot_boot_info
humility: attached to 0483:3754:005100383431511731343632 via ST-Link V3
SpRot.rot_boot_info() => RotBootInfo { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([ 0x9b, 0x9b, 0xa0, 0x2a, 0x94, 0x49, 0xf9, 0x49, 0x5e, 0xd8, 0x34, 0x3f, 0xe7, 0x94, 0xa, 0x7e, 0x0, 0x99, 0x6c, 0x91, 0xe3, 0x3e, 0xaf, 0xcf, 0xa, 0xfd, 0x8b, 0xb9, 0x9d, 0x4, 0x36, 0x84 ]), slot_b_sha3_256_digest: Some([ 0x3, 0x98, 0x12, 0xd0, 0x5c, 0x8d, 0xc4, 0xfb, 0xd3, 0x13, 0x32, 0x85, 0x1, 0xa7, 0xbc, 0x9d, 0x4a, 0xb7, 0x50, 0x3c, 0x78, 0x1d, 0x69, 0xe5, 0xbd, 0x2f, 0x49, 0x25, 0x12, 0x88, 0x6a, 0xbc ]) }
My best guess is that we are crashing because the romapi is not doing its job in lpc55-rot-startup and is returning a header as programmed when it isn't.
I do have #1504, which uses our flash driver to check whether things are unprogrammed, if you want to rule out the ROM API:
let valid = lpc55_romapi::validate_programmed(
    img_start,
    (header.total_image_len + (PAGE_SIZE - 1)) & !(PAGE_SIZE - 1),
);
Oooh yeah, this could definitely overflow if we're reading a corrupt image. I think the correct fix is to check the magic before trying to access the length, and also to switch to the checked APIs for doing math.
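A minimal sketch of what that might look like, assuming a header layout with a magic word and a length field (the field names and HEADER_MAGIC value here are assumptions, not the actual lpc55-rot-startup code):

```rust
const PAGE_SIZE: u32 = 512;
const HEADER_MAGIC: u32 = 0x4f58_4944; // hypothetical magic value

struct Header {
    magic: u32,
    total_image_len: u32,
}

/// Round the image length up to a whole page, returning None for a
/// corrupt header instead of panicking on overflow.
fn padded_image_len(header: &Header) -> Option<u32> {
    // Don't trust the length field at all if the magic doesn't match.
    if header.magic != HEADER_MAGIC {
        return None;
    }
    // checked_add returns None on overflow rather than panicking, which
    // is exactly the 'attempt to add with overflow' panic seen above.
    header
        .total_image_len
        .checked_add(PAGE_SIZE - 1)
        .map(|padded| padded & !(PAGE_SIZE - 1))
}
```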
Just commenting here that we see this in the rack after doing an RoT-only reset via pilot:
pilot sp exec -e 'reset-component rot' BRM27230001
After doing this we get the timeouts:
nathanael@drteeth ~ $ pilot sp info BRM27230001
hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Err(Sprot(Timeout))
and after a power cycle it's back to working:
nathanael@drteeth ~ $ pilot sp info BRM27230001
hubris archive: 6cb4ed394fc7b36c
serial number: BRM27230001
model: 913-0000019
revision: 13
base MAC address: a8:40:25:04:05:c1
power state: A0
RoT state: Ok(RotStateV2 { active: B, persistent_boot_preference: B, pending_persistent_boot_preference: None, transient_boot_preference: None, slot_a_sha3_256_digest: Some([225, 90, 3, 166, 133, 26, 105, 129, 123, 135, 6, 55, 26, 48, 101, 113, 101, 162, 203, 199, 162, 151, 250, 188, 164, 159, 205, 201, 142, 227, 76, 46]), slot_b_sha3_256_digest: Some([246, 2, 155, 45, 220, 12, 152, 100, 177, 189, 42, 86, 113, 96, 65, 26, 74, 86, 216, 206, 179, 138, 198, 145, 18, 213, 189, 132, 235, 7, 140, 213]) })
Confirming @nathanaelhuffman's observation. During the mupdate of the PVT2 rack, I updated the RoTs of all 16 sleds, 1 PSC, and 2 sidecars, and all 19 went into Sprot(Timeout) following the reset-after-update. Ignition power-cycling each component brought it back (and the RoT booted into the slot we expected following the update).
Testing on animal and a spare locked Gimlet, I can get this to happen 100% of the time by resetting a locked RoT (SpRot.reset) with a CFPA update staged (i.e. we have written a new CFPA to the scratch page and are waiting for the ROM to copy it over, using SpRot.switch_default_image -aslot=A,duration=Forever).
Simply resetting a locked RoT does not reproduce the issue, and resetting an unlocked RoT with a CFPA update staged doesn't either (see the very top of this thread).
Once the RoT is in this bad state, sending a CS pulse (with SpRot.pulse_cs -adelay=10) does not recover it. This lends credence to the theory that it's not making it into Hubris at all.
Details in Matrix starting about here
Fixed by #1518. The cause was that the standard system reset did not clear the flash lock registers, so we have to use a different reset. This was mentioned in one line of the 1229-page manual (I'm not bitter, not at all).
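To make the distinction concrete: on a Cortex-M the "standard" system reset is SYSRESETREQ, which is what cortex_m::peripheral::SCB::sys_reset() issues, and per the above it leaves the flash lock registers set. A rough sketch only, assuming the cortex-m crate, with full_chip_reset as a hypothetical placeholder for the mechanism #1518 actually uses:

```rust
use cortex_m::peripheral::SCB;

/// The "standard" system reset: asserts SYSRESETREQ through the core.
/// On the LPC55 this does not clear the flash lock registers, which is
/// what left a locked RoT wedged after a reset-with-staged-CFPA.
fn standard_reset() -> ! {
    SCB::sys_reset()
}

/// Hypothetical placeholder for the alternative reset that also clears
/// the flash lock registers; see #1518 for the real mechanism.
fn full_chip_reset() -> ! {
    unimplemented!("see #1518")
}
```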
After using wicket to update an RoT (from slot A to slot B), we observe that the RoT reports a timeout when attempting to read its status. After using ignition to power-cycle the whole board, it properly comes up as running from slot B.