Spurious update failures

mkeeter commented 1 month ago

On both colo and dogfood, we've seen SP update failures when updating to R11.

https://github.com/oxidecomputer/colo/issues/88

Logs and Hubris dumps are in /staff/rack3/BRM42220064/2024-10-18

This failure is common, but does not occur 100% of the time. When force-updating from R11 to R11 on the bench, @jgallagher and @labbott could not reproduce the issue.

The failure logs consistently show the same thing:

updating SP SpIdentifier { typ: Sled, slot: 14 } failed: failed to send update message to SP: Error response from SP: update failed (code 7)

This represents SpCommsError::UpdateFailed(UpdateError::CommunicationError(CommunicationError::SpError(SpError::UpdatedFailed(7)))

The 7 code is not strongly typed, but from auditing error types that get cast into the u32, it's most likely UpdateError::ReadProtErr (this is Hubris's internal UpdateError type, not the MGS UpdateError).

This agrees with the ringbuf, which shows no progress after EraseEnd:

humility: ring buffer stm32h7_update_server::__RINGBUF in update_server:
 NDX LINE      GEN    COUNT PAYLOAD
   0  247        1        1 EraseStart
   1  283        1        1 EraseEnd
   2  247        1        1 EraseStart
   3  283        1        1 EraseEnd
   4  247        1        1 EraseStart
   5  283        1        1 EraseEnd
   6  247        1        1 EraseStart
   7  283        1        1 EraseEnd

At the end of bank_erase, Hubris checks the status flags for bank 2 (bank2_status) and returns an error if any of them are set.

In other words, it seems likely that the RDPERR bit is set in the bank 2 status bits.

We never enable read protection, so it's unclear how this bit could end up being set.

Spontaneously set flags have been reported on the ST forums and among other embedded OSes.

The Zephyr issue at https://github.com/zephyrproject-rtos/zephyr/issues/60449 is a good summary.

Zephyr manages to see this issue by just sleeping (see main.c). Note that the sleep syscall goes into the kernel, so there's stuff happening under the hood, but not much!

In the ST forum, the issue is diagnosed as follows:

the Cortex-M7's prefetch unit is performing speculative instruction fetches to a (partially undocumented) region of system memory (0x1FF0 0000 - 0x1FF7 FFFF) decoded by the flash controller, which sets RDSERR in protest. Mapping this region as Execute-Never with the MPU inhibits the speculative fetch, preventing the RDSERR.

This raises more questions than answers:

Why would the microcontroller be pre-fetching system memory?
What has changed in our latest Hubris (or host OS) release to make this happen more often?
Can we know that it's a system memory prefecth causing the error on Hubris?
We're seeing RDPERR, not RDSERR. Do they have the same root case?
Why don't we see a bus fault?

Zephyr eliminated the error by dedicating an MPU region to system memory, which is evidence for this theory. It's unclear whether that would be feasible for us (some of our tasks are already using every MPU region), or whether we should expect our usual memory protection to have the same effect.

We can kinda reproduce the issue by issuing reads to system memory using humility readmem.

Here's an example of reading from system flash (0x1FF02000) then seeing a flag set in FLASH_SR2 (0x52002110):

➜  ~ h readmem -w 0x52002110 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002110 | 00000000                            | ....
➜  ~ h readmem -w 0x1FF02000 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x1ff02000 | 00000000                            | ....
➜  ~ h readmem -w 0x52002110 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002110 | 01000000                            | ....

Note that this sets the RDSERRIE, not the RDPERR bit, so it's not quite the same as our test.

There are two flash status registers: FLASH_SR1 and FLASH_SR2. The error flag is set in either FLASH_SR1 or FLASH_SR2, depending on whether we are running in bank-swapped mode; this is one of the few cases where bank swapping is visible. This also means that our Hubris check is wrong, because it always looks at FLASH_SR2.

Running the same test after switching into the other bank, the flag is set in FLASH_SR1 (0x52002010) instead of FLASH_SR2:

➜  ~ h readmem -w 0x52002010 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002010 | 00000000                            | ....
➜  ~ h readmem -w 0x52002110 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002110 | 00000000                            | ....
➜  ~ h readmem -w 0x1FF02000 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x1ff02000 | 00000000                            | ....
➜  ~ h readmem -w 0x52002010 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002010 | 01000000                            | ....
➜  ~ h readmem -w 0x52002110 4
humility: attached via ST-Link V3
                   \/        4        8        c
0x52002110 | 00000000                            | ....

Miscellaneous observations

The Proprietary code readout protection functionality is noted to raise error flags without generating bus errors:

No bus error is generated but an error flag is raised (RDPERR for read, WRPERR for write).

RM0433, § 4.5.4

We're not using it, but who knows!

labbott commented 1 month ago

We saw this issue on dogfood too(!!!!!) and I did some experiments there. Power cycling via ignition (sp on/sp off/sp cycle) does not reliably clear the bit. component-reset sp does reliably clear the error bit. I re-ran the same tests on madrid which had been up for two days but I still could not reproduce it there.

I'm moderately suspicious this is an error that takes an extended time to show up but I think this at least gives a path forward to do updates.

labbott commented 1 month ago

It turns out igntition will work. I used manual power off/power on and gave it an extra few seconds.

We also saw this on a gimlet-e and one that gave a slightly different error code

 │            │ -> Error Response: status: 503 Service Unavailable; headers: {"content-type":                  │─────────────╮ 
 │            │ "application/json", "x-request-id": "73a92ec5-d3b7-4125-9a8b-7ebb7ff34b0e",                    │             │ 
 │            │ "content-length": "246", "date": "Tue, 22 Oct 2024 15:30:05 GMT"}; value: Error { error_code:  │─────────────┤ 
 │            │ Some("UpdateFailed"), message: "updating SP SpIdentifier { typ: Sled, slot: 7 } failed: failed │             │ 
 │            │ to send update message to SP: Error response from SP: update failed (code 6)", request_id:     │             │ 
 │            │ "73a92ec5-d3b7-4125-9a8b-7ebb7ff34b0e" }

labbott commented 1 month ago

For reference if we poke a different area in the system region we get the error code that matches

laura@lurch ~ $ pfexec humility -t sn66 readmem -w 0x1FF20000 4                
humility: attached to 0483:3754:003F00164741500920383733 via ST-Link V3
                   \/        4        8        c
0x1ff20000 | 00000000                            | ....            
laura@lurch ~ $ pfexec humility -t sn66 readmem -w 0x52002110 4                
humility: attached to 0483:3754:003F00164741500920383733 via ST-Link V3
                   \/        4        8        c
0x52002110 | 00800000

labbott commented 1 month ago

Turns out if you look in the latest Cortex M7 manual https://developer.arm.com/documentation/ddi0489/f/memory-system/speculative-accesses/considerations-for-system-design the chip does speculate. We're going with https://github.com/oxidecomputer/hubris/pull/1905 as our workaround.

labbott commented 1 week ago

closed as we merged #1905

oxidecomputer / hubris

Spurious update failures #1908

Miscellaneous observations