Closed mkeeter closed 1 week ago
We saw this issue on dogfood too(!!!!!) and I did some experiments there. Power cycling via ignition (sp on / sp off / sp cycle) does not reliably clear the bit. component-reset sp does reliably clear the error bit. I re-ran the same tests on madrid, which had been up for two days, but I still could not reproduce it there.
I'm moderately suspicious this is an error that takes an extended time to show up but I think this at least gives a path forward to do updates.
It turns out ignition will work. I used manual power off/power on and gave it an extra few seconds.
We also saw this on a gimlet-e, and one that gave a slightly different error code:
-> Error Response: status: 503 Service Unavailable; headers: {"content-type":
"application/json", "x-request-id": "73a92ec5-d3b7-4125-9a8b-7ebb7ff34b0e",
"content-length": "246", "date": "Tue, 22 Oct 2024 15:30:05 GMT"}; value: Error { error_code:
Some("UpdateFailed"), message: "updating SP SpIdentifier { typ: Sled, slot: 7 } failed: failed
to send update message to SP: Error response from SP: update failed (code 6)", request_id:
"73a92ec5-d3b7-4125-9a8b-7ebb7ff34b0e" }
For reference, if we poke a different area in the system region, we get the error code that matches:
laura@lurch ~ $ pfexec humility -t sn66 readmem -w 0x1FF20000 4
humility: attached to 0483:3754:003F00164741500920383733 via ST-Link V3
\/ 4 8 c
0x1ff20000 | 00000000 | ....
laura@lurch ~ $ pfexec humility -t sn66 readmem -w 0x52002110 4
humility: attached to 0483:3754:003F00164741500920383733 via ST-Link V3
\/ 4 8 c
0x52002110 | 00800000
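To make the 0x00800000 value above easier to read, here is a small decoder sketch. The bit positions are my transcription of the FLASH_SR1/FLASH_SR2 error flags in RM0433 and should be treated as an assumption until checked against the manual; `ERROR_FLAGS` and `decode_flash_sr` are hypothetical names, not anything in Hubris or humility.

```rust
// Error-flag bit positions in FLASH_SR1 / FLASH_SR2.
// Assumption: transcribed from RM0433; verify before relying on them.
const ERROR_FLAGS: &[(u32, &str)] = &[
    (17, "WRPERR"),
    (18, "PGSERR"),
    (19, "STRBERR"),
    (21, "INCERR"),
    (22, "OPERR"),
    (23, "RDPERR"),
    (24, "RDSERR"),
    (25, "SNECCERR"),
    (26, "DBECCERR"),
];

/// Returns the names of all error flags set in a raw status register value.
fn decode_flash_sr(sr: u32) -> Vec<&'static str> {
    ERROR_FLAGS
        .iter()
        .filter(|(bit, _)| sr & (1u32 << *bit) != 0)
        .map(|(_, name)| *name)
        .collect()
}
```

Under that bit layout, `decode_flash_sr(0x0080_0000)` yields only RDPERR, since 0x00800000 is bit 23.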
Turns out that, if you look in the latest Cortex-M7 manual (https://developer.arm.com/documentation/ddi0489/f/memory-system/speculative-accesses/considerations-for-system-design), the chip does speculate. We're going with https://github.com/oxidecomputer/hubris/pull/1905 as our workaround.
closed as we merged #1905
On both colo and dogfood, we've seen SP update failures when updating to R11.
https://github.com/oxidecomputer/colo/issues/88
Logs and Hubris dumps are in /staff/rack3/BRM42220064/2024-10-18
This failure is common, but does not occur 100% of the time. When force-updating from R11 to R11 on the bench, @jgallagher and @labbott could not reproduce the issue.
The failure logs consistently show the same thing:
This represents
SpCommsError::UpdateFailed(UpdateError::CommunicationError(CommunicationError::SpError(SpError::UpdatedFailed(7))))
The 7 code is not strongly typed, but from auditing error types that get cast into the u32, it's most likely UpdateError::ReadProtErr (this is Hubris's internal UpdateError type, not the MGS UpdateError).
This agrees with the ringbuf, which shows no progress after EraseEnd:
At the end of bank_erase, Hubris checks the status flags for bank 2 (bank2_status) and returns an error if any of them are set.
In other words, it seems likely that the RDPERR bit is set in the bank 2 status bits.
We never enable read protection, so it's unclear how this bit could end up being set.
Spontaneously set flags have been reported on the ST forums and among other embedded OSes.
The Zephyr issue at https://github.com/zephyrproject-rtos/zephyr/issues/60449 is a good summary.
Zephyr manages to see this issue by just sleeping (see main.c). Note that the sleep syscall goes into the kernel, so there's stuff happening under the hood, but not much!
In the ST forum, the issue is diagnosed as follows:
This raises more questions than answers:
RDPERR, not RDSERR. Do they have the same root cause?
Zephyr eliminated the error by dedicating an MPU region to system memory, which is evidence for this theory. It's unclear whether that would be feasible for us (some of our tasks are already using every MPU region), or whether we should expect our usual memory protection to have the same effect.
We can kinda reproduce the issue by issuing reads to system memory using humility readmem.
Here's an example of reading from system flash (0x1FF02000) then seeing a flag set in FLASH_SR2 (0x52002110):
Note that this sets the RDSERR bit, not the RDPERR bit, so it's not quite the same as our test.
There are two flash status registers: FLASH_SR1 and FLASH_SR2. The error flag is set in either FLASH_SR1 or FLASH_SR2, depending on whether we are running in bank-swapped mode; this is one of the few cases where bank swapping is visible. This also means that our Hubris check is wrong, because it always looks at FLASH_SR2.
Running the same test after switching into the other bank, the flag is set in FLASH_SR1 (0x52002010) instead of FLASH_SR2:
Miscellaneous observations
The Proprietary code readout protection functionality is noted to raise error flags without generating bus errors:
RM0433, § 4.5.4
We're not using it, but who knows!
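Coming back to the bank-swap point above: since the flag can land in either FLASH_SR1 or FLASH_SR2, one option is to look at both registers rather than only FLASH_SR2. This is a rough sketch under stated assumptions and is not taken from #1905. The register addresses are the ones from the readmem runs in this issue; the RDPERR/RDSERR bit positions (23 and 24) are assumed from RM0433; `read_reg` and `banks_with_read_errors` are hypothetical names standing in for however the caller actually reads a register.

```rust
// Status register addresses as observed via `humility readmem` in this issue.
const FLASH_SR1: u32 = 0x5200_2010;
const FLASH_SR2: u32 = 0x5200_2110;

// RDPERR (bit 23) and RDSERR (bit 24). Assumption: positions per RM0433.
const READ_ERROR_MASK: u32 = (1 << 23) | (1 << 24);

/// Hypothetical helper: returns the address of every status register that
/// currently has a read-protection or read-secure error flag set.
/// `read_reg` is a stand-in for the real register access.
fn banks_with_read_errors(read_reg: impl Fn(u32) -> u32) -> Vec<u32> {
    [FLASH_SR1, FLASH_SR2]
        .iter()
        .copied()
        .filter(|&addr| read_reg(addr) & READ_ERROR_MASK != 0)
        .collect()
}
```

Checking both registers avoids having to know whether the banks are currently swapped, at the cost of one extra register read.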