oxidecomputer / hubris

A lightweight, memory-protected, message-passing kernel for deeply embedded systems.
Mozilla Public License 2.0
3.04k stars 180 forks source link

ARMv6M still attributes faulting SVC to wrong task #1928

Closed cbiffle closed 1 week ago

cbiffle commented 1 week ago

We are still haunted by the ghost of #1134.

Background

As a refresher, that bug happened because

The "fix"

We fixed this on v7-M/v8-M by writing the SHCSR to un-pend SVC on any fault. Great. Works.

So here I am chasing down a subtle intermittent system failure on an M0+ chip (ARMv6-M) when I notice that I'm getting phantom SVCs, as though #1134 had not been fixed. 👻

Hitting the books, it appears from the ARMv6-M spec that...

So I think our fix for this on ARMv6-M may never have worked, and we may have fallen victim to one of the classic blunders --- the most famous of which is, ‘never get involved in a land war in Asia,’ but only slightly less well-known is this:

Never assume that an architectural feature subsetted from ARMv7-M into ARMv6-M can simply be used, and hasn't been subtly changed in some way! 💀

The fix for the "fix"

On v6-M, I think we're going to have to add a global variable and a few instructions to the SVC ISR to handle this. The way to fix this without architectural support (grrrrr) is

Noticing that an SVC is pending is... going to be interesting. On ARMv6-M processors where SHCSR is implemented at all and is actually readable from the CPU, we can read SHCSR. Yay. However, we can't detect whether this is the case at compile time. For all of the modeling of the CPUs down to the package that our PACs perform, they don't actually model this aspect. So that solution would be potentially flaky.

The alternative is for the SVC to interpret the exception stack frame, load the PC value, read the instruction just before it, and verify that it's an SVC. If it isn't, it must cancel itself, returning without changing any register state.

A minimal sequence to do this on ARMv6-M is approximately...

@ Get the unprivileged stack pointer into r0.
@ We can use r0-r3 as temporaries because they're stacked.
mrs r0, PSP
@ Load the PC value from the frame, which is the word at index 6.
ldr r0, [r0, #(6 * 4)]
@ Load the instruction _just before_ the PC. SVCs are 2 bytes.
@ We only care about the high-order byte of the 16-bit instruction,
@ as the low-order byte is an (unused) immediate argument.
@ v6m is always instruction-little-endian, so the high order byte
@ is one byte before the PC.
subs r0, #1  @ No negative displacement addressing on v6m
ldrb r0, [r0]  @ but at least we have byte loads!
cmp r0, #0xDF  @ ARMv6M ARM A6.7.17
bne recovery_path

...svc code goes here

That adds 10 cycles to the normal SVC dispatch path on an M0+, which is a small price to pay for correctness.

cbiffle commented 1 week ago

.......yeah so it occurs to me, my mitigation suggestion won't work.

Why?

Because in practice the control flow path is always...

  1. Supervisor blocks in RECV from the kernel
  2. Other task does stuff
  3. Other task attempts to make an SVC without enough stack
  4. Kernel fault handler kicks in and kills it
  5. Kernel attempts to transfer control to the supervisor
  6. ...which is, in fact, sitting in an SVC. Since that's how it made it to RECV. In fact, because the supervisor is the highest priority task, any time it's blocked it's sitting in an SVC.

So we're going to have to do something more subtle than that.