A task starts to make a system call (using the SVC instruction)
The processor respond to SVC by stacking an exception frame
But the task doesn't have enough stack space for that exception frame
And so the processor triggers the fault handler instead
However, SVC is still pending in the interrupt controller
So when the fault handler returns, attempting to return into the supervisor to handle the fault, the supervisor immediately issues a phantom SVC.
The "fix"
We fixed this on v7-M/v8-M by writing the SHCSR to un-pend SVC on any fault. Great. Works.
So here I am chasing down a subtle intermittent system failure on an M0+ chip (ARMv6-M) when I notice that I'm getting phantom SVCs, as though #1134 had not been fixed. 👻
Hitting the books, it appears from the ARMv6-M spec that...
SHCSR is part of the debug extension and may not be implemented. Cool, fine, all processors we're interested in support the debug extension. (It's mostly only omitted on embedded cores in ASICs.)
...but whether the SHCSR is accessible from the processor is implementation defined?? This isn't mentioned in the SHCSR section, it's tossed into a footnote in section C1.6 which reads: "The DWT, BPU, ROM table, DCB, and the SHCSR and DFSR registers are accessible through the DAP interface. Access from the processor is IMPLEMENTATION DEFINED." (It's ironic that a British processor company would be so inconsistent about the Oxford comma, but I digress.)
...and also it's not specifically defined as being writable from the CPU.
So I think our fix for this on ARMv6-M may never have worked, and we may have fallen victim to one of the classic blunders --- the most famous of which is, ‘never get involved in a land war in Asia,’ but only slightly less well-known is this:
Never assume that an architectural feature subsetted from ARMv7-M into ARMv6-M can simply be used, and hasn't been subtly changed in some way! 💀
The fix for the "fix"
On v6-M, I think we're going to have to add a global variable and a few instructions to the SVC ISR to handle this. The way to fix this without architectural support (grrrrr) is
In the HardFault handler, notice that the fault occurred while stacking an SVC, and set a flag in RAM.
In the SVC handler, check-and-clear this flag. If it's set, just return, preserving all register state.
Noticing that an SVC is pending is... going to be interesting. On ARMv6-M processors where SHCSR is implemented at all and is actually readable from the CPU, we can read SHCSR. Yay. However, we can't detect whether this is the case at compile time. For all of the modeling of the CPUs down to the package that our PACs perform, they don't actually model this aspect. So that solution would be potentially flaky.
The alternative is for the SVC to interpret the exception stack frame, load the PC value, read the instruction just before it, and verify that it's an SVC. If it isn't, it must cancel itself, returning without changing any register state.
A minimal sequence to do this on ARMv6-M is approximately...
@ Get the unprivileged stack pointer into r0.
@ We can use r0-r3 as temporaries because they're stacked.
mrs r0, PSP
@ Load the PC value from the frame, which is the word at index 6.
ldr r0, [r0, #(6 * 4)]
@ Load the instruction _just before_ the PC. SVCs are 2 bytes.
@ We only care about the high-order byte of the 16-bit instruction,
@ as the low-order byte is an (unused) immediate argument.
@ v6m is always instruction-little-endian, so the high order byte
@ is one byte before the PC.
subs r0, #1 @ No negative displacement addressing on v6m
ldrb r0, [r0] @ but at least we have byte loads!
cmp r0, #0xDF @ ARMv6M ARM A6.7.17
bne recovery_path
...svc code goes here
That adds 10 cycles to the normal SVC dispatch path on an M0+, which is a small price to pay for correctness.
.......yeah so it occurs to me, my mitigation suggestion won't work.
Why?
Because in practice the control flow path is always...
Supervisor blocks in RECV from the kernel
Other task does stuff
Other task attempts to make an SVC without enough stack
Kernel fault handler kicks in and kills it
Kernel attempts to transfer control to the supervisor
...which is, in fact, sitting in an SVC. Since that's how it made it to RECV. In fact, because the supervisor is the highest priority task, any time it's blocked it's sitting in an SVC.
So we're going to have to do something more subtle than that.
We are still haunted by the ghost of #1134.
Background
As a refresher, that bug happened because
The "fix"
We fixed this on v7-M/v8-M by writing the SHCSR to un-pend SVC on any fault. Great. Works.
So here I am chasing down a subtle intermittent system failure on an M0+ chip (ARMv6-M) when I notice that I'm getting phantom SVCs, as though #1134 had not been fixed. 👻
Hitting the books, it appears from the ARMv6-M spec that...
So I think our fix for this on ARMv6-M may never have worked, and we may have fallen victim to one of the classic blunders --- the most famous of which is, ‘never get involved in a land war in Asia,’ but only slightly less well-known is this:
Never assume that an architectural feature subsetted from ARMv7-M into ARMv6-M can simply be used, and hasn't been subtly changed in some way! 💀
The fix for the "fix"
On v6-M, I think we're going to have to add a global variable and a few instructions to the SVC ISR to handle this. The way to fix this without architectural support (grrrrr) is
Noticing that an SVC is pending is... going to be interesting. On ARMv6-M processors where SHCSR is implemented at all and is actually readable from the CPU, we can read SHCSR. Yay. However, we can't detect whether this is the case at compile time. For all of the modeling of the CPUs down to the package that our PACs perform, they don't actually model this aspect. So that solution would be potentially flaky.
The alternative is for the SVC to interpret the exception stack frame, load the PC value, read the instruction just before it, and verify that it's an SVC. If it isn't, it must cancel itself, returning without changing any register state.
A minimal sequence to do this on ARMv6-M is approximately...
That adds 10 cycles to the normal SVC dispatch path on an M0+, which is a small price to pay for correctness.