Closed by mkeeter 1 year ago
Interrupt chaining was indeed involved ... but not in the way we expected! Turns out SVCall stays pended if you take a memory management fault while stacking the exception frame for SVCall. This produces reality-melting behavior. I've posted #1139 as a fix for this behavior, and in my testing on your branch it seems to work.
In discussions with @lzrd and @kc8apf we realized this could also happen if you triggered a usage or bus fault with too little stack: it would produce a derived MemManage fault, but the original fault would remain pending, so it'd be handled on return. At the time we return from the MemManage fault, we've already switched into the supervisor, so the fault would be blamed on the supervisor, which'd be... bad.
Fortunately the approach I'm using in #1139 can be adapted to cover both cases for little additional cost.
I'm going to see about adding a regression test to the kernel test suite for this, though it may be rather involved to do so.
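For anyone following along, here is a rough sketch of where that "still pended" state is visible. On ARMv7-M the pending flags for SVCall, UsageFault and BusFault live in the System Handler Control and State Register (SHCSR, at `0xE000_ED24`). The helper names below (`stale_pending`, `clear_stale_pending`) are made up for illustration, and this is not necessarily what #1139 does; the code works on a plain `u32` snapshot rather than touching the real register.

```rust
// SHCSR pending bits as defined by the ARMv7-M architecture.
const USGFAULTPENDED: u32 = 1 << 12;
const BUSFAULTPENDED: u32 = 1 << 14;
const SVCALLPENDED: u32 = 1 << 15;

// Exceptions that could be left pended when stacking their frame faults.
const STALE_PENDING: u32 = USGFAULTPENDED | BUSFAULTPENDED | SVCALLPENDED;

/// Hypothetical helper: names the exceptions still pended in a SHCSR snapshot.
fn stale_pending(shcsr: u32) -> Vec<&'static str> {
    let mut names = Vec::new();
    if shcsr & USGFAULTPENDED != 0 {
        names.push("UsageFault");
    }
    if shcsr & BUSFAULTPENDED != 0 {
        names.push("BusFault");
    }
    if shcsr & SVCALLPENDED != 0 {
        names.push("SVCall");
    }
    names
}

/// Hypothetical helper: the value to write back to SHCSR so that none of
/// these exceptions is taken after the MemManage handler returns.
fn clear_stale_pending(shcsr: u32) -> u32 {
    shcsr & !STALE_PENDING
}

fn main() {
    // Example snapshot (assumed values): MemManage active (bit 0), the three
    // fault enables set (bits 16..=18), and SVCall left pended.
    let shcsr: u32 = 0x0007_0000 | 1 | SVCALLPENDED;
    println!("pended: {:?}", stale_pending(shcsr));
    println!("write back: {:#010x}", clear_stale_pending(shcsr));
}
```

In a real MemManage handler this would be a read-modify-write of SHCSR done before returning, so the leftover exception is never taken in the wrong context.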
This is very mysterious.
If you build `sidecar/rev-b.toml` on commit 6465f006f3aacc7c51b6a0b8114438044d703d06, the `control_plane_agent` task has very little stack margin – so little, in fact, that you can trigger a stack overflow by talking to it:
(this will time out)
After this failure, the system should be in an odd state: `control_plane_agent` will be faulted, and `jefe` will be waiting on the `fault` bit (1) but not its timer bit (2), which should never happen.
Normally, `jefe` should be notified by the kernel when a task faults, and will restart it. It's unclear why this isn't happening.
Adding an infinite loop to `configurable_fault` here (right before the return) shows a system that should return to `jefe`:
Here's the saved `jefe` state, which looks like the kernel sending a notification of 1:
Because `LR` is `0xFFFFFFED`, it will return by looking at PSP. We, too, can look at PSP:
This shows that it's about to return to `08009692`, which is our `sys_recv_stub` function:
However, it never seems to make it: adding a jefe-specific trap in `userlib::sys_recv_stub`, it never seems to be entered.
One theory is something about interrupt chaining going wrong: if there are pending interrupts when the final `bx lr` in `configurable_fault` is evaluated, then it will handle them instead, and that... somehow... eventually prevents jumping to `jefe`? A piece of evidence for this theory: in the infinite-loop-before-`bx lr` code, we see `jefe` listed as `Runnable`; however, if we remove that loop and let the `bx lr` execute, `jefe` ends up in `Healthy(InRecv(None))`.
One more observation: the `net` task will panic itself every 60 seconds if it doesn't see any traffic. When that happens, `jefe` is woken up and restarts `control_plane_agent`, as it should.
If I artificially lower the `control_plane_agent` stack size on Gimletlet, this does not reproduce; haven't tried yet on Gimlet.
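For reference, the `LR` / PSP reasoning earlier in this report can be sketched as a small decoder. It assumes the standard ARMv7-M EXC_RETURN encoding (bit 2 selects PSP, bit 3 selects Thread mode, bit 4 clear means an extended floating-point frame) and the standard stacked frame layout, where the return address is the seventh word. The type and function names are made up for illustration, and every frame value except the `08009692` return address is a placeholder.

```rust
/// Decoded fields of an EXC_RETURN value (the LR seen inside a handler).
#[derive(Debug, PartialEq)]
struct ExcReturn {
    thread_mode: bool, // bit 3: return to Thread mode
    uses_psp: bool,    // bit 2: unstack from PSP rather than MSP
    fp_frame: bool,    // bit 4 clear: frame includes floating-point state
}

/// Returns None if `lr` does not look like an EXC_RETURN value.
fn decode_exc_return(lr: u32) -> Option<ExcReturn> {
    // All bits above bit 4 are ones in an EXC_RETURN value.
    if lr & 0xFFFF_FFE0 != 0xFFFF_FFE0 {
        return None;
    }
    Some(ExcReturn {
        thread_mode: lr & (1 << 3) != 0,
        uses_psp: lr & (1 << 2) != 0,
        fp_frame: lr & (1 << 4) == 0,
    })
}

/// Word index of the stacked PC: the frame starts r0, r1, r2, r3, r12, lr, pc, xpsr.
const STACKED_PC_WORD: usize = 6;

/// Reads the return address out of the words found at the stack pointer.
fn stacked_return_address(frame: &[u32]) -> Option<u32> {
    frame.get(STACKED_PC_WORD).copied()
}

fn main() {
    // The LR value observed in this report.
    println!("{:?}", decode_exc_return(0xFFFF_FFED));

    // Placeholder frame; only the PC slot uses the address from this report.
    let frame = [0, 0, 0, 0, 0, 0, 0x0800_9692, 0x0100_0000];
    if let Some(pc) = stacked_return_address(&frame) {
        println!("return address: {:#010x}", pc);
    }
}
```

With `0xFFFFFFED` all three fields come out true: a return to Thread mode, using PSP, with an extended frame. The first eight words of the frame are laid out the same way whether or not floating-point state follows them, so the PC slot is still the seventh word at PSP.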