Open stephenrkell opened 3 years ago
Confirmed this is a problem for running bash
, no doubt among many other programs.
Once we cache our trap sites (#14), this will be easier. We'll have an in-memory record of where our trap sites are. So, we can test the trapping address against that, and delegate to the right handler (ours or the user's; should be disjoint).
Actually the problem is not just SIGILL handling... it's handling any signal. What happens on any application-provided signal handler? It returns, passing control to the libc's restorer... which tries to do sigreturn but traps (because we instrument libc). Any sigreturn we do will return from our SIGILL (prematurely!) but not from the application's signal handler, which is the next signal frame up the stack.
Probably this is the problem Guillaume encountered. We need to refrain from trapping libc's sigreturn syscall. If/when we switch to jump-based instrumentation, this will go away.
One idea I had is to perform sigreturn in userspace.
What does sigreturn do? It fixes up the executing context so that it will resume back at the point where the signal was raised.
In our case, we have a SIGILL raised at the syscall site. We tweak the saved context. And our sigreturn wants to resume using that saved-then-tweaked context, i.e. to resume at (after) the syscall site.
The user's sigreturn, which we have trapped, just wants to resume whatever context is on the user's stack. By trapping it we have inserted another sigframe context in the middle. We could just special-case our return-from-trapped-sigreturn by popping our stack pointer all the way back to the user's context? Then our sigreturn is doing the user's sigreturn. Hmm. It's worth a try. Put differently, this is an alternative to our "trap site plus two bytes" fix-up. The user never wants to run the instructions after their sigreturn syscall; they want to resume the program.
This might screw with the signal mask. Indeed we want to unmask the user's signal before we do this. In between doing that call and actually resuming the user code, another signal might come in. Is this a problem? Maybe. See davmac's comments on Twitter. https://twitter.com/stephenrkell/status/1417451829294739456
It also might screw with sigaltstack. Though if we are sigaltstack'd, then that is true for our SIGILL handler as well as for the user's code, no? So the same resume should work for both.
On arches where RT signals are distinct from 'normal' signals, we have a problem because the kind of resuming we can do from our own frame is only the SIGILL kind (non-RT). If the user's signal handling is trying to do the other kind of resume, we probably need to do the whole thing in userspace.
... and/or use Guillaume's solution: refrain from trapping those syscalls. It helps that most of the time, sigreturn is done only from a single chunk of restorer code.
Leaving that untrapped would make us insecure, of course... someone who knows where libc's restorer is can do an arbitrary syscall by jumping into it. Currently we have no sandboxing of ourselves though, so the same is true of our own syscall path. One day we might become a secure 'process supervisor' and will have to solve this, again probably by the userspace solution.
Remember: it may be best to invest in jump-based trapping rather than solving this. But the quick hack of sigreturn-directly is worth trying.
Actually, doing a sigreturn by jump-trampoline will be a problem too... the trampoline pushes stuff onto the stack, whereas sigreturn only works if the stack pointer is pointing in the right place. So we will need to detect sigreturns specially.
At the moment, applications with their own SIGILL handler are a problem. We can support this by chaining the handler, so that a SIGILL that isn't ours can be passed through. This may mean revisiting the instrumentation of
sigreturn
... Guillaume patched us so that we don't instrument this, but I have a feeling that is not correct. Rather, the sigreturn should happen in our own binary so should be immune from patching anyhow. If we delegate to a client's SIGILL handler, we may have to do the sigreturn on its behalf... this requires care, because it implies nested handling. We should be able to avoid that.