optimize how we patch blocking syscalls

wangbj commented 5 years ago

With the current design, if a syscall blocks, systrace doesn't patch it until it returns. The reason is that if we patch the site while the original syscall is blocked, then after the syscall resumes it sees invalid instructions following the two-byte syscall instruction. In the best case we get SIGILL or SIGSEGV; in the worst case the trailing three bytes happen to be a valid instruction sequence, which leads to undefined behavior.
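To make the hazard concrete, here is a minimal sketch. It assumes the patch is a 5-byte `call rel32` written over the 2-byte `syscall` instruction plus the 3 bytes that follow it; the actual patch sequence used by systrace may differ, and the function names are hypothetical.

```rust
/// Length of the x86-64 `syscall` instruction (bytes 0f 05).
const SYSCALL_LEN: usize = 2;
/// Assumed patch length: a 5-byte `call rel32` (opcode e8 + 32-bit displacement).
const PATCH_LEN: u64 = 5;

/// Encode `call target` as it would be written at address `site`.
/// Returns None when the target is outside the rel32 range.
fn encode_call_patch(site: u64, target: u64) -> Option<[u8; 5]> {
    // rel32 is relative to the address of the instruction after the call.
    let next_insn = site.wrapping_add(PATCH_LEN);
    let disp = target.wrapping_sub(next_insn) as i64;
    let rel32 = i32::try_from(disp).ok()?;
    let mut patch = [0xe8, 0, 0, 0, 0];
    patch[1..].copy_from_slice(&rel32.to_le_bytes());
    Some(patch)
}

/// Bytes a tracee would decode if it was blocked in the original syscall
/// and resumed at `site + 2` after the site had been overwritten.
fn bytes_seen_on_resume(patch: &[u8; 5]) -> &[u8] {
    &patch[SYSCALL_LEN..]
}
```

A task that resumes at `site + 2` starts decoding in the middle of the displacement field, so what it executes depends entirely on where the call target happens to be.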

Though we still cannot patch while a syscall is blocked, we can make the blocking window a lot shorter, for example by modifying the syscall parameters to make the call non-blocking. Another approach is to patch certain syscalls beforehand, so that we don't have to worry about them later.
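As an illustration of the first idea, the sketch below rewrites the arguments of a few syscalls that accept a per-call non-blocking flag. The `SyscallArgs` type and `make_nonblocking` function are hypothetical and not part of systrace; the syscall numbers and flag values are the x86-64 Linux ones.

```rust
// x86-64 Linux syscall numbers.
const SYS_SENDTO: u64 = 44;
const SYS_RECVFROM: u64 = 45;
const SYS_WAIT4: u64 = 61;
// Linux flag values.
const MSG_DONTWAIT: u64 = 0x40;
const WNOHANG: u64 = 1;

/// Hypothetical snapshot of a syscall's number and register arguments.
#[derive(Debug, Clone, PartialEq)]
struct SyscallArgs {
    nr: u64,
    args: [u64; 6],
}

/// Rewrite the arguments so the call returns immediately instead of blocking.
/// Returns false when the syscall has no per-call non-blocking flag.
fn make_nonblocking(call: &mut SyscallArgs) -> bool {
    match call.nr {
        // sendto/recvfrom(fd, buf, len, flags, ...): flags is argument 3.
        SYS_SENDTO | SYS_RECVFROM => {
            call.args[3] |= MSG_DONTWAIT;
            true
        }
        // wait4(pid, wstatus, options, rusage): options is argument 2.
        SYS_WAIT4 => {
            call.args[2] |= WNOHANG;
            true
        }
        _ => false,
    }
}
```

Note that read/write on a pipe has no such per-call flag, so this trick alone would not cover the pipe-heavy case; the caller would also have to handle the "would block" result (for example EAGAIN) and retry.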

Building glibc easily exposes this issue: the build process seems to create tons of pipes, which causes lots of blocking reads and writes.

rrnewton commented 5 years ago

@wangbj - I need you to unpack this for me a bit further, because I don't understand why we need to ever allow the code to return to the instruction after the original syscall (PC = orig_syscall + 2).

If we turn the very 1st attempt to execute the syscall into a trap, then the handler runs before the syscall ever gets to execute, so it is effectively a prehook. If we do ultimately execute a blocking syscall, it should be via the untraced_syscall function, right? We should always call the captured_syscall function, irrespective of how the event was intercepted (trap or patched code site), right? There's not some way that individual syscall invocations slip through the cracks and don't get intercepted, is there? (Which would mean they can genuinely block at the syscall PC.)

In fact, I think the following theorem should hold in general:

Theorem: No syscall in the original app should ever be executed from its original address in the code (except the single instruction inside the body of untraced_syscall)

If this theorem is false for our design (and worse, cannot be made true), then I want to understand why.

wangbj commented 5 years ago

You're right, there's a bug in handling ptrace_event_exec: the patched_syscalls field should be zeroed, because exec* replaces the entire program's code and data. The issue you mentioned should be fixed by commit 80e47d65. But we still need to patch the same syscalls again for every newly exec*-ed process.

rrnewton commented 5 years ago

Wait, so does the theorem hold? It's hard for me to understand how that linked patch connects to the issue of patching blocking syscalls (which is an issue even if we never call fork/exec, right?).

wangbj commented 5 years ago

I believe so. There's a patched_syscall member for each task (or tracee) that keeps a record of patched syscall sites. When we exec, this field should have been cleared, because the old patched_syscall doesn't apply to the new task (at least for now), as exec just creates a brand new context.

To elaborate: we won't try to patch a syscall site if it is already recorded in patched_syscall, which is why we see lots of syscalls going through seccomp instead.
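The bookkeeping described above can be sketched roughly as follows. Only the `patched_syscall` name comes from this thread; the `Task` and `Action` types, the method names, and the choice of a `HashSet<u64>` of site addresses are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// What the tracer does when a syscall site traps (hypothetical).
#[derive(Debug, PartialEq)]
enum Action {
    /// First time we see this site: try to patch it.
    Patch,
    /// Site already recorded: don't patch again, handle via seccomp.
    SeccompOnly,
}

/// Hypothetical per-tracee state.
#[derive(Default)]
struct Task {
    /// Addresses of syscall sites already recorded as patched.
    patched_syscall: HashSet<u64>,
}

impl Task {
    /// Decide how to handle a syscall trapped at `site`.
    fn on_syscall_site(&mut self, site: u64) -> Action {
        // `insert` returns true only if the address was not recorded yet.
        if self.patched_syscall.insert(site) {
            Action::Patch
        } else {
            Action::SeccompOnly
        }
    }

    /// exec replaces the whole image, so old records no longer apply.
    fn on_exec(&mut self) {
        self.patched_syscall.clear();
    }
}
```

Without the `clear()` in `on_exec`, a site in the new image that happens to share an address with an old record would never be patched and would keep going through seccomp.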

chamibuddhika commented 5 years ago

I have a query which I feel is related to this discussion. What is the control flow by which handle_syscall_exit is reached?

https://github.com/iu-parfunc/systrace/blob/58e6b261c9035cc912c61847255289f1ae8b0530/src/traced_task.rs#L770

wangbj commented 5 years ago

> I have a query which I feel is related to this discussion. What is the control flow by which handle_syscall_exit is reached?

This is the seccomp syscall exit. It is caused by calling ptrace(PTRACE_SYSCALL, pid, ...) while in the seccomp syscall-enter stop.
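That control flow can be modelled as a tiny state machine. This is a sketch with hypothetical types and makes no real ptrace calls; it assumes Linux 4.8 or later, where the seccomp stop occurs between the syscall-entry and syscall-exit stops.

```rust
/// The ptrace stops relevant here (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Stop {
    /// PTRACE_EVENT_SECCOMP stop on syscall entry.
    SeccompEnter,
    /// Syscall-exit stop, where handle_syscall_exit would run.
    SyscallExit,
}

/// How the tracer resumes the tracee.
#[derive(Debug, Clone, Copy)]
enum Resume {
    /// PTRACE_CONT: run until the next seccomp event or signal.
    Cont,
    /// PTRACE_SYSCALL: also stop at the next syscall boundary.
    Syscall,
}

/// Next stop for the same syscall, if any, given how the tracee is resumed.
fn next_stop_for_same_syscall(current: Stop, resume: Resume) -> Option<Stop> {
    match (current, resume) {
        // Resuming a seccomp stop with PTRACE_SYSCALL yields the exit stop.
        (Stop::SeccompEnter, Resume::Syscall) => Some(Stop::SyscallExit),
        // Otherwise this syscall produces no further stop.
        _ => None,
    }
}
```

In this model, the exit handler is reached only when the seccomp enter stop is resumed with PTRACE_SYSCALL rather than PTRACE_CONT.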

reverie-rs / reverie

optimize how we patch blocking syscalls #23