[RFC] refactor how tool is written

systrace can load a tool shared library (a "tool") via the --tool switch. A tool essentially implements the captured_syscall C API: once systrace has successfully patched a syscall site, it generates a trampoline that jumps to captured_syscall, which lets us intercept the original syscall.

The tool is loaded by systrace using LD_PRELOAD, hence it is not usable until LD_PRELOAD has finished. By then ld-linux.so has already made 20+ syscalls, and those are not catchable by the tool. For now this is a hard limitation; however, we can still catch them with SECCOMP.

Once the tool is (LD_PRE)loaded, systrace tries to patch any syscall with predefined rules (in src/bpf.c). Please note we only apply patching when the syscall and the following instructions match our predefined pattern; hence, if there's no pattern match, patching does not occur. This makes writing interception code cumbersome, because not all syscalls can be turned into a captured_syscall function call in the tracee's memory space.

The plan is that when such a case happens, we use the ptrace SECCOMP stop to inject captured_syscall, forcing the tracee to make this very function call. It is relatively easy to inject real syscalls, and we've done that many times in the past. However, captured_syscall is a regular C function (written in Rust), and it could use MMX/SSE registers, so it is more difficult to inject from the tracer. Nonetheless, it should be possible with proper xsave/xrstor instructions.

In the future, we might install a second seccomp rule in the tool's init function, so that we can either patch the syscall in the tracee's memory space or intercept it in a SIGSYS signal handler. This also has risks: decoding ucontext in the signal handler seems complicated, and redirecting control flow within the same task seems more difficult than with ptrace.

The tool library runs in the tracee's memory space; however, because we intercept raw syscalls, we must be very careful to avoid deadlocks. For example, doing allocations could be dangerous, and drop (inserted by Rust) could be dangerous as well, because it may call pthread_xxx, which may in turn call the futex syscall. Even when there's no deadlock, the extra syscalls can cause performance degradation. Thus the tool must be written under very strong constraints.

We also have a choice between std and no_std. Using no_std allows the tool to have no dependencies on any external library (including libc); because of that, we can rewrite the seccomp filters to allow all syscalls inside the tool's memory range (found by checking procfs). However, the no_std variant is a lot more difficult to write, is less documented, and has fewer libraries and features.

After several discussions, our captured_syscall could look like:

pub extern "C" fn captured_syscall(
    p: &mut ProcessState,
    t: &mut ThreadState,
    a: &Args);

ProcessState holds resources shared among threads, such as Unix file descriptors, signal handlers, etc., while ThreadState holds resources local to each thread. The hard part is that our trampoline, like a regular syscall, doesn't know anything except the syscall number and six arguments. We could allocate ProcessState during the ptrace exec event, and allocate ThreadState in both the exec event and the fork/vfork/clone events. However, because the heap belongs to the tracee only, it could be quite difficult to prepare those data structures in the tracer, even with the help of Serialize/Deserialize. It might be possible to abuse injected function calls once again, or we could rewrite all tracees' global allocators, forcing them to use the same heap preallocated by the tracer. This isn't any easier by any means: the tracer would need to expose some APIs to claim/reclaim memory to the tracees, so that the tracees could use the exposed APIs to implement their own global allocator. It also seems very unsafe, because every tracee would have access to the global heap shared among the tracer and all tracees.
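To make the process/thread split more concrete, here is a minimal sketch of what the three types might contain. Every field name below is an assumption made up for illustration; the RFC only fixes the three parameter types. The `i64` return type is also an assumption (that the tool hands the syscall result back to the trampoline), and the body only does bookkeeping instead of forwarding a real syscall.

```rust
/// Hypothetical per-process state: shared by every thread of one tracee.
#[repr(C)]
#[derive(Debug, Default)]
pub struct ProcessState {
    /// Total syscalls intercepted in this process (illustrative field).
    pub nr_syscalls: u64,
}

/// Hypothetical per-thread state: owned by exactly one thread.
#[repr(C)]
#[derive(Debug, Default)]
pub struct ThreadState {
    /// Last syscall number seen on this thread (illustrative field).
    pub last_syscall: i32,
}

/// Everything the trampoline knows: the syscall number plus six arguments.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
pub struct Args {
    pub no: i32,
    pub args: [i64; 6],
}

/// Sketch only: updates the bookkeeping and returns -ENOSYS (-38 on Linux)
/// where a real tool would forward or emulate the syscall.
pub extern "C" fn captured_syscall(
    p: &mut ProcessState,
    t: &mut ThreadState,
    a: &Args,
) -> i64 {
    p.nr_syscalls += 1;
    t.last_syscall = a.no;
    -38
}
```

Note that nothing here allocates or drops heap data, which is the constraint the state types would have to respect inside the tracee.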

> please note we only apply patching when the syscall and following instructions match our predefined pattern, hence, if there's no pattern match, patching would not occur

To clarify, by pattern you mean instruction patterns that can be easily patched right?

> This makes writing interception code cumbersome, because not all syscalls are catchable into captured_syscall function call in tracee's memory space

Because they were not patched, instead they were caught by SECCOMP which traps on a ptrace tracer?

> however captured_syscall is a regular C function (written in rust), and it could use mmx/sse registers

You're worried about these registers being clobbered here, since classically we only save/restore the more common CPU registers.

> allocations could be dangerous, drop (inserted by rust) could be dangerous as well, because it may call pthread_xxx, which then may call futex syscall.

So we're worried about the Rust standard library making system calls as part of its work.

> however, no_std variant is a lot more difficult to write, less documented, and have less libraries and features

We would basically have to roll our own data structures and make system calls ourselves. Granted, this would be no different had we done it in C, right? Assuming we don't need anything too fancy, we could write our own mini-libc with just the functionality that we need. Write it once and use it everywhere? While technically unsafe, we could wrap our functions and data structures in safe interfaces.
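As a rough sketch of what "calling system calls ourselves" behind a safe interface could look like, the snippet below issues a raw syscall with inline assembly and wraps one call safely. It assumes Linux on x86_64 only; the names `raw_syscall6` and `getpid` are made up for this example and are not part of systrace.

```rust
use core::arch::asm;

/// Issues a raw Linux x86_64 syscall without going through libc.
/// Returns the kernel's raw result (a negative errno on failure).
///
/// # Safety
/// The caller must pass arguments that are valid for syscall `no`.
pub unsafe fn raw_syscall6(
    no: i64, a0: i64, a1: i64, a2: i64, a3: i64, a4: i64, a5: i64,
) -> i64 {
    let ret: i64;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") no => ret,
            in("rdi") a0,
            in("rsi") a1,
            in("rdx") a2,
            in("r10") a3,
            in("r8") a4,
            in("r9") a5,
            // The syscall instruction itself clobbers rcx and r11.
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack),
        );
    }
    ret
}

/// Safe wrapper: getpid (syscall 39 on x86_64) takes no arguments and
/// cannot fail, so exposing it as a safe function is sound.
pub fn getpid() -> i64 {
    unsafe { raw_syscall6(39, 0, 0, 0, 0, 0, 0) }
}
```

Only `core` is used, so the same code should also build in a no_std crate.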

> or we could rewrite all tracees' global allocator, forcing them to use the same heap preallocated by the tracer

I prefer the approach of avoiding the Rust stdlib altogether and hand-managing data structures and memory.
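For illustration, hand-managed memory could be as simple as a fixed-capacity bump arena that never calls into libc or the kernel. This is only a sketch under assumptions: `BumpArena` is a hypothetical name, nothing is ever freed individually, and the arena must not be moved after the first allocation because alignment is computed from its current address.

```rust
/// Fixed-capacity bump arena: memory is carved out of an inline buffer,
/// so allocating never triggers a syscall or takes a lock.
pub struct BumpArena<const N: usize> {
    buf: [u8; N],
    used: usize,
}

impl<const N: usize> BumpArena<N> {
    pub const fn new() -> Self {
        Self { buf: [0; N], used: 0 }
    }

    /// Hands out `size` bytes aligned to `align` (a power of two), or
    /// `None` when the request is invalid or the arena is exhausted.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<&mut [u8]> {
        if !align.is_power_of_two() {
            return None;
        }
        let base = self.buf.as_ptr() as usize;
        // Round the next free address up to the requested alignment.
        let start = base.checked_add(self.used)?.checked_add(align - 1)? & !(align - 1);
        let offset = start - base;
        let end = offset.checked_add(size)?;
        if end > N {
            return None;
        }
        self.used = end;
        Some(&mut self.buf[offset..end])
    }

    /// Number of bytes consumed so far, including alignment padding.
    pub fn used(&self) -> usize {
        self.used
    }
}
```

A tool could place one such arena in each state structure, which keeps allocation free of the futex and mmap calls discussed above.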

> To clarify, by pattern you mean instruction patterns that can be easily patched right?

Yes, most syscalls have similar patterns, such as:
0f 05                   syscall 
48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
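
As a rough illustration of the matching step, the sketch below scans a byte slice for exactly the sequence shown above. It is not how systrace implements its rules; `SYSCALL_CMP_PATTERN` and `find_patch_sites` are hypothetical names, and this is meant as tracer-side code, so using `Vec` is acceptable here.

```rust
/// Bytes of `syscall` followed by `cmp $0xfffffffffffff000,%rax`.
pub const SYSCALL_CMP_PATTERN: [u8; 8] =
    [0x0f, 0x05, 0x48, 0x3d, 0x00, 0xf0, 0xff, 0xff];

/// Returns every offset in `code` at which the pattern starts.
/// Syscall sites that are not followed by the `cmp` are not reported,
/// which mirrors "no pattern match, no patching".
pub fn find_patch_sites(code: &[u8]) -> Vec<usize> {
    let mut sites = Vec::new();
    for (offset, window) in code.windows(SYSCALL_CMP_PATTERN.len()).enumerate() {
        if window == SYSCALL_CMP_PATTERN.as_slice() {
            sites.push(offset);
        }
    }
    sites
}
```

A `syscall` instruction followed by anything else is left alone and would have to be handled through the SECCOMP path.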

> Because they were not patched, instead they were caught by SECCOMP which traps on a ptrace tracer?

Right.

> You're worried about these registers being clobbered here. Since classically we only save/restore the more common CPU registers.

Yes, for syscalls basically we only have to:

push parameters and return address onto tracee's stack
save caller saved registers
set syscall registers (rax+6 args)
syscall
restore caller saved registers
adjust sp and do a retq

Of course, if we have ptrace stops or can use a breakpoint instruction, it would be even easier. For regular function calls, in addition to saving the caller-saved registers (rax/rdi/rsi/rdx/rcx/r8/r9/r10/rbx), we also have to save the FP registers and the xmm/ymm registers. There are instructions like xsave/xrstor, so it should be possible.

> So we're worried about Rust standard library doing system calls as part of the work.

Yes, Rust makes that quite implicit (even more so than C++), so we need to be careful.

> We would basically have to roll out our own data structures and call system calls ourselves. Granted this would be no different had we done it in C right? Assuming we don't need anything too fancy, we could insert our own mini-libc or functionality that we need. Write it once and use it everywhere? While technically unsafe, we could wrap our functions and data structures in safe interfaces.

Right, with C we actually have more direct control over how the tool is linked; with Rust it is harder. For instance, with C we can build libc.a from musl-libc, link our tool statically against libc.a, then use objcopy -G<symbol_a> -G<symbol_b> ... to control symbol visibility. With Rust, I've found no_std is the only way to achieve that so far. Rust does have a musl target, but it doesn't work well with cdylib, at least with +crt-static (for cdylib).

I think using no_std is the better choice too, although, as mentioned, it has its own downsides.

> forcing them use the same heap preallocated by the tracer

Are you referring here to the "shared global memory" option (rather than the message-passing/RPC approach to globalState)? We have a complicated decision tree of possible futures we're considering, so good to clarify which branch we're on ;-).

> because of that, we can rewrite the seccomp filters, allowing all syscalls inside tool memory range (by checking procfs)

Why is this additional "whitelisting" approach specific to no_std only? Even if you have a tool/plugin that uses full featured libc + Rust stdlib, as long as everything is statically linked, couldn't you in principle whitelist all code inside that tool?

The prerequisite is to make sure the tool shared library is a standalone library that doesn't link against any other libraries, so that everything is self-contained. If that guarantee holds, then we know all of its syscall instructions are self-contained as well, so we can create a filter that whitelists all syscalls made from within the tool.

It would not work if the tool linked against an external library such as glibc, because when the tool calls read@glibc, it would escape the whitelist, and we're not whitelisting glibc syscalls.
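
To sketch the "by checking procfs" part: the address range to whitelist could be recovered from the tracee's /proc/<pid>/maps by keeping only the executable mappings that belong to the tool. The function name and the library name `libtool.so` used below are placeholders, and paths containing spaces are not handled; turning the resulting ranges into a seccomp filter is not shown.

```rust
/// Finds the executable address ranges mapped from `lib_name` in the text
/// of a /proc/<pid>/maps file. Returns (start, end) pairs, end exclusive.
pub fn executable_ranges(maps: &str, lib_name: &str) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    for line in maps.lines() {
        // Line format: start-end perms offset dev inode [pathname]
        let mut fields = line.split_whitespace();
        let (Some(range), Some(perms)) = (fields.next(), fields.next()) else {
            continue;
        };
        // Skip offset, dev and inode; anonymous mappings have no pathname.
        let path = fields.nth(3).unwrap_or("");
        if !perms.contains('x') || !path.ends_with(lib_name) {
            continue;
        }
        let Some((start, end)) = range.split_once('-') else {
            continue;
        };
        if let (Ok(s), Ok(e)) = (
            u64::from_str_radix(start, 16),
            u64::from_str_radix(end, 16),
        ) {
            ranges.push((s, e));
        }
    }
    ranges
}
```

With a statically self-contained tool, every syscall instruction it executes would have an instruction pointer inside one of these ranges, whereas a call into glibc would land in a different mapping and fall outside them.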
