Most system calls can block for one reason or another. Take mmap: I'm not sure whether it blocks every time. So maybe we only need to handle a subset of truly blocking syscalls, such as:
blocking on FDs: read/write/select/poll/epoll_wait
blocking on PIDs: the wait4 syscall
blocking on futexes: the futex syscall
blocking on signals: the rt_sigtimedwait syscall
blocking on timers
For the rest (e.g. mmap), any blocking should be considered transient, and we shouldn't reschedule the task (that is, add it to the blocked queue).
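The split above could be encoded as a small lookup table. This is only a sketch: the category names and both helper functions are made up for illustration, epoll_wait stands in for the epoll family, and timers are left out because no specific syscall is named for them.

```python
# Hypothetical table: syscall name -> kind of resource it may truly block on.
# Category names ("fd", "pid", ...) are illustrative, not from any real API.
BLOCKING_SYSCALLS = {
    "read": "fd",
    "write": "fd",
    "select": "fd",
    "poll": "fd",
    "epoll_wait": "fd",
    "wait4": "pid",
    "futex": "futex",
    "rt_sigtimedwait": "signal",
}


def blocking_kind(syscall_name):
    """Return what the syscall may block on, or None if blocking is transient."""
    return BLOCKING_SYSCALLS.get(syscall_name)


def should_move_to_blocked_queue(syscall_name):
    """Only syscalls in the table are candidates for the blocked queue."""
    return blocking_kind(syscall_name) is not None
```

With this, mmap falls through to None and the task is never moved to the blocked queue, which matches the "transient" treatment described above.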
We can read /proc/<pid>/status to check the state of task <pid>, but this is racy. It would be better to have an event-based system: when a task switch happens, a notification is sent to the tracer. This is pretty hard to implement.
My thought was that when a tracee is context-switched, a signal should be sent to the tracer. This can be done; however, in the tracer's signal handler there's no way to tell where the signal came from: the siginfo_t only has a valid _sigpoll struct, that is, only si_band and si_fd are valid. But we cannot use si_fd, because the fd belongs to the tracee. There's a crazy idea: set each tracee's perf_event fd to tid+fd_offset, so that si_fd alone would be enough for the tracer's signal handler to tell the origin of the signal. But I think that would be too absurd to implement.
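For reference, the polling approach via /proc/<pid>/status could look roughly like the sketch below. The function names are made up for illustration. It relies only on the `State:` line of the status file (for example `State:	S (sleeping)`), and it has exactly the race described above: the task may already have changed state by the time the function returns.

```python
def parse_task_state(status_text):
    """Extract the one-letter state code from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # "State:\tS (sleeping)" -> ["S", "(sleeping)"]
            fields = line.split(":", 1)[1].split()
            return fields[0] if fields else None
    return None


def read_task_state(pid):
    """Take a (possibly stale) snapshot of the task state. Linux only."""
    with open(f"/proc/{pid}/status") as f:
        return parse_task_state(f.read())
```

A tracer would treat `S` or `D` as "probably blocked" and `R` as runnable, but only as a hint, since nothing prevents a switch between the read and the decision.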
We could also use Linux trace events via ftrace, by enabling certain sched events. This would also be pretty hard to implement.
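To give a feel for the ftrace route, here is a minimal sketch that enables the sched_switch event and parses one line of its text output. The tracefs mount point is an assumption (it is commonly /sys/kernel/tracing, or /sys/kernel/debug/tracing on older setups), writing to it needs root, and the helper names are made up. A real tracer would still have to stream trace_pipe and filter by tracee tid, which is where the difficulty lies.

```python
import re

# Assumed tracefs mount point; adjust for the system at hand.
TRACEFS = "/sys/kernel/tracing"

# Matches the text form of a sched_switch event, e.g.
# "... sched_switch: prev_comm=a prev_pid=1 prev_prio=120 prev_state=S ==> next_comm=b next_pid=2 next_prio=120"
SCHED_SWITCH_RE = re.compile(
    r"sched_switch: .*?prev_pid=(?P<prev_pid>\d+) .*?prev_state=(?P<prev_state>\S+)"
    r" ==> .*?next_pid=(?P<next_pid>\d+)"
)


def enable_sched_switch(tracefs=TRACEFS):
    """Turn on the sched_switch trace event (requires root)."""
    with open(f"{tracefs}/events/sched/sched_switch/enable", "w") as f:
        f.write("1")


def parse_sched_switch(line):
    """Return (prev_pid, prev_state, next_pid), or None for other lines."""
    m = SCHED_SWITCH_RE.search(line)
    if m is None:
        return None
    return int(m["prev_pid"]), m["prev_state"], int(m["next_pid"])
```

Here prev_state tells us why the outgoing task left the CPU: `S` or `D` means it went to sleep, while `R` means it was preempted while still runnable.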
Another way is to use bcc, which allows us to install kernel probes dynamically (as eBPF programs loaded into the kernel), with the following downsides:
hard to implement, at least as hard as using ftrace
requires real root privileges
has license limitations: you cannot use a BSD/MIT license alone; the best compromise is a dual BSD/GPL license, and I have no idea what that would actually imply
not language agnostic: bcc supports C++/Python/Lua only; using other programming languages (such as Rust) relies on third-party bindings