reverie-rs / reverie

trace and intercept linux syscalls.

Gather full per-process statistics for a make job with current bad slowdown #37

Open rrnewton opened 5 years ago

rrnewton commented 5 years ago

@wangbj - which make job was it that had 4X "count tool" slowdown currently?

Let's focus on that job for the moment and build up a couple CSV datasets representing:

1. aggregate information per-process/per-patch-site, including counts for intercepted events, but also realtime and the number of instructions executed by the process; and
2. temporal information (e.g. a log of syscalls, showing which were "early" -- before our dylib finishes loading -- and tagging these log entries with the patch-site locations too).

Of course a really detailed log could be post-processed to generate whatever counts we want too. A variant of (1) would be count information at the level of libc calls augmenting the syscall-/instruction-level information. This would help answer questions of the form:

How many patched syscalls fail to amortize (defined as being called fewer than N times in that process, where we need to empirically characterize N)?

Or:

For short-lived (< 100ms) processes, what are the top k libc functions for which hooking would yield the largest reduction in the number of non-amortized patch sites?
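
To make dataset (1) and the amortization question concrete, here is a minimal Rust sketch of a per-process/per-patch-site row and the "fewer than N calls" query over it. The field names and CSV layout are hypothetical, not an existing systrace format:

    /// Hypothetical row for dataset (1): one record per (process, patch site).
    struct PatchSiteRow {
        pid: u32,
        exe: String,       // binary the process ran
        patch_site: u64,   // address of the patched syscall site
        syscall: String,   // syscall name at that site
        calls: u64,        // times this patched site was hit in this process
        wall_ns: u64,      // process wall-clock time, in nanoseconds
        instructions: u64, // instructions retired by the process
    }

    impl PatchSiteRow {
        /// One line of the per-process/per-patch-site CSV.
        fn to_csv(&self) -> String {
            format!(
                "{},{},{:#x},{},{},{},{}",
                self.pid, self.exe, self.patch_site, self.syscall,
                self.calls, self.wall_ns, self.instructions
            )
        }
    }

    /// How many patch sites failed to amortize, i.e. were hit fewer than `n`
    /// times in their process? (N itself still has to be characterized
    /// empirically, per the question above.)
    fn non_amortized(rows: &[PatchSiteRow], n: u64) -> usize {
        rows.iter().filter(|r| r.calls < n).count()
    }

    fn main() {
        let rows = vec![PatchSiteRow {
            pid: 4242,
            exe: "cc1".into(),
            patch_site: 0x7f00_1234_5678,
            syscall: "read".into(),
            calls: 3,
            wall_ns: 12_000_000,
            instructions: 9_000_000,
        }];
        println!("{}", rows[0].to_csv());
        println!("non-amortized sites (N = 10): {}", non_amortized(&rows, 10));
    }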

A conversation with @mikerainey could help shed light on algorithm design for amortizing these overheads in a provably efficient way (that's what he specializes in, but in the different context of parallelism overheads).

wangbj commented 5 years ago

The systrace (release mode) master branch has about a 100% slowdown. If we make systrace single-threaded like detTrace, then we have about a 4X slowdown.

time when building glibc-2.27:

    name    w/o systrace    w/ systrace
    real    2m26.482s       4m3.677s
    user    6m48.702s       0m53.472s
    sys     1m13.110s       2m1.536s

Getting a very detailed log per-process, per-patch-site, per-syscall is a bit complicated. I plan to send the events over a unix domain socket to a systrace server; what do you think?
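
A minimal sketch of that idea, assuming a made-up socket path and a fixed-size record of (pid, patch-site address, syscall number); none of this is the actual systrace protocol:

    use std::convert::TryInto;
    use std::io::{Read, Write};
    use std::os::unix::net::{UnixListener, UnixStream};
    use std::thread;

    // Hypothetical socket path; a real tool would likely pick a per-run path.
    const SOCK: &str = "/tmp/systrace-events.sock";

    /// Collector ("systrace server"): accept connections from tracees and
    /// log each fixed-size event record it receives.
    fn collector() -> std::io::Result<()> {
        let _ = std::fs::remove_file(SOCK);
        let listener = UnixListener::bind(SOCK)?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            thread::spawn(move || {
                let mut buf = [0u8; 24]; // pid (u64) + patch-site addr (u64) + syscall nr (u64)
                while stream.read_exact(&mut buf).is_ok() {
                    let pid = u64::from_le_bytes(buf[0..8].try_into().unwrap());
                    let site = u64::from_le_bytes(buf[8..16].try_into().unwrap());
                    let nr = u64::from_le_bytes(buf[16..24].try_into().unwrap());
                    // A real collector would append to a per-process CSV/log here.
                    println!("pid={} patch_site={:#x} syscall={}", pid, site, nr);
                }
            });
        }
        Ok(())
    }

    /// Tracee/tracer side: ship one event record to the collector.
    fn report_event(pid: u64, site: u64, nr: u64) -> std::io::Result<()> {
        let mut stream = UnixStream::connect(SOCK)?;
        let mut rec = Vec::with_capacity(24);
        rec.extend_from_slice(&pid.to_le_bytes());
        rec.extend_from_slice(&site.to_le_bytes());
        rec.extend_from_slice(&nr.to_le_bytes());
        stream.write_all(&rec)
    }

    fn main() -> std::io::Result<()> {
        thread::spawn(|| collector().unwrap());
        thread::sleep(std::time::Duration::from_millis(100)); // let the collector bind first
        report_event(4242, 0x7f00_1234_5678, 0)?; // e.g. a read(2) hit at a patched site
        thread::sleep(std::time::Duration::from_millis(100)); // give the collector time to print
        Ok(())
    }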

FYI the counter tool has a --show-perf-stats command line switch, though it is very limited:

    nr_syscalls: 38425456,
    nr_syscalls_ptraced: 4233871,
    nr_syscalls_patched: 754944,
    nr_syscalls_captured: 34191585,
    nr_read_retries: 0,
    nr_write_retries: 0,
    nr_getrandom: 0,
    nr_urandom_opens: 0,
    nr_random_opens: 0,
    nr_time_calls: 0,
    nr_total_replays: 0,
    nr_blocking_replays: 0,
    nr_injected_syscalls: 0,
    nr_rdtsc_events: 0,
    nr_rdtscp_events: 0,
    nr_cloned: 1,
    nr_forked: 44646,
    nr_exited: 44648,
    nr_process_spawns: 42271,
syscalls ptraced (slow): 11.02%
syscalls captured(w/ patching): 88.98%
syscalls captured(wo/ patching): 87.02%
rrnewton commented 5 years ago

Logging doesn't have to be high performance -- can't you just dump it all out to a file or stderr or whatever?

For documenting perf overhead, it seems like there should be three variants/modes for systrace:

So does the 100% overhead quoted above represent a fully safe multithreaded version, or an unsafe one?

Because systrace has more of a "global" view of the execution than liteinst does, it's very interesting that it could have a dynamic single-threaded optimization. (I.e., main starts out single-threaded, and you can patch cheaply until other threads are forked.) Maybe @wangbj's protocol leverages this; I don't remember it well enough atm.
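
A rough sketch of what that dynamic optimization could look like on the tracer side; `patch_site_fast`, `patch_site_stop_the_world`, and the clone/exit hooks are hypothetical names for illustration, not systrace APIs:

    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Live thread count in the traced process, updated from clone/exit events.
    static LIVE_THREADS: AtomicUsize = AtomicUsize::new(1);

    fn on_clone_event() {
        LIVE_THREADS.fetch_add(1, Ordering::SeqCst);
    }

    fn on_thread_exit_event() {
        LIVE_THREADS.fetch_sub(1, Ordering::SeqCst);
    }

    /// While the tracee is still single-threaded, no other tracee thread can be
    /// executing the instructions being rewritten, so patching can take a cheap
    /// path; once more threads exist, fall back to the safe, expensive path.
    fn patch_site(addr: u64) {
        if LIVE_THREADS.load(Ordering::SeqCst) == 1 {
            patch_site_fast(addr); // hypothetical cheap path
        } else {
            patch_site_stop_the_world(addr); // hypothetical safe path
        }
    }

    fn patch_site_fast(_addr: u64) { /* write the jump/trampoline directly */ }
    fn patch_site_stop_the_world(_addr: u64) { /* stop all tracee threads, then patch */ }

    fn main() {
        patch_site(0x1000); // still single-threaded: cheap path
        on_clone_event();
        patch_site(0x2000); // now multi-threaded: safe path
        on_thread_exit_event();
    }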

wangbj commented 5 years ago

It is the safe-multi-threaded version.

The document can be found at:

https://github.com/iu-parfunc/systrace/blob/master/docs/syscall-patch.md

rrnewton commented 5 years ago

There are a couple things I don't understand about the output of --show-perf-stats.

> syscalls captured(wo/ patching): 87.02%

These don't add to 100%!

    nr_syscalls_patched: 754944,

Does that refer to the number of patch sites in the code? That seems high! But this is across 42,271 total processes, so perhaps not. Still, this is a totally excessive amount of patching that we need to reduce w/ any of those techniques we've discussed [1].

These statistics are good, but they lose a lot of information by summing across all processes. It would be good to get a "row per process", with rows that sum to the quoted numbers. It would also be good to start breaking things down per-syscall-type and per-patch-site, to see empirically where the highest overheads are coming from and start hacking away at them.

[1] prepatching, LD_PRELOAD, in-process patching w/ SIGSYS, skipping some syscalls, etc...

wangbj commented 5 years ago

> syscalls ptraced (slow): 11.02%

These are un-patched syscalls; they are handled by ptrace.

> syscalls captured(w/ patching): 88.98%

Captured syscalls, with the patch applied.

> syscalls captured(wo/ patching): 87.02%

Captured syscalls where the patch had already been applied; these syscalls are transparent to the tracer.
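
For reference, assuming exactly those definitions, the percentages line up with the raw counters quoted earlier:

    nr_syscalls_ptraced / nr_syscalls                          =  4233871 / 38425456 ≈ 11.02%  (ptraced, slow)
    nr_syscalls_captured / nr_syscalls                         = 34191585 / 38425456 ≈ 88.98%  (captured, w/ patching)
    (nr_syscalls_captured - nr_syscalls_patched) / nr_syscalls = 33436641 / 38425456 ≈ 87.02%  (captured, wo/ patching)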

> Does that refer to the number of patch sites in the code? That seems high! But this is across 42,271 total processes, so perhaps not. Still, this is a totally excessive amount of patching that we need to reduce w/ any of those techniques we've discussed [1].

The patch count can be lower than the number of calls, because once a site is patched it stays patched; i.e. read might be patched only once, but could be called hundreds of times.