reverie-rs / reverie

trace and intercept linux syscalls.

Gather full per-process statistics for a make job with current bad slowdown #37

Open rrnewton opened 5 years ago

rrnewton commented 5 years ago

@wangbj - which make job was it that had 4X "count tool" slowdown currently?

Let's focus on that job for the moment and build up a couple CSV datasets representing:

1. aggregate information per-process/per-patch-site, including counts for intercepted events, but also realtime and the number of instructions executed by the process; and
2. temporal information (e.g. a log of syscalls, showing which were "early" -- before our dylib finishes loading -- and tagging these log entries with the patch-site locations too).

Of course a really detailed log could be post-processed to generate whatever counts we want too. A variant of (1) would be count information at the level of libc calls augmenting the syscall-/instruction-level information. This would help answer questions of the form:

How many patched syscalls fail to amortize (defined as being called fewer than N times in that process, where we need to empirically characterize N)?

Or:

For short-lived (< 100ms) processes, what are the top k libc functions for which hooking would yield the largest reduction in the number of non-amortized patch sites?
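
To make dataset (1) and the amortization question concrete, here is a minimal Rust sketch of a per-process/per-patch-site row and the "fewer than N calls" query over it. The field names and CSV layout are hypothetical, not an existing systrace format:

    /// Hypothetical row for dataset (1): one record per (process, patch site).
    struct PatchSiteRow {
        pid: u32,
        exe: String,       // binary the process ran
        patch_site: u64,   // address of the patched syscall site
        syscall: String,   // syscall name at that site
        calls: u64,        // times this patched site was hit in this process
        wall_ns: u64,      // process wall-clock time, in nanoseconds
        instructions: u64, // instructions retired by the process
    }

    impl PatchSiteRow {
        /// One line of the per-process/per-patch-site CSV.
        fn to_csv(&self) -> String {
            format!(
                "{},{},{:#x},{},{},{},{}",
                self.pid, self.exe, self.patch_site, self.syscall,
                self.calls, self.wall_ns, self.instructions
            )
        }
    }

    /// How many patch sites failed to amortize, i.e. were hit fewer than `n`
    /// times in their process? (N itself still has to be characterized
    /// empirically, per the question above.)
    fn non_amortized(rows: &[PatchSiteRow], n: u64) -> usize {
        rows.iter().filter(|r| r.calls < n).count()
    }

    fn main() {
        let rows = vec![PatchSiteRow {
            pid: 4242,
            exe: "cc1".into(),
            patch_site: 0x7f00_1234_5678,
            syscall: "read".into(),
            calls: 3,
            wall_ns: 12_000_000,
            instructions: 9_000_000,
        }];
        println!("{}", rows[0].to_csv());
        println!("non-amortized sites (N = 10): {}", non_amortized(&rows, 10));
    }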

A conversation with @mikerainey could help shed light on algorithm design for amortizing these overheads in a provably efficient way (that's what he specializes in, but in the different context of parallelism overheads).

wangbj commented 5 years ago

The systrace (release mode) master branch has about a 100% slowdown. If we make systrace single-threaded like detTrace, then we have about a 4X slowdown.

time when building glibc-2.27:

    name    w/o systrace    w/ systrace
    real    2m26.482s       4m3.677s
    user    6m48.702s       0m53.472s
    sys     1m13.110s       2m1.536s

Getting a very detailed log per-process, per-patch-site, per-syscall is a bit complicated. I plan to send the events over a unix domain socket to a systrace server; what do you think?
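
A minimal sketch of that idea, assuming a made-up socket path and a fixed-size record of (pid, patch-site address, syscall number); none of this is the actual systrace protocol:

    use std::convert::TryInto;
    use std::io::{Read, Write};
    use std::os::unix::net::{UnixListener, UnixStream};
    use std::thread;

    // Hypothetical socket path; a real tool would likely pick a per-run path.
    const SOCK: &str = "/tmp/systrace-events.sock";

    /// Collector ("systrace server"): accept connections from tracees and
    /// log each fixed-size event record it receives.
    fn collector() -> std::io::Result<()> {
        let _ = std::fs::remove_file(SOCK);
        let listener = UnixListener::bind(SOCK)?;
        for stream in listener.incoming() {
            let mut stream = stream?;
            thread::spawn(move || {
                let mut buf = [0u8; 24]; // pid (u64) + patch-site addr (u64) + syscall nr (u64)
                while stream.read_exact(&mut buf).is_ok() {
                    let pid = u64::from_le_bytes(buf[0..8].try_into().unwrap());
                    let site = u64::from_le_bytes(buf[8..16].try_into().unwrap());
                    let nr = u64::from_le_bytes(buf[16..24].try_into().unwrap());
                    // A real collector would append to a per-process CSV/log here.
                    println!("pid={} patch_site={:#x} syscall={}", pid, site, nr);
                }
            });
        }
        Ok(())
    }

    /// Tracee/tracer side: ship one event record to the collector.
    fn report_event(pid: u64, site: u64, nr: u64) -> std::io::Result<()> {
        let mut stream = UnixStream::connect(SOCK)?;
        let mut rec = Vec::with_capacity(24);
        rec.extend_from_slice(&pid.to_le_bytes());
        rec.extend_from_slice(&site.to_le_bytes());
        rec.extend_from_slice(&nr.to_le_bytes());
        stream.write_all(&rec)
    }

    fn main() -> std::io::Result<()> {
        thread::spawn(|| collector().unwrap());
        thread::sleep(std::time::Duration::from_millis(100)); // let the collector bind first
        report_event(4242, 0x7f00_1234_5678, 0)?; // e.g. a read(2) hit at a patched site
        thread::sleep(std::time::Duration::from_millis(100)); // give the collector time to print
        Ok(())
    }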

FYI the counter tool has a --show-perf-stats command line switch, though it is very limited:

    nr_syscalls: 38425456,
    nr_syscalls_ptraced: 4233871,
    nr_syscalls_patched: 754944,
    nr_syscalls_captured: 34191585,
    nr_read_retries: 0,
    nr_write_retries: 0,
    nr_getrandom: 0,
    nr_urandom_opens: 0,
    nr_random_opens: 0,
    nr_time_calls: 0,
    nr_total_replays: 0,
    nr_blocking_replays: 0,
    nr_injected_syscalls: 0,
    nr_rdtsc_events: 0,
    nr_rdtscp_events: 0,
    nr_cloned: 1,
    nr_forked: 44646,
    nr_exited: 44648,
    nr_process_spawns: 42271,
syscalls ptraced (slow): 11.02%
syscalls captured(w/ patching): 88.98%
syscalls captured(wo/ patching): 87.02%
rrnewton commented 5 years ago

Logging doesn't have to be high performance -- can't you just dump it all out to a file or stderr or whatever?

For documenting perf overhead, it seems like there should be three variants/modes for systrace:

So does the 100% overhead quoted above represent a fully safe multithreaded version, or an unsafe one?

Because systrace has more of a "global" view of the execution than liteinst does, it's very interesting that it could have a dynamic single-threaded optimization. (I.e., main starts out single-threaded, and you can patch cheaply until other threads are forked.) Maybe @wangbj's protocol leverages this; I don't remember it well enough atm.
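
A rough sketch of what that dynamic optimization could look like on the tracer side; `patch_site_fast`, `patch_site_stop_the_world`, and the clone/exit hooks are hypothetical names for illustration, not systrace APIs:

    use std::sync::atomic::{AtomicUsize, Ordering};

    /// Live thread count in the traced process, updated from clone/exit events.
    static LIVE_THREADS: AtomicUsize = AtomicUsize::new(1);

    fn on_clone_event() {
        LIVE_THREADS.fetch_add(1, Ordering::SeqCst);
    }

    fn on_thread_exit_event() {
        LIVE_THREADS.fetch_sub(1, Ordering::SeqCst);
    }

    /// While the tracee is still single-threaded, no other tracee thread can be
    /// executing the instructions being rewritten, so patching can take a cheap
    /// path; once more threads exist, fall back to the safe, expensive path.
    fn patch_site(addr: u64) {
        if LIVE_THREADS.load(Ordering::SeqCst) == 1 {
            patch_site_fast(addr); // hypothetical cheap path
        } else {
            patch_site_stop_the_world(addr); // hypothetical safe path
        }
    }

    fn patch_site_fast(_addr: u64) { /* write the jump/trampoline directly */ }
    fn patch_site_stop_the_world(_addr: u64) { /* stop all tracee threads, then patch */ }

    fn main() {
        patch_site(0x1000); // still single-threaded: cheap path
        on_clone_event();
        patch_site(0x2000); // now multi-threaded: safe path
        on_thread_exit_event();
    }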

wangbj commented 5 years ago

It is the safe-multi-threaded version.

The document can be found at:

https://github.com/iu-parfunc/systrace/blob/master/docs/syscall-patch.md

rrnewton commented 5 years ago

There are a couple things I don't understand about the output of --show-perf-stats.

> syscalls captured(wo/ patching): 87.02%

These don't add to 100%!

    nr_syscalls_patched: 754944,

Does that refer to the number of patch sites in the code? That seems high! But this is across 42,271 total processes, so perhaps not. Still, this is a totally excessive amount of patching that we need to reduce w/ any of those techniques we've discussed [1].

These statistics are good, but they lose a lot of information by summing across all processes. It would be good to get a "row per process", with rows that sum to the quoted numbers. It would also be good to start breaking things down per-syscall-type and per-patch-site, to see empirically where the highest overheads are coming from and start hacking away at them.

[1] prepatching, LD_PRELOAD, in-process patching w/ SIGSYS, skipping some syscalls, etc...

wangbj commented 5 years ago

> syscalls ptraced (slow): 11.02%

These are un-patched syscalls; they are handled by ptrace.

> syscalls captured(w/ patching): 88.98%

Captured syscalls, with the patch applied.

> syscalls captured(wo/ patching): 87.02%

Captured syscalls where the patch had already been applied; these syscalls are transparent to the tracer.
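
For reference, assuming exactly those definitions, the percentages line up with the raw counters quoted earlier:

    nr_syscalls_ptraced / nr_syscalls                          =  4233871 / 38425456 ≈ 11.02%  (ptraced, slow)
    nr_syscalls_captured / nr_syscalls                         = 34191585 / 38425456 ≈ 88.98%  (captured, w/ patching)
    (nr_syscalls_captured - nr_syscalls_patched) / nr_syscalls = 33436641 / 38425456 ≈ 87.02%  (captured, wo/ patching)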

> Does that refer to the number of patch sites in the code? That seems high! But this is across 42,271 total processes, so perhaps not. Still, this is a totally excessive amount of patching that we need to reduce w/ any of those techniques we've discussed [1].

The patch count can be lower than the number of calls, because once a site is patched it stays patched; i.e. read might be patched only once, but could be called hundreds of times.