Closed joneschrisg closed 10 years ago
One option available to us is preloading on top of NSPR symbols. FF only uses the NSPR primitives (to a very good first approximation), and not in any unusual ways. (I would know, because I wrote that code ;) .) I'm not sure if this makes the problem easier or harder, but the behaviors of the NSPR types are a strict subset of pthread primitives, and the users are a very small subset too.
I'm not sure how that would help. Presumably PR_Lock requires everything the common case of pthread_mutex_lock does, so we'd end up writing a similar amount of code. (I assume we would write a specialized pthread_mutex_lock override that handles the common cases we care about and punts exotic cases back to the real implementation.) And of course support for pthread_mutex_lock is more generally useful to non-Firefox projects.
I agree with all that, however, I still point back at the fact that with NSPR we're dealing with a smaller set of behaviors (even for the basic pthread options) and fewer users. This code is fantastically tricky to write, and libc/linux have historically gotten it wrong several times. Just pointing out that, as a first step, we have the option to make the problem easier.
I'm sure the answer will become clear when we start hacking on this.
Depends on #686.
This is the big win from #686, and probably the scariest change, so want to take this asap.
We're thankfully able to intercept the big futex hogs from #9, WAIT/WAKE/CMP_REQUEUE.
My patch is causing an interesting problem: we're failing to Task::write_bytes()
(and read_bytes()) at the first stack page. I know that the memory is accessible because PTRACE_PEEK_DATA can read it.
My patch seems to be working now, but it's not having as big of an effect on #9 as I would have thought. We make 3x fewer traced futex calls than in the most recent data there, but perf doesn't improve much. Two possibilities come to mind: most futex calls end up being desched'd, so buffering doesn't help much; the heavy futex-using threads don't ever initialize their buffers, because of #730.
With some quick 'n dirty librrpreload stats, I see
FUTEX STATS: unhandled cmds: 2600 no syscallbuf: 10711 buffered: 7594 (desched: 2451)
and the histogram-trace says, for the same run
count % .... syscall ----- ------- 42186 41.94% .... futex (240)
So sys_futex() vsyscall hook saw 20905 total futex calls, of which 15762 resulted in traps to rr. That's not good at all. And librrpreload missed hooking 26424 out of the 42186 futex traps to rr. That's also not good.
So we have these leads
About a third of the few calls we manage to buffer are being desched'd, but I don't have a feel for whether that's bad or not.
OK, very good news: my histogram script was double-counting calls (counting entry and exit). So the way things actually look are
FUTEX STATS (from librrpreload): unhandled cmds: 2608 no syscallbuf: 10667 buffered: 7668 (desched: 2466) ----------------------- total: 20943 count % .... syscall ----- ------- 21080 41.54% .... futex (240)
So all but about 100 futex calls went through librrpreload. The two callsites I caught were pthread_cond_broadcast() (FUTEX_CMP_REQUEUE) and pthread_cond_signal() (FUTEX_WAKE_OP), in the old FIXME: Until Ingo fixes 4G/4G vDSO, 6 arg syscalls are broken for sysenter.
code. That's worrying but we need more data.
So by far the biggest issue appears to be that half the futex calls are either not able to init syscallbuf, or are somehow disproportionately responsible for buffer flushing.
(I'd like to run these numbers for the layout/base tests, but those fork subprocesses which makes the rrpreload instrumentation hard to collect.)
So all but about 100 futex calls went through librrpreload.
Scratch that: there were 15741 calls from rrpreload that trapped to rr, out of 21080 that trapped total. So that means the CMP_REQUEUE's and WAKE_OP's must be accounting for the remaining ~5000 traps. Possibly sufficient motivation to patch glibc.
But this is still good news overall; we can buffer up to 75% of the futexes (from this run). Futex is almost 50% of the traced syscalls (and buffer flushes) so this is still the biggest potential net win.
There's a strange strange hang occurring with sigprocmask and clone. None of the involved tasks are at futex calls, so not quite sure what's up. I did see that we don't clear the thread-local buffer params before forking, which we need to guard against glibc fork code accidentally using the parent's syscallbuf after fork but before after_fork_child. Suspect something TLS related.
What's going on here is that a tracee initializes its syscallbuf, but then it runs out to a buffered FUTEX_WAIT while a sigmask is applied that's blocking SIGSYS. Because of #731, rr isn't being notified of the desched event and can't switch out the tracee. So I don't think the futex hooks are at fault, per se.
Followup to #5. Quoting there