joneschrisg commented 11 years ago

Followup to #5. Quoting there

As a note for posterity, glibc directly invokes futex; there's no unresolved symbol we can preload on top of. So we can't buffer it with our current approach. Shame.

Any idea which symbols/how many symbols we'd need to override via LD_PRELOAD to catch most futex traffic?

I would guess that pthreadmutex{lock, unlock} and cond_wait would get us the vast vast majority. Doing that is a little scary and nontrivial though. But it's relatively easy to quantify how worth doing it would be once we have the structure in place.

joneschrisg commented 11 years ago

One option available to us is preloading on top of NSPR symbols. FF only uses the NSPR primitives (to a very good first approximation), and not in any unusual ways. (I would know, because I wrote that code ;) .) I'm not sure if this makes the problem easier or harder, but the behaviors of the NSPR types are a strict subset of pthread primitives, and the users are a very small subset too.

rocallahan commented 11 years ago

I'm not sure how that would help. Presumably PR_Lock requires everything the common case of pthread_mutex_lock does, so we'd end up writing a similar amount of code. (I assume we would write a specialized pthread_mutex_lock override that handles the common cases we care about and punts exotic cases back to the real implementation.) And of course support for pthread_mutex_lock is more generally useful to non-Firefox projects.

joneschrisg commented 11 years ago

I agree with all that, however, I still point back at the fact that with NSPR we're dealing with a smaller set of behaviors (even for the basic pthread options) and fewer users. This code is fantastically tricky to write, and libc/linux have historically gotten it wrong several times. Just pointing out that, as a first step, we have the option to make the problem easier.

I'm sure the answer will become clear when we start hacking on this.

joneschrisg commented 10 years ago

Depends on #686.

joneschrisg commented 10 years ago

This is the big win from #686, and probably the scariest change, so want to take this asap.

joneschrisg commented 10 years ago

We're thankfully able to intercept the big futex hogs from #9, WAIT/WAKE/CMP_REQUEUE.

joneschrisg commented 10 years ago

My patch is causing an interesting problem: we're failing to Task::write_bytes() (and read_bytes()) at the first stack page. I know that the memory is accessible because PTRACE_PEEK_DATA can read it.

joneschrisg commented 10 years ago

My patch seems to be working now, but it's not having as big of an effect on #9 as I would have thought. We make 3x fewer traced futex calls than in the most recent data there, but perf doesn't improve much. Two possibilities come to mind: most futex calls end up being desched'd, so buffering doesn't help much; the heavy futex-using threads don't ever initialize their buffers, because of #730.

joneschrisg commented 10 years ago

With some quick 'n dirty librrpreload stats, I see

FUTEX STATS:
  unhandled cmds: 2600
   no syscallbuf: 10711
        buffered: 7594  (desched: 2451)

and the histogram-trace says, for the same run

count        %  .... syscall
-----   -------
 42186   41.94% .... futex (240)

So sys_futex() vsyscall hook saw 20905 total futex calls, of which 15762 resulted in traps to rr. That's not good at all. And librrpreload missed hooking 26424 out of the 42186 futex traps to rr. That's also not good.

So we have these leads

where are the non-vsyscall-hook futex traps coming from
why are 60% of the futex calls that are making it to rr's hook, not seeing an initialized syscallbuf. This could #730.

About a third of the few calls we manage to buffer are being desched'd, but I don't have a feel for whether that's bad or not.

joneschrisg commented 10 years ago

OK, very good news: my histogram script was double-counting calls (counting entry and exit). So the way things actually look are

FUTEX STATS (from librrpreload):
  unhandled cmds: 2608
   no syscallbuf: 10667
        buffered: 7668  (desched: 2466)
 -----------------------
           total: 20943
count        %  .... syscall
-----   -------
 21080   41.54% .... futex (240)

So all but about 100 futex calls went through librrpreload. The two callsites I caught were pthread_cond_broadcast() (FUTEX_CMP_REQUEUE) and pthread_cond_signal() (FUTEX_WAKE_OP), in the old FIXME: Until Ingo fixes 4G/4G vDSO, 6 arg syscalls are broken for sysenter. code. That's worrying but we need more data.

So by far the biggest issue appears to be that half the futex calls are either not able to init syscallbuf, or are somehow disproportionately responsible for buffer flushing.

(I'd like to run these numbers for the layout/base tests, but those fork subprocesses which makes the rrpreload instrumentation hard to collect.)

joneschrisg commented 10 years ago

So all but about 100 futex calls went through librrpreload.

Scratch that: there were 15741 calls from rrpreload that trapped to rr, out of 21080 that trapped total. So that means the CMP_REQUEUE's and WAKE_OP's must be accounting for the remaining ~5000 traps. Possibly sufficient motivation to patch glibc.

But this is still good news overall; we can buffer up to 75% of the futexes (from this run). Futex is almost 50% of the traced syscalls (and buffer flushes) so this is still the biggest potential net win.

joneschrisg commented 10 years ago

There's a strange strange hang occurring with sigprocmask and clone. None of the involved tasks are at futex calls, so not quite sure what's up. I did see that we don't clear the thread-local buffer params before forking, which we need to guard against glibc fork code accidentally using the parent's syscallbuf after fork but before after_fork_child. Suspect something TLS related.

joneschrisg commented 10 years ago

741 fixed the TLS hazard.

What's going on here is that a tracee initializes its syscallbuf, but then it runs out to a buffered FUTEX_WAIT while a sigmask is applied that's blocking SIGSYS. Because of #731, rr isn't being notified of the desched event and can't switch out the tracee. So I don't think the futex hooks are at fault, per se.

rr-debugger / rr

Buffer non-6-arg futex syscalls #217

741 fixed the TLS hazard.