Divergence replaying recorded Firefox content/media/tests

rocallahan commented 10 years ago

Following on from #1241. I can record content/media/tests but replay diverges quite late in the run.

We diverge very soon after a clock_gettime syscall. This syscall would normally be buffered but in this case it was not, apparently because the syscallbuf is locked.

rocallahan commented 10 years ago

Interestingly, no syscalls in the pulseaudio thread run buffered. I'd like to figure out why that is, but it does mean that the clock_gettime call before divergence is not special.

rocallahan commented 10 years ago

It appears that during recording, pulseaudio's poll_func's call to pthread_mutex_unlock performed a FUTEX_UNLOCK_PI which we recorded in the trace. During replay, that call to pthread_mutex_unlock does not do any futex operations. Divergence is detected when we do a poll syscall instead of the futex syscall.

rocallahan commented 10 years ago

pthread_mutex_unlock would only do the futex syscall if the mutex's __lock field indicated waiters by having its high bit set. During replay the bit is definitely not set. I wonder how it could be set during recording. The trace shows that a little earlier, we did a FUTEX_UNLOCK_PI on the same mutex from the same thread (looks like the previous iteration of poll_func) and there were no context switches between that futex call and the new one, so how could there be contention?

rocallahan commented 10 years ago

Aha! Further back in the trace another thread (say T1) calls FUTEX_LOCK_PI on this futex, and waits. Then we switch to the pulseaudio event loop thread, which finishes its own FUTEX_LOCK_PI (I guess some third thread released the lock) to get the lock. Then that thread does some processing, releases the lock (but doesn't yield), does a poll(), presumably takes the lock without a syscall, does more work, and releases the lock again. This second release sees waiters.

I suspect we have here some kind of race where the kernel sets the waiters bit at a bad time. I have a bad feeling I'll need to understand the comment on prep_futex_lock_pi...

rocallahan commented 10 years ago

The pulseaudio thread's first release is done with a FUTEX_UNLOCK_PI which indicates the waiters flag was set. After the release, the lock is fully unlocked. I think one possibility for what could have happened is that thread T1's system call tries again to acquire the lock --- after pulseaudio has taken it again. That would set the waiters bit again behind rr's back.

I'm still pondering the comment on prep_futex_lock_pi, but I think the analysis there doesn't consider the possibility that, after rr has set the waiters bit, thread A releases and reacquires the futex, clearing the waiters bit and allowing the kernel's setting of the bit to be observed.

rocallahan commented 10 years ago

Yes, it definitely looks the current mechanism will fail if the futex already has the waiters bit set. We'll "wait for the syscall to set the bit" but since it's already set, our wait will do nothing and the kernel is free to go ahead and set it "later".

rocallahan commented 10 years ago

We could clear the waiters bit before we allow a contented FUTEX_LOCK_PI syscall to proceed . That will let us reliably "absorb" the kernel's waiters-bit write. The problem then is that after a thread calls FUTEX_UNLOCK_PI, multiple kernel threads could be woken up, one of which takes the lock and one of the others sets the waiters bit (again). That write will not be observed by rr.

rocallahan commented 10 years ago

I haven't got any good ideas here. Needs more thought.

The fallback option is to tackle issue #674 and emulate futexes in rr. I think that would simplify things overall, though we might lose syscallbuf support for FUTEX_WAKE. We probably don't need SYSEMU-during-recording trickery; we can neutralize futex syscalls by munging syscall parameters.

rocallahan commented 10 years ago

I think the most sane approach here is going to be to emulate all priority-inheriting futex operations in rr, leaving WAIT and WAKE alone. As a first cut we can ignore priority inheritance completely. Priority inheritance is not commonly used so I'm not worried about perf impact.

Getting signal handling right will be tricky.

I don't like this to block 2.0 but I think generally it's a good idea for diagnosed divergence bugs to block a release, since they're hard to diagnose and if someone hits this in the wild a lot of effort would be required to figure it out.

@cgjones, any thoughts?

rocallahan commented 10 years ago

I'm trying to write a testcase for this and failing. I have multiple threads all trying to get the same PI mutex. Whenever the owning thread does a FUTEX_UNLOCK_PI, the kernel immediately gives the mutex to one of the waiters and sets the waiters bit due to the existence of more waiters, so when we record the new value at *uaddr it always has the waiters bit set already. I don't know how to reproduce the situation observed in this bug where a kernel thread delays setting the waiters bit. There may not even be a reliable way to reproduce it.

rocallahan commented 10 years ago

I've written code to implement FUTEX_LOCK_PI by trying to get the lock and if we fail, doing a FUTEX_WAIT instead followed by a retry of the FUTEX_LOCK_PI syscall. FUTEX_UNLOCK_PI is translated to a FUTEX_WAKE. The invariant is that if, in a real execution, a thread has a FUTEX_LOCK_PI waiting on futex F, in rr recording the thread is FUTEX_WAITing on F. This seems to work OK in recording; the replay code isn't written yet. It's quite clean because we don't actually have to track waiters; the kernel does it for us.

Implementing FUTEX_CMP_REQUEUE_PI and FUTEX_WAIT_REQUEUE_PI is trickier because they are more complex and not well documented. I'll assume for now that the futex uaddr forFUTEX_CMP_REQUEUE_PI can only have waiters queued by FUTEX_WAIT_REQUEUE_PI. Scanty documentation suggests this but does not say it explicitly.

Again, the basic idea is to preserve the wait/wake relationships but replace PI futexes with non-PI ones. So, FUTEX_WAIT_REQUEUE_PI would do a FUTEX_WAIT on the uaddr futex. When this wait completes, we continue by trying to lock the uaddr2 futex and waiting on that if we can't. FUTEX_CMP_REQUEUE_PI would check the CMP condition and then just wake all the waiters on the uaddr futex, letting them take responsibility for acquiring the uaddr2 futex. This means there's no actual requeuing, so it's not very efficient, but it should be adequate.

rocallahan commented 10 years ago

This is going to be tricky to implement because FUTEX_WAIT_REQUEUE_PI's implementation needs internal state to tell us whether we are at the FUTEX_WAIT step or the "acquire uaddr2" step.

And of course this all has to be made to work with signals and restarts. It's not really clear to me how the kernel saves and restores the state of complex syscalls that are interrupted by signals and then resumed.

joneschrisg commented 10 years ago

@cgjones, any thoughts?

If you're asking about my thoughts on this continuing to block 2.0, I think that it should. If you're asking about solutions for the bug, I need a little more time to page this back in and understand what you've written here. (I was pretty confident about the original prep_futex_lock_pi() solution, but as we agreed at the time neither of us would bet our life on it ;).)

rocallahan commented 10 years ago

Actually I've got an idea for slicing through this Gordian knot: use preload.c shims to disable pthread PI mutexes, and drop support for PI futexes completely.

joneschrisg commented 10 years ago

Sorry, haven't had a chance to finish catching up yet, but do you happen to know why pulseaudio is using PI mutexes in the first place?

rocallahan commented 10 years ago

I don't really know, but I theorize that the pulseaudio event loop thread wants to run at high priority --- or at least be able to, in some configurations --- and wants to avoid priority inversion when a low-priority thread has taken the event loop lock.

Anyway, rr has never actually supported priority inheritance, and things have basically worked, so I'm not worried about disabling it harder.

rocallahan commented 10 years ago

With the obvious patch, I can record and replay content/media/test no problem.

joneschrisg commented 10 years ago

Where I stop following is at this comment

Yes, it definitely looks the current mechanism will fail if the futex already has the waiters bit set.

It looks like I convinced myself that the transition to FUTEX_WAITERS only happened in the kernel, but you have a case where it happens in userspace? (I'm tempted to just take your word for it, because I don't want to read the futex code again ;).)

rocallahan commented 10 years ago

The setting of FUTEX_WAITERS happens in the kernel.

Here's one example of what I think could happen, in more detail:

Thread A holds the mutex
Thread B does FUTEX_LOCK_PI, sets FUTEX_WAITERS, waits.
Thread C does FUTEX_LOCK_PI. The FUTEX_WAITERS bit is already set, so our Task::futex_wait call to wait for the syscall to proceed past setting the FUTEX_WAITERS bit in fact does not wait at all. Let's suppose due to some scheduling quirk thread C's kernel thread does not actually get around to setting the FUTEX_WAITERS bit yet.
We schedule thread A, which releases the mutex.
Thread A blocks for some unrelated reason.
We schedule thread B, which takes the mutex. The FUTEX_WAITERS bit is not set.
At some point thread C is rescheduled in the kernel and sets the FUTEX_WAITERS bit asynchronously. rr is not aware of this write.

joneschrisg commented 10 years ago

OK, right, gotcha. Blech. I'm glad you already figured out the problem in your clear-FUTEX_WAITERS-bit idea, because that's the first thing that came to my mind :/.

rr-debugger / rr

Divergence replaying recorded Firefox content/media/tests #1248