Closed rocallahan closed 10 years ago
Interestingly, no syscalls in the pulseaudio thread run buffered. I'd like to figure out why that is, but it does mean that the clock_gettime
call before divergence is not special.
It appears that during recording, pulseaudio's poll_func
's call to pthread_mutex_unlock
performed a FUTEX_UNLOCK_PI
which we recorded in the trace. During replay, that call to pthread_mutex_unlock
does not do any futex operations. Divergence is detected when we do a poll
syscall instead of the futex
syscall.
pthread_mutex_unlock
would only do the futex
syscall if the mutex's __lock
field indicated waiters by having its high bit set. During replay the bit is definitely not set. I wonder how it could be set during recording. The trace shows that a little earlier, we did a FUTEX_UNLOCK_PI
on the same mutex from the same thread (looks like the previous iteration of poll_func
) and there were no context switches between that futex
call and the new one, so how could there be contention?
Aha! Further back in the trace another thread (say T1) calls FUTEX_LOCK_PI
on this futex, and waits. Then we switch to the pulseaudio event loop thread, which finishes its own FUTEX_LOCK_PI
(I guess some third thread released the lock) to get the lock. Then that thread does some processing, releases the lock (but doesn't yield), does a poll(), presumably takes the lock without a syscall, does more work, and releases the lock again. This second release sees waiters.
I suspect we have here some kind of race where the kernel sets the waiters bit at a bad time. I have a bad feeling I'll need to understand the comment on prep_futex_lock_pi
...
The pulseaudio thread's first release is done with a FUTEX_UNLOCK_PI
which indicates the waiters flag was set. After the release, the lock is fully unlocked. I think one possibility for what could have happened is that thread T1's system call tries again to acquire the lock --- after pulseaudio has taken it again. That would set the waiters bit again behind rr's back.
I'm still pondering the comment on prep_futex_lock_pi
, but I think the analysis there doesn't consider the possibility that, after rr has set the waiters bit, thread A releases and reacquires the futex, clearing the waiters bit and allowing the kernel's setting of the bit to be observed.
Yes, it definitely looks the current mechanism will fail if the futex already has the waiters bit set. We'll "wait for the syscall to set the bit" but since it's already set, our wait will do nothing and the kernel is free to go ahead and set it "later".
We could clear the waiters bit before we allow a contented FUTEX_LOCK_PI
syscall to proceed . That will let us reliably "absorb" the kernel's waiters-bit write. The problem then is that after a thread calls FUTEX_UNLOCK_PI
, multiple kernel threads could be woken up, one of which takes the lock and one of the others sets the waiters bit (again). That write will not be observed by rr.
I haven't got any good ideas here. Needs more thought.
The fallback option is to tackle issue #674 and emulate futexes in rr. I think that would simplify things overall, though we might lose syscallbuf support for FUTEX_WAKE. We probably don't need SYSEMU-during-recording trickery; we can neutralize futex syscalls by munging syscall parameters.
I think the most sane approach here is going to be to emulate all priority-inheriting futex operations in rr, leaving WAIT and WAKE alone. As a first cut we can ignore priority inheritance completely. Priority inheritance is not commonly used so I'm not worried about perf impact.
Getting signal handling right will be tricky.
I don't like this to block 2.0 but I think generally it's a good idea for diagnosed divergence bugs to block a release, since they're hard to diagnose and if someone hits this in the wild a lot of effort would be required to figure it out.
@cgjones, any thoughts?
I'm trying to write a testcase for this and failing. I have multiple threads all trying to get the same PI mutex. Whenever the owning thread does a FUTEX_UNLOCK_PI
, the kernel immediately gives the mutex to one of the waiters and sets the waiters bit due to the existence of more waiters, so when we record the new value at *uaddr it always has the waiters bit set already. I don't know how to reproduce the situation observed in this bug where a kernel thread delays setting the waiters bit. There may not even be a reliable way to reproduce it.
I've written code to implement FUTEX_LOCK_PI
by trying to get the lock and if we fail, doing a FUTEX_WAIT
instead followed by a retry of the FUTEX_LOCK_PI
syscall. FUTEX_UNLOCK_PI
is translated to a FUTEX_WAKE
. The invariant is that if, in a real execution, a thread has a FUTEX_LOCK_PI
waiting on futex F, in rr recording the thread is FUTEX_WAIT
ing on F. This seems to work OK in recording; the replay code isn't written yet. It's quite clean because we don't actually have to track waiters; the kernel does it for us.
Implementing FUTEX_CMP_REQUEUE_PI
and FUTEX_WAIT_REQUEUE_PI
is trickier because they are more complex and not well documented. I'll assume for now that the futex uaddr
forFUTEX_CMP_REQUEUE_PI
can only have waiters queued by FUTEX_WAIT_REQUEUE_PI
. Scanty documentation suggests this but does not say it explicitly.
Again, the basic idea is to preserve the wait/wake relationships but replace PI futexes with non-PI ones. So, FUTEX_WAIT_REQUEUE_PI
would do a FUTEX_WAIT
on the uaddr
futex. When this wait completes, we continue by trying to lock the uaddr2
futex and waiting on that if we can't. FUTEX_CMP_REQUEUE_PI
would check the CMP condition and then just wake all the waiters on the uaddr
futex, letting them take responsibility for acquiring the uaddr2
futex. This means there's no actual requeuing, so it's not very efficient, but it should be adequate.
This is going to be tricky to implement because FUTEX_WAIT_REQUEUE_PI
's implementation needs internal state to tell us whether we are at the FUTEX_WAIT
step or the "acquire uaddr2
" step.
And of course this all has to be made to work with signals and restarts. It's not really clear to me how the kernel saves and restores the state of complex syscalls that are interrupted by signals and then resumed.
@cgjones, any thoughts?
If you're asking about my thoughts on this continuing to block 2.0, I think that it should. If you're asking about solutions for the bug, I need a little more time to page this back in and understand what you've written here. (I was pretty confident about the original prep_futex_lock_pi()
solution, but as we agreed at the time neither of us would bet our life on it ;).)
Actually I've got an idea for slicing through this Gordian knot: use preload.c
shims to disable pthread PI mutexes, and drop support for PI futexes completely.
Sorry, haven't had a chance to finish catching up yet, but do you happen to know why pulseaudio is using PI mutexes in the first place?
I don't really know, but I theorize that the pulseaudio event loop thread wants to run at high priority --- or at least be able to, in some configurations --- and wants to avoid priority inversion when a low-priority thread has taken the event loop lock.
Anyway, rr has never actually supported priority inheritance, and things have basically worked, so I'm not worried about disabling it harder.
With the obvious patch, I can record and replay content/media/test no problem.
Where I stop following is at this comment
Yes, it definitely looks the current mechanism will fail if the futex already has the waiters bit set.
It looks like I convinced myself that the transition to FUTEX_WAITERS only happened in the kernel, but you have a case where it happens in userspace? (I'm tempted to just take your word for it, because I don't want to read the futex code again ;).)
The setting of FUTEX_WAITERS
happens in the kernel.
Here's one example of what I think could happen, in more detail:
FUTEX_LOCK_PI
, sets FUTEX_WAITERS
, waits.FUTEX_LOCK_PI
. The FUTEX_WAITERS
bit is already set, so our Task::futex_wait
call to wait for the syscall to proceed past setting the FUTEX_WAITERS
bit in fact does not wait at all. Let's suppose due to some scheduling quirk thread C's kernel thread does not actually get around to setting the FUTEX_WAITERS
bit yet.FUTEX_WAITERS
bit is not set.FUTEX_WAITERS
bit asynchronously. rr is not aware of this write.OK, right, gotcha. Blech. I'm glad you already figured out the problem in your clear-FUTEX_WAITERS-bit idea, because that's the first thing that came to my mind :/.
Following on from #1241. I can record content/media/tests but replay diverges quite late in the run.
We diverge very soon after a
clock_gettime
syscall. This syscall would normally be buffered but in this case it was not, apparently because the syscallbuf is locked.