No more condvar on Linux #58
And for posterity, I confirm that there are no lost signals in Windows condition variables.
Seems like I'm not alone: https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/720611
Seems like there is now an issue upstream affecting C#, Python, OCaml:
And someone reimplemented the glibc condition variables in TLA+ to prove that they were buggy:
Someone pointed me to this issue, and it was interesting to me because I was one of the original participants in the discussion around the glibc bug (https://sourceware.org/bugzilla/show_bug.cgi?id=13165); it sounds like there is now a new, similar glibc bug. However, I'm pretty sure that was not the issue, and that the problem here was not a glibc bug. Observing any bug of that form (stolen signals) requires the signal to take place with the lock held; otherwise, there is no ordering relationship between the signal and newly arriving waiters that might "steal" the signal, and thus no way to designate it stolen. In the code shown here, the signal takes place without the lock held.
I'm not sure if the new futex code you have is correct or not; with backoff and retry, it probably works automatically though. But the condvar code is almost certainly wrong. If you need to know that a signal will wake a waiter that arrived before the event you're signaling, the change to state and the signal operation must take place together under a single instance of the lock being held.
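To make that last point concrete, here is a minimal sketch in C (event_happened and signal_event are illustrative names for this example, not taken from the code under discussion). Because the state change and the signal happen under a single hold of the lock, any waiter that checked event_happened under the same lock has either seen it true or is already blocked in pthread_cond_wait when the signal fires:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool event_happened = false;

void signal_event(void)
{
    pthread_mutex_lock(&lock);
    event_happened = true;       /* the change to state ...      */
    pthread_cond_signal(&cond);  /* ... and the signal operation */
    pthread_mutex_unlock(&lock); /* together under one lock hold */
}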
As an extra data point, the new glibc bug https://sourceware.org/bugzilla/show_bug.cgi?id=25847 is not reproducible on musl (where the current condvar implementation, dating back to 2014, was specifically designed around getting the stolen wakes issue right, as well as self-synchronized destruction right), but as I understand it, you hit the problem described above with both glibc and musl.
I'm that someone. I was at work at the time and didn't really have the time to study the code or the TLA+ model. I just sort of assumed you knew what you were doing and dropped a link to this bug on IRC for the musl folks to check out.
I've now taken a look at the PlusCal code, and it is clear that it's not simulating a condition variable correctly. Signal should be a no-op if nobody is waiting, but in your spec it simply sets a boolean variable, which will eventually awaken a waiter whether or not they were waiting at the time of the signal.
Try something like this and TLC will find a deadlock:
diff --git a/formal_verification/event_notifiers.tla b/formal_verification/event_notifiers.tla
index aaae45f..0ff4bd8 100644
--- a/formal_verification/event_notifiers.tla
+++ b/formal_verification/event_notifiers.tla
@@ -36,6 +36,7 @@ variables
     ticket = FALSE;            \* a ticket to sleep in a phase
     signaled = FALSE;          \* a flag for incoming condvar signal
+    condWait = 0;
     condVar = FALSE;           \* Simulate a condition variable
     msgToParent = "None";      \* Simulate a worksharing request to the parent
     signaledTerminate = FALSE; \* Simulate a termination task
@@ -88,8 +89,10 @@ procedure park()
 begin
     NotSignaled2: if ~signaled then
     StillValid:     if ticket = phase then
+    StartWait:        condWait := condWait + 1;
     Wait:             await condVar;    \* next line is atomic
                       condVar := FALSE; \* we don't model the spurious wakeups here
+                      condWait := condWait - 1;
                     end if;             \* but they are not a problem anyway
                   end if;
     Reset: signaled := FALSE;
@@ -104,7 +107,9 @@ procedure notify()
     end if;
     Notify: signaled := TRUE;
     InvalidTicket: phase := ~phase;
-    Awaken: condVar := TRUE;
+    Awaken: if condWait > 0 then
+              condVar := TRUE;
+            end if;
     RET_Notify: return;
 end procedure;
As @richfelker said, you really should use that lock. It's not there just for show, and you can't dance around it using atomics.
Thank you for the TLA+ condvar.
I'm not sure what you mean by "dance around it using atomics"; this also deadlocks despite using the lock
I mean that the waiter must both check the state variable and call wait while holding the lock. If you don't, there's always a race between checking the state variable and entering wait (no matter how much you use atomics and fences). Of course, the signaller must modify that state variable only while holding the lock.
If you think about the condvar API, it wouldn't make any sense for it to require you to just wrap the wait/signal calls with a lock. If that were all it needed from you, it could very well grab the lock inside the library for you and you wouldn't ever need to worry about it. But simply wrapping the call is not enough; you must also protect the controlling state variable with the lock. That is why it's your responsibility to use the lock.
I don't know if that explanation makes sense but consider this pseudocode:
park() {
    lock.acquire();
    if (shouldWakeUp) {
        shouldWakeUp = FALSE;
        lock.release();
        return;
    }
    // if we didn't have the lock here, shouldWakeUp could change to TRUE right now
    // and the notifying thread could signal before we enter wait(). That signal is
    // effectively lost.
    cond.wait(lock);
    lock.release();
}

notify() {
    lock.acquire();
    shouldWakeUp = TRUE;
    cond.signal(); // it should be safe to signal after the release too
    lock.release();
}
It is critical that the condition for waking up is checked while the lock is held, before entering wait. That's what the lock is there for.
The whole point of my "2-phase commit to sleep" protocol is to avoid holding the lock for a long time. There is work to do in between the shouldWakeUp check and cond.wait(lock) that requires sending messages to other threads. Your proposal might deadlock my runtime if threads are locking each of their event notifiers.
You still must check the predicate again with the lock held before waiting on the condvar, and there must be an invariant that other threads cannot modify the data the predicate depends on without holding the lock.
You don't have to "hold the lock for a long time". You should generally do nothing but take the lock, make sure nothing has changed that might require re-inspection, and then perform the condvar wait. Long-running operations can be done outside that.
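A sketch of that shape, reusing the shouldWakeUp flag from the pseudocode above (announce_sleep_intent is a hypothetical placeholder for the long-running messaging work, not a function from the runtime):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool shouldWakeUp = false;

/* Hypothetical stand-in for the work done between the first check and
   the wait, e.g. sending worksharing messages to other threads.
   It runs without the lock held. */
static void announce_sleep_intent(void) { /* ... */ }

void park(void)
{
    announce_sleep_intent();             /* long-running work, no lock held */

    pthread_mutex_lock(&lock);
    while (!shouldWakeUp)                /* re-check under the lock; the    */
        pthread_cond_wait(&cond, &lock); /* loop also absorbs spurious      */
    shouldWakeUp = false;                /* wakeups                         */
    pthread_mutex_unlock(&lock);
}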
Ah I see what you mean.
In the current code this is ensured by the phase and ticket system: https://github.com/mratsim/weave/blob/a230cce98a8524b2680011e496ec17de3c1039f2/weave/cross_thread_com/event_notifiers.nim#L67-L94
This works fine with the Futex code. I wonder why I didn't get any deadlocks with the Windows or Mac condition variables. In particular, the Windows runs used the same machine, so it's possible that its signaling works under different assumptions than POSIX.
So I'm in conflict:
As an arbiter, the OSX C standard library.
Here are some example logs of the unfortunate fates of my runtime:
Plenty of people on Stack Overflow suggested that you should use locks, that you should unlock before signaling, or that you should unlock after signaling. However, the evidence seems damning: Glibc sits on the bench of the accused. OSX does not exhibit the same behaviour; Musl does, though. Formal verification explored almost 10 million possible state interleavings and did not find a deadlock or livelock in my event notifier:
So what does Glibc have for a defence?
While waiting for a fix that may never come, especially in distros that upgrade very slowly, let's review our options:
pthread_mutex_lock(&m);
pthread_mutex_unlock(&m);
pthread_cond_signal(&c);
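The idea behind this option is that the empty lock/unlock pair orders the signal after any waiter that observed the old state and has since blocked in pthread_cond_wait, without extending the critical section. A minimal sketch of a notifier built on it (should_wake is an illustrative name, and the atomic store stands in for whatever state change the runtime actually performs):

#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static atomic_bool should_wake = false;

void notify(void)
{
    atomic_store(&should_wake, true); /* publish the state change      */
    pthread_mutex_lock(&m);           /* empty critical section: order */
    pthread_mutex_unlock(&m);         /* the signal after any waiter   */
    pthread_cond_signal(&c);          /* still registering to sleep    */
}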