Closed: cconstantine closed this issue 1 year ago.
This is working as intended.
There's nothing preventing the second thread from proceeding to the next loop iteration after it wakes up the main thread via `notify_one`. If you intend the main thread to never miss an event, the second thread must wait until the main thread confirms it has received it. The non-deterministic behavior you see is a result of the thread scheduling behavior of your OS.
> This is working as intended.
I'm having a hard time with this answer. Condvar is behaving in a way that violates my understanding of the locking behavior around condition variables, but the docs for it don't make it strictly wrong and I can't find anything that describes condition variables without platform/language specifics.
If this is working as intended, then how do you avoid missing real wakes in the presence of spurious wakes?
This is not a bug; it is working exactly as a condition variable should. A condition variable does not track the number of 'pending notifications' or any other state like that. `Condvar::notify_one()` does not wait or block until a corresponding `Condvar::wait()`.
The fact that your program does terminate shows that no important notifications are actually missed: the main thread will always end up being awoken to observe the value 100. There is no guarantee that it'll observe every single value before that, since the background thread might be able to run several (or even all 100) iterations of its loop before the main thread checks again.
See https://marabos.nl/atomics/basics.html#condvar for my attempt at explaining how a condition variable is used. Hope that helps.
Thank you for the response, and I don't disagree with you closing this issue.
> A condition variable does not track the number of 'pending notifications' or any other state like that. `Condvar::notify_one()` does not wait or block until a corresponding `Condvar::wait()`.
This makes sense, and is what I would expect. I'm not expecting to get every notification because there is some kind of storage of things that might wait, but I am expecting waiting threads to not miss a notified value.
What I'm expecting is that:
1) The thread waiting on a condition variable is either awake with the mutex locked, or waiting with the mutex unlocked. Never awake with the mutex unlocked.
2) A `.notify_one()`/`.notify_all()` immediately wakes (a) waiting thread(s) to see the notified value before the mutex can be locked again.
> The fact that your program does terminate shows that no important notifications are actually missed: the main thread will always end up being awoken to observe the value 100.
This is not entirely true. The program does terminate every time, but only because I'm using `.wait_timeout()` and the test thread is checking for the last value the background thread will set. If the test thread were looking for any other value, the program would panic: after missing that value, it would eventually wait for the next value after the thread has exited, and `assert!(!timeout.timed_out());` would panic.
I believe the issue is that the mutex is not locked atomically with the waiting thread waking. As your article details, the lock is released atomically when the thread starts waiting, but because the lock is not acquired atomically with the wake, it is possible for another thread to lock the mutex before the Condvar does. This means the Condvar is effectively waiting on two conditions: 1) the futex triggering a wake, and then 2) the mutex being locked. This can make the Condvar effectively miss notifications.
Here's a rust-ish pseudocode example run-trace to illustrate the issue (splitting the mutex unlocking/locking from the blocking/waking):

| Thread 1 | Thread 2 |
|---|---|
| `let mutex_lock = lock(mutex)` | `let mutex_lock = lock(mutex)` |
| `assert_eq!(*mutex_lock, 0);` | waiting because Thread 1 got the lock |
| `condvar.wait_unlock(mutex_lock)` // unlocking mutex | waiting |
| `condvar.wait_block(mutex_lock)` // blocking thread | still waiting because the unlock/wait is atomic |
| waiting | `*mutex_lock += 1;` |
| waiting | `condvar.notify_one()` |
| waiting | `drop(mutex_lock);` // unlocking |
| `condvar.wait_wake();` | `let mutex_lock = lock(mutex)` |
| `condvar.wait_lock(mutex);` | `*mutex_lock += 1;` |
| waiting because Thread 2 got the lock | `drop(mutex_lock);` // unlocking |
| `assert_eq!(*mutex_lock, 1);` | ... |
| panics because `*mutex_lock == 2` | ... |
It could also happen if there are 3 threads: 1 waiting, and 2 attempting to increment the mutex value:

| Thread 1 | Thread 2 | Thread 3 |
|---|---|---|
| `let mutex_lock = lock(mutex)` | `let mutex_lock = lock(mutex)` | `let mutex_lock = lock(mutex)` |
| `assert_eq!(*mutex_lock, 0);` | waiting because Thread 1 got the lock | waiting because Thread 1 got the lock |
| `condvar.wait_unlock(mutex_lock)` // unlocking mutex | waiting | waiting |
| `condvar.wait_block(mutex_lock)` // blocking thread | still waiting because the unlock/wait is atomic | still waiting because the unlock/wait is atomic |
| waiting | `*mutex_lock += 1;` | waiting |
| waiting | `condvar.notify_one()` | waiting |
| `condvar.wait_wake();` | `drop(mutex_lock);` // unlocking | waiting |
| `condvar.wait_lock(mutex);` | ... | `*mutex_lock += 1;` |
| waiting because Thread 3 got the lock | ... | `condvar.notify_one()` |
| waiting | ... | `drop(mutex_lock);` // unlocking |
| `assert_eq!(*mutex_lock, 1);` | ... | ... |
| panics because `*mutex_lock == 2` | ... | ... |
Here's a rust playground illustrating the issue using 2 background threads: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=0dda887121bc061878a94256e344ceb5
The internet is pretty sparse on the details of the mutex locking when the waiting thread wakes. The only docs I can find that talk about it are the C++ docs I linked above, which say (emphasis mine): "Call wait, wait_for, or wait_until on the std::condition_variable (atomically releases the mutex and suspends thread execution until the condition variable is notified, a timeout expires, or a spurious wakeup occurs, then atomically acquires the mutex before returning)".
If rust is not attempting to reproduce that behavior (which is fair), would it make sense to add something to the Condvar docs to make it clear that the mutex is not locked atomically upon waking?
While working with `std::sync::Condvar` to create a threaded synchronization system, I think I've encountered a situation where a Condvar can miss wake-ups triggered by `.notify_one()` and `.notify_all()` when waiting for a specific condition to be met.

I suspect this is caused by the `.wait()` and `.wait_timeout()` methods not atomically unlocking/locking the mutex when waiting/waking. The docs for Condvar state: "This function will atomically unlock the mutex specified (represented by guard) and block the current thread." Of note, it doesn't state something similar for locking the mutex upon waking (the C++ docs do indicate this). I think it is necessary for the mutex to be locked atomically with the thread waking; otherwise it is possible to miss important notifies when spurious wakes happen, or if the desired condition isn't met after the first `notify` is called. I haven't been able to confirm this is the problem, or fix it in the Rust implementation of condition variables.

I tried this code: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=f48827f25042ae134724922b90055385

I expected to see this happen:

Alternating `dbg!` outputs of `*outer_lock`'s value from the test thread and `&i`'s value from the background thread. Then, after `LOOP_TO` iterations, the `wakes` variable should equal `LOOP_TO`. This would mean that the test thread woke for every `notify_one()` in the background thread.

Instead, this happened:

Not every time, but frequently, there are missing outputs from `dbg!(*outer_lock)`, and `wakes` is less than `LOOP_TO`. This indicates to me that the test thread missed `cvar.notify_one()` triggers. Adding sleeps can make this test reliably pass, and removing the `dbg!` statements makes the failure more likely, indicating to me that the condition variable and mutexes aren't completely eliminating a race condition of some kind.

System information:
Discussion
Before making this report, I asked about this behavior in the #contribute channel on Discord here.