rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
95.41k stars 12.29k forks source link

Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740

Closed m-ou-se closed 1 year ago

m-ou-se commented 2 years ago

A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve std::sync::Mutex, std::sync::Condvar, and std::sync::RwLock.

On most platforms, these structures are currently wrappers around their pthread equivalent, such as pthread_mutex_t. These types are not movable, however, forcing us to wrap them in a Box, resulting in an allocation and indirection for our lock types. This also gets in the way of a const constructor for these types, which makes static locks more complicated than necessary.

@Amanieu presented his library, parking_lot, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, this manages the queues in user space in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.

There has been an attempt at merging parking_lot into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.

One of the problems with replacing std's lock implementations by parking_lot is that parking_lot allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem where you can't allocate memory without locking, but you can't lock without first allocating the hash table.

A possible solution is to use a static table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such as mmap on Unix. (See this thread for some discussion.)

In our meeting, we discussed what's natively available on various operating systems/platforms:

After some discussion, the consensus was to providing the locks as 'thinnest possible wrapper' around the native lock APIs as long as they are still small, efficient, and const constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm. And for other platforms without suitable native APIs, a parking_lot-based implementation using one of the possible workarounds for the allocation problem.

This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.

It also means that porting the standard library to a new platform will be easier, and maintainance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.

Update: We've decided to not directly use parking lot's one-byte (two bit) locks, but instead use the equivalent of its internal WordLock or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.

Update 2: To maintain priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex unix platforms, at least for now. Using https://github.com/rust-lang/rust/pull/97647, we've made it possible for the locks to have const fn new.


Primary goals:

Possible later goals:


To do:

hkratz commented 2 years ago

Did you investigate using os_unfair_lock (Apple Developer Docs) on macOS? It can't be moved once it is in use either but it is essentially just a struct with a u32 in it, initialized to 0 before use. But maybe I am missing some requirements here. If it seems possible, investigation of it could be added as a subtask.

It is available since macOS 10.12 on x86-64, but could be used unconditionally on aarch64.

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

nbdd0121 commented 2 years ago

I don't think we need to use hash tables like parking lot for targets without futex support. We could just store the queues inside Mutex. I think the only downside is size, but that'll still be better than the current boxed pthread mutex though.

Amanieu commented 2 years ago

Nevermind it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

Also it doesn't support Condvar, which is required for Rust locks.

thomcc commented 2 years ago

Oh. This is a subject extremely near-and-dear to my heart -- I even started on a pre-RFC for atomic wait/wake ops, like the ones that C++ got recently. That said, much of my motivation was hoping it would find a way to unstick us here, and I this is clearly a higher priority.

For futex APIs, I wrote a blog post on them at one point. https://shift.click/blog/futex-like-apis I wrote about some of the options here for system APIs, which might be helpful as reference.

That said, I got asked about Darwin's __ulock_* API in a Zulip DM, since I wrote that post as well as hte ulock-sys crate.


For Darwin, it I think it's quite unlikely we could use __ulock_{wait,wake} for a futex API directly without risking... many problems. The fact that it may break in the future is actually one of the less concerning issues, in comparison to the fact that that use of private APIs can cause rejection from app stores, which would be a problem for Rust code that on the iOS/Mac App Stores (for clarity: I'm just going on a hunch[^stores] that could happen for __ulock_*, I do not know for certain, but I think libstd needs to favor caution here).

That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.

That is: libc++'s implementation of std::atomic<u32>::wait, notify_one, and notify_all will ultimately bottom out to calls to __ulock_wait/__ulock_wake, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons:

  1. It's unclear to me how much we promise about what we link to on various platforms, but it's possible we cannot do this, and likely that we do not want to... I don't know if we make promises about what we need to link against for libstd, but suddenly dynamically linking against the C++ standard library on some platforms... feels like it would be a change.

  2. Not to mention -- even if we are allowed to do this, it's still a pain in the butt: The version of libc++ that provides this this is much newer than our platform guarantees, so we'd need to weak-import[^dl] the dubiously-public[^abi] symbol that this stuff under the hood (although I suppose this is better than requiring a C++ compiler to build Rust code for these platforms).

So, maybe? Given that the OS provides libc++, it seems like it should technically be fine (unless we specifically say we won't). At least, so long as we manage to do it in a way that doesn't versions older than we care about. But maybe I've missed something (or I haven't, but it's still several bridges too far).

(If not, oh well, someday an API shaped mostly like that is bound to end up as part of the libc someday, after it's copied to C from C++ in C9Z 😛)

(EDIT: Thinking about it more, it's fairly likely we should have some justification for this approach, since looking at the libc++ code, enough happens between what our all would be that it could defeat the point. And the effort spent here to make it work would be better spent on improving our fallback)


All that said, there are a few options on various platforms that could probably help implement the fallback thread parker, in case there are regressions (which plausibly there wouldn't be if we use parking_lot_core as-is, but maybe there would be if we need to abandon the growth behavior). In particular:

That said, a lot of this probably should be in the a tail of followups, and only if it's worthwhile.

[^more]: That is, at worst the single-piece semaphore could just be implemented in terms of a private condvar and mutex pair, resulting in the same performance. That said, relying too heavily on this style of logic is a good way to get unpleasant surprises, so who knows.

[^stores]: I mean, if nothing else they start with __ and don't appear in a single public header (they only appear in the headers inside apple source repos). So, even though these APIs don't seem to be in the blacklists that prevents you from even attempting app store submission (which is not a guarantee), it seems completely out of the question to use in libstd (even if I do think they're cool, and I do).

[^dl]: Calls to dlopen/dlsym tend to be disallowed in the same situations, since tooling can't tell if you're using it to use private APIs. (We already do this all over libstd though, so maybe it's not the issue it once was)

[^abi]: Which isn't API-compatible, but presumably has to stay around for ABI reasons. I mean, it must, right? C++ standard libraries are famously resilient to ABI issues, after all 😅...

[^hermit]: For example, the load-then-sub in notify_one is seems like it has a race that could lead to a deadlock, but there may be a reason this is impossible -- I've never spent that long thinking about it, just seems dubious, and like the kinda code that'd be real nice to replace with something that (if nothing else) will get more attention.

bjorn3 commented 2 years ago

On both Linux and some BSDs there's a native futex() syscall, which provides functionality similar (but more primitive) than a parking lot. Nearly equivalent functionality is available in Wasm. While not trivial, it's possible to implement our synchronization primitives directly based on such a futex. This makes all the primitives small (32-bit), efficient, and const constructible.

Please make sure you implement a fair or eventually fair (this is what parking lot does and needs the hashmap for) lock to prevent starvation.

nbdd0121 commented 2 years ago

I don't think pthread mutexes are fair so the new implementation shouldn't be required to be fair either.

thomcc commented 2 years ago

SRWLOCK is also unfair too, pthread_mutex_t depends (POSIX doesn't require it, apple has it, I believe glibc doesn't)... So this would mean we need to a parking lot everywhere, even if the OS has otherwise good locks (like with Window's SRWLOCK).

More generally: There are a bunch of locking features that would be nice to have in an ideal world, but I think they start becoming contradictory, or only achievable easily on one platform or another -- for example, I believe the parking-lot style of lock won't respect OS priorities (unless there's a clever trick I don't know about or something). Trying to solve all of this probably has no solution, and I'm not sure that we're in the right position decide which things we want to promise yet[^1]... assuming we even want to make these sorts of promises (and I'm not sure we do).

[^1]: I feel like the place for that is afterwards we ship it, if/when bugs come in from users and such. Or at least when we're further in the implementation.

hkratz commented 2 years ago

Apple switched the default from PTHREAD_MUTEX_POLICY_FAIRSHARE_NP to the newly introduced PTHREAD_MUTEX_POLICY_FIRSTFIT_NP in macOS 10.14 (iOS and tvOS 12.0, watchOS 5.0), so they are not fair for recent macOS either (by default).

nagisa commented 2 years ago

Wouldn't implementing lock algorithms on top of futex et al. effectively mean that both the parking_lot based implementations and these other algorithms receive less test/use coverage overall? Or is there some plan to share the gnarly implementation details between parking_lot based implementations and the futex/etc. ones?

m-ou-se commented 2 years ago

@nagisa Yes. That's an issue with all platform-specific code, I suppose. But we do plan to also have the CI test the generic implementation (parking lot) on Linux as well, so it gets test coverage.

m-ou-se commented 2 years ago

I'm now reading through the various popular lock implementation to see how those are implemented.

Starting with musl.

Mutexes

Musl's pthread_mutex_t is 40 bytes (on 64-bit), although for regular/normal mutexes, only two 32-bit integers are relevant/used: the lock futex and the waiters counter. (Pthread mutexes support various attributes and different types which are all runtime settings, and thus stored inside the mutex itself. Depending on the type, other fields in the mutex are used. In Rust, however, different types are tracked in the type system, not dynamically inside the mutex.)

Its implementation is relatively simple. The lock integer contains just two bits: one to indicate whether the mutex is locked, and one to indicate whether it's contended. The contended bit can only be set when the lock bit it set as well. Locking tries to compare-and-swap 0 to 'locked' (spinning 100 times if necessary), and otherwise blocks on a 'futex wait' on the lock. Before waiting on the futex, the high bit of the lock is set and the waiter count is incremented. After waking (whether spurious or not) the waiter count is decremented.

When unlocking, a 0 is unconditionally written to the lock, and when the high bit was set or the waiter count was nonzero, a one thread is awoken through the futex.

It's not entirely clear to me how useful the waiter count is. This mechanism would also work correctly with only the 'contended'-bit. The waiter count is only used by the spin loop while locking, to avoid spinning and sleep directly when there's other waiters, to avoid stealing the lock in front of others. This makes things a little bit more 'fair'. See commit f5fb20b.

Condition variables

Musl's pthread_cond_t is 48 bytes (on 64-bit), although for 'normal' condition variables, only two pointers and a 32-bit lock/futex is used.

The lock protects a doubly linked list pointed at by the two pointers. This doubly linked list is the waiting queue, where each node is stored on the stack of a waiting thread. Each node contains a state (waiting, signalled, or leaving) and a futex that the thread is waiting on. pthread_cond_signal wakes up the first thread on the list that wasn't signalled yet. pthread_cond_broadcast marks all nodes as 'signalled', but only wakes up one thread. That thread, after woken up (and locking the mutex), will requeue the next signalled thread onto the mutex. That way, there's no need to store a pointer to the mutex in the condition variable, as pthread_cond_broadcast doesn't have to know about the mutex. It only requeues one thread. When that thread is woken up, that one will requeue the next one, and so on.

Reader-writer locks

Musl's pthread_rwlock_t is 56 bytes (on 64-bit), although it only uses two 32-bit integers in the normal case: one lock futex, and one waiters count. Its implementation is very similar to pthread_mutex_t, except 31 bits of the lock futex are used to count the number of active reader locks. The maximum value is used to indicate a writer lock. The other bit is used as the 'contended' bit, and is used for both readers and writers. It does not prioritize writers, allowing more readers to lock the lock even if there's writers waiting. The reason for that is explained in commit 50304f2: posix requires reader locks to be recursive. Proritizing writers could make a second reader lock on the same thread block.

Read-unlocking a contended lock will wake up one thread. Write-unlocking a contended lock will wake up all threads. The latter could cause another writer thread and many reader threads to wake up at the same time, potentially resulting in the writer thread winning the race and all the reader threads going back to sleep again.

The waiter counter has the same purpose as in pthread_mutex_t above.


In general it's interesting to see how many implementation details are relevant to Posix, but not to us here in Rust. For example, commit c68de0b mentions how Posix allows destroying and unmapping a mutex even if another thread hasn't returned from pthread_mutex_unlock yet. In Rust, this is impossible due to borrow rules. And on top of things like that, Posix supports more types of locks (recursive, process-shared, priority-inheriting, ..) all within the same type, which we do not need to support (within the same type).


Tl;dr: Musl uses trivial rwlocks and mutexes, except it separately counts waiters to avoid spinning when there's other waiters. Rwlocks don't prioritize writers because Posix wants reentrant reader locks. Condition variables are not trivial; they keep a waiter queue as linked list in user space, and requeue threads one by one.

m-ou-se commented 2 years ago

Next: glibc's mutexes. (Glibc's condition variables and rwlocks are a bit more complicated. I'll come back to those later.)

Mutexes

Glibc's mutexes are 40 bytes and support a lot of different properties. The 'normal' kind, (PTHREAD_MUTEX_TIMED_NP) however is very simple, and only uses three 32-bit fields: the lock futex, the owner thread id, and the number of users. The last two fields are useful for debugging and dead-lock/misuse detection, but not used for the basic functionality of the mutex. (The n_users field represents the current number of threads that hold the lock (0 or 1), except it doesn't get decremented when pthread_cond_wait temporarily unlocks the mutex, which means that those waiting threads are also included in this number.)

The normal mutex, unlike the many other variants, is very simple. It has three states. 0 for unlocked, 1 for locked without waiters, and 2 for locked with waiters. Locking tries to compare-and-swap 0 to 1, or otherwise swaps in a 2 and futex-wait's for things to change. Unlocking swaps a 0 in place, and futex-wake's one waiting thread if it was set to 2. Unlike musl's implementation, it does not try to spin at any point, and directly switches to sleeping if the compare-and-swap fails.

It's basically identical to the one I wrote in my experiment: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109

kprotty commented 2 years ago
thomcc commented 2 years ago

Would os_unfair_lock be possible for darwin systems? It supports priority inheritance, is 32bits large and movable, but is only available on macos 10.12+

Lack of timed wait is a problem, and libstd supports back to 10.7+ (and I'm not sure what's required to reconsider this).

kprotty commented 2 years ago

Ah, fair point for the 10.7+ requirement. Does the new Mutex impl need timed wait? SRWLock and the current API don't support it.

thomcc commented 2 years ago

It's supported via condvar: https://doc.rust-lang.org/std/sync/struct.Condvar.html

kprotty commented 2 years ago

Condvar's wait_timeout functions look like they always return the MutexGuard (excluding when poisoned). This implies that the mutex itself must be locked on return, hence no need for Mutex to internally have a timed_lock(). darwin's and musl's condvar implementations help confirm.

m-ou-se commented 2 years ago

Continuing with glibc's condition variables:

Condition variables

Glibc's condition variables are 48 bytes, and all of them are used in normal operation. Unlike musl's implementation, this one does not keep a queue in user space, as the same implementation is used for process-shared condition variables, which cannot refer to any memory addresses that might be different in different processes.

It keeps a 64-bit sequence number that's incremented by every new waiter, and a second 64-bit number that indicates which of those have been signalled. Within the structure, there are futexes and some other fields for exactly two 'groups' of waiters: G1 and G2. Waiters add themselves to G2, and signals wake up threads in G1. When G1 is empty, signalling first swaps the two groups. This is a workaround for bug 13165, which is about time-traveling signals that end up waking future waiters rather than current waiters.

Two bits of the condition variable represent its internal lock (a trivial three state mutex), which protects the state of the condition variable.

There is some additional complexity about destroying a condition variable while there are still waiters. However, this is irrelevant for our Rust implementation, since we don't allow dropping things that are still in use.

This implementation raises the following questions for a Rust implementation:

m-ou-se commented 2 years ago

More on glibc's condition variables:

Their old design before the bugfix for the time-traveling signals is documented here in pseudocode: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/DESIGN-condvar.txt;h=4845251c7586c9667cd33c38af3701910d65eaa8;hb=c0ff3befa9861171498dd29666d32899e9d8145b

It keeps a (64-bit) counter for the number of signals that have been sent, and another for the number of threads that have been woken up. The difference between those two is effectively the number of signals ('tokens') that can still be consumed. If a waiting thread wakes up from the futex-wait operation and there are no tokens left (i.e. the counters are equal), it goes back to sleep. That's how a future thread can consume a token and prevent another waiting thread from waiting up from pthread_cond_wait.

tschuett commented 2 years ago

llvm libc for comparison: https://github.com/llvm/llvm-project/blob/main/libc/src/threads/linux/CndVar.h

m-ou-se commented 2 years ago

Finishing up glibc:

Reader-writer locks

Glibc's rwlock is documented here. It is 32 bytes, of which 22 are used during normal operation. It contains two futexes, a readers counter, a writers counter, the thread ID of the current writer, the selected mode, and a flag indicating whether it is process-shared or not. It supports three different modes: a) prefer readers (default), b) prefer writers with recursive reader locks, c) prefer writers without recursive reader locks. Only in the last mode do waiting writers block additional reader locks.

Part of its complexity is due to supporting these different modes within the same structure. Another part of the complexity is due to posix allowing destruction in cases where the Rust borrow checker would not allow it. Both of these are not relevant to our implementation in Rust.

The reader counter also contains three state bits. Using it directly as a futex is avoided, in part because that number can rapidly change as readers come and go. Therefore, one of the state bits (whether the lock is in reader or writer mode) is duplicated in a separate futex.

It also includes a mechanism through which writers can hand over a writer lock to each other without unlocking, which is only used if the lock is set to a mode that prefers writers.

It has nine states, each of which is either part of the 'read phase' (1 through 4a) or the 'write phase' (5 through 8). Even the idle state is split in both a 'read phase' (1) and a 'write phase' (5) idle state. It's entirely symmetrical, except for one more state (4a) in the read phase which is used in the non-recursive writer-preferred mode to block new waiters even when the lock is already read-locked. The transition from one of the idle states to the locked state in the other phase (2 or 7) goes through a intermediary state (3 or 6), during which a thread trying to lock the lock in the 'current phase' (read/write) can prevent the phase change and steal the lock. (This intermediary state is not used for the 'try lock' functions.) The locked states (2 and 7) are split into two states where one indicates there are waiters of the opposite phase (4/4a and 8).

Below is a state transition diagram of this lock. The dashed arrows are only used if writers are preferred. The thin arrows are only used by the 'try lock' functions.

glibc-rwlock-states drawio

The thread ID of the current writer that's stored inside the lock is only used for debugging and deadlock detection when the holder of a write lock tries to acquire a read lock.

m-ou-se commented 2 years ago

Next up: Wine

We don't have the source code to Microsoft's SRW locks and condition variables, but we do have the source code to Wine, which implements these as well. However, while they have to be the same size as Microsoft's implementation, they are implemented very differently and you'll see very different bit patterns in them if you inspect them while they are in use.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. They are the size of a pointer, but Wine's implementation only uses 32 bits, regardless of pointer size. It is split into two 16-bit counters: The number of active reader locks ('owners'), and the number of waiting writers. The reader/owner counter is set to u16::MAX when it is write-locked.

The lock and unlock functions are very simple. Write-locking increments the waiter counter (using a 16-bit operation), and then uses a 32-bit CAS operation on both counters at once to acquire the lock if the owner counter is zero by setting it to u16::MAX and decrementing the waiter counter again. If the owner counter wasn't zero, it'll do a futex[^1] wait to wait until it is.

[^1]: They are not called 'futexes' on Windows, but the operations are identical: WaitOnAddress, WakeByAddressSingle, and WakeByAddressAll.

Read-locking uses a CAS operation to increment the number of owners only if there's no waiting writers and the lock isn't write-locked. So, waiting writers block new readers. If the conditions aren't met, it futex-wait's until they are.

Interestingly, two different addresses are used within the SRW lock as two different futexes. Readers wait on the address of the SRW lock itself and use the full 32-bit value, while writers wait only on the second half which contains the 16-bit owner counter. Unlocking the last reader lock will wake a single thread from the writer queue. Unlocking a writer lock will wake a single writer if there was any waiting (indicated by the waiting counter), or wake all readers if there are no waiting writers.

Mixing 16-bit and 32-bit atomic operations on overlapping memory is not common, and the Intel manual (vol. 3A) seems to suggest this is not a great idea:

Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths.

Also, surprisingly, no overflow checks are done. If either of the counter overflows, unexpected things happen.

Condition variables

Wine's condition variables are trivial. Just like SRW locks, they are the size of a pointer, but Wine's implementation only uses 32-bits. It is simply a 32-bit counter that gets incremented on every notification. It does not keep track of which lock it is used together with, nor does it make use of requeueing. It does not keep track of any information to prevent a futex-wake operation when there's no thread waiting, etc. It is identical to Condvar1 in my experiment.

m-ou-se commented 2 years ago

Next up: Windows

Now this one is a bit tricky because the source code isn't available, but I made some guesses and have good reason to believe they are accurate.

Reader-writer locks (and Mutexes)

On Windows we use SRW locks for both RwLock and Mutex, so there's no separate mutex implementation we're looking at. SRW locks are one pointer in size, of which the last four bits encode four flags. The other bits either represent a (60-bit or 28-bit) counter, or they represent a (64-bit or 32-bit) pointer with the last four bits missing. That pointer points to something 16-byte aligned, to make sure the last bits are always 0.

When there's no contention, the number of active read locks is stored in the counter. For a write lock, the counter is 0. The least significant bit is used to indicate whether the lock is locked (regardless of read or write mode). An unlocked lock is entirely zero.

When there are threads waiting, that is indicated by one of the four flag bits, and the counter is no longer used and instead used as a pointer to a doubly linked list of waiting threads. The nodes of that list are stored on the stack of the waiting threads. Each node also contains a pointer to the last node in the queue, so you can quickly get there from the first pointer. (The lock only stores a pointer to the first node.)

The last node contains the number of active read locks, since the pointer took the place of that counter within the lock itself.

Each node contains the thread id of the waiting thread, and an additional flag that indicates whether the thread is waiting to acquire a read or write lock.

The thread id is used with the undocumented NtWaitForAlertByThreadId and NtAlertThreadByThreadId APIs. These seem to be effectively identical to Rust's thread::park() and Thread::unpark() functions. So, SRW locks are not futex-based, but thread parker based.

I'm assuming some of the flag bits in the lock are used as a mutex to protect modifications to the queue.

A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation, but it doesn't exactly prefer writers in general: it ~roughly attempts to handle the requests in order. SRW locks are not fair.

Like most lock implementations, it first spins a while before going to sleep, to avoid expensive syscalls when the lock is only locked for a very short time. Spinning seems configurable through some global. It also seems to support an alternative to spinning based on the monitorx and mwaitx instructions.

Condition variables

Condition variables on Windows are also a single pointer. They work very similar to SRW locks, storing a pointer with flag bits in the least significant bits, with the pointer pointing to a linked list over waiting threads. The nodes even have the same structure as the nodes used by SRW locks.

WakeAllConditionVariable iterates over the entire list and wakes all threads. It does not requeue waiters to the relevant lock.


Here's a diagram showing an SRW lock in different states:

srw-locks drawio

A condition variable looks nearly identical, except for the locked states and the four flag bits.


tl;dr: SRW locks and condition variables on Windows are thread parker based and keep their waiting queue as a linked list in user space through nodes on the stack of the waiting threads. Thread parking is natively supported on Windows through an undocumented API. It prefers writers to avoid writer starvation. They spin before sleeping, and can use the special monitorx and mwaitx instructions as an optimization.

m-ou-se commented 2 years ago

And then we have: boost (on Posix)

Mutexes

Boost's mutexes (boost::mutex) on Posix platforms are simply a wrapper around pthread_mutex_t. (So, using glibc, these are 40 bytes.) (Implementation here.)

Condition variables

Boost's condition variables (boost::condition) on Posix platforms are a wrapper around pthread_cond_t, but with an additional pthread_mutex_t. (So, using glibc, these are 88 bytes.) The mutex is used to allow for interruptions/cancellation. (Implementation here.)

Waiting on a condition variable is an interruption point. However, if waiting would result in pthread_cond_wait on the mutex that the user wants to wait on, boost would not be able to interrupt that when that mutex is still locked, since pthread_cond_wait will only return after locking that mutex. So, instead, boost's condition variable waits not on the mutex the user provided, but on its own internal mutex. To avoid race conditions, that mutex is locked on every operation, including when sending notifications.

Reader-writer locks

Boost's reader-writer locks (boost::shared_mutex) store 8-byte state protected by a mutex, together with three condition variables. (So, using glibc, these are absolutely huge: 312 bytes, spanning five cache lines.) (Implementation here.)

The state contains a 32-bit reader lock counter, and three boolean flags: writer locked, writer waiting, and upgrade. (The last one is only used for upgrading locks (read lock → writer lock).)

One condition variable is used for notifying writers, one for notifying readers, and one is used for upgrading locks.

Every operation locks the internal mutex before it inspects or changes the state. If the state doesn't allow for the thread to continue yet, it waits for it to change using the right condition variable.

Unlocking a writer lock or unlocking a the last reader lock results in notifying both the reader and writer condition variable. One writer and all readers are woken up.

This implementation isn't terribly efficient, but it's easy to prove correct as it's all protected by a single mutex.

These locks prefer writers. A waiting writer on a read-locked lock prevent new readers from acquiring the lock, to avoid writer starvation.

(There is also a 'v2' implementation which is similar but slightly smaller. (224 bytes on glibc.) It does not support upgrading and only uses two condition variables, and it only notifies one condition variable (either the readers or the writers, not both) on unlock.)


tl;dr: Boost on Posix platforms uses only standard pthread locks, and does not directly use any platform-specific things like futexes. The mutex is just whatever the underlying libc provides (see glibc and musl above). The condition variable is mostly a wrapper around libc's version, except for an additional mutex for a trick to provide thread cancellation. But its read-writer locks are not just pthread_rwlock_t, and instead implemented on top of a mutex and three condition variables, making them very big. It prefers writers to avoid writer starvation.

pcwalton commented 2 years ago

Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.

I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.

Do we need a 64-bit counter? A futex (on Linux and BSD, today) is 32-bit, so having 64-bit counters makes things more complicated. A 32-bit counter could potentially overflow all the way until it reaches the same value while unlocking the mutex in condvar.wait(), although that seems extremely unlikely.

It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety. This stuff can be surprisingly attacker-controllable if exploited in the right way.

m-ou-se commented 2 years ago

Futexes on Linux seems like a good plan. Remember to read "Futexes Are Tricky" if we're going down that road: https://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf This is basically an explanation of glibc's mutex implementation, as I understand it.

That's an old version of that paper. A newer version is available here.

It explains the design of basically the most simple futex-based mutex by taking a bit of a detour through an obviously incorrect mutex first. (And a second implementation based on the assumption that Atomic*::swap does not exist.) The final simple mutex implementation it describes is indeed what glibc's mutex comes down to in practice. As mentioned above, it is identical to the mutex I implemented in my experiment before I started investigating the various existing implementations: https://github.com/m-ou-se/futex-lock-experiment/blob/ab0e9a135aff5f3b2e01218a3281cbc552d76653/src/lib.rs#L30-L109

The paper does not describe condition variables or read writer locks, which are far more interesting.

I would highly recommend just copying an existing mature C/C++ implementation of concurrency-primitives-built-on-futexes instead of trying to invent something new.

We haven't invented anything new here. My 'newly invented' experimental mutex implementation turns out to be identical to the one in glibc and the one in the paper you mentioned. My 'newly invented' experimental condition variable turns out to be identical to the one in GLib and Wine. (Not that these will necessarily be the ones we continue with.)

Of course we should be and will be very careful with going with anything that differs from what we already see in other implementations. The goal is not to 'invent something new', we are exploring what existing implementations do and learning which of their quirks and complexity is and isn't relevant for us, to be able to make an informed decision on what to copy and what not to copy.

My conclusion from all the C and C++ implementations I've investigated so far, is that a lot of their complexity comes from requirements that we do not have in Rust, such as allowing dropping while still in use, detecting misuse that Rust already statically detects through the borrow checker, runtime configuration such as whether to allow re-entrance or whether to prefer readers or writers, allowing thread interruption/cancellation, etc. etc. Just copying an existing implementation could mean we end up with a very complex Condvar that will be extremely hard to review and maintain, or an absolutely huge RwLock of 312 bytes that could've been 4 bytes.

It's not so much accidental overflow that is the concern here, but rather attacker-controlled overflow causing memory unsafety.

You cannot cause memory unsafety with such an overflow. The Condvar implementation in my experiment doesn't even use any unsafe code at all. Missing exactly 4294967296 notifications while unlocking the mutex in Condvar::wait would only cause the thread to go to sleep when it should not. While this may in some situations result in denial of service, that's not really an issue in the situation where an attacker has already gained enough control to completely block the thread in the exact right moment for long enough to make it miss over four billion notifications.

sdroege commented 2 years ago

GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.

m-ou-se commented 2 years ago

GLib has a quite simple futex mutex/cond implementation without any bells and whistles, if you want to take a look at another one. It's probably not very different from the one you have though, there's only so many things you can possibly do differently.

Thanks. It's indeed basically identical, except it uses an atomic add rather than a compare exchange when locking the mutex, although it does overwrite it with a 2 right afterwards if it is incremented higher than 1. This means that the state of the mutex can temporarily get higher than 2 and has a slightly higher chance of unnecessary EAGAIN returning futex wait calls. But (at least on x86) results in slightly simpler machine code. It also means it's theoretically possible to overflow the mutex causing memory unsafety.

kprotty commented 2 years ago

Would also like to recommend semaphore based synchronization primitives if it hasn't been considered already, and if simplicity (but not necessarily throughput) is a top priority.

Related, but for a performant, futex-based, RwLock there's also this algorithm.

m-ou-se commented 2 years ago

Here is a quick analysis of Apple Darwin's pthread implementation:

I'm not going into as much detail as the other implementations. MacOS has native (undocumented) syscalls for things like mutexes and rwlocks (psynch), but also supports a not-fully-documented futex-like syscall (ulock_wait). By default, the pthread locking types use their psynch based implementation, but there is also an alternative ulock 'futex based' implementation.

All of pthread_mutex_t, pthread_cond_t and pthread_rwlock_t in this implementation contain fields which need a higher alignment than the alignment of the struct itself. This issue is worked around by adding padding after the fields, and dynamically calculating the offset of the field. This means that this now seems to be the first implementation I've seen where the types are truly not movable. Moving one of these to a differently aligned address would shift around those 'overaligned' fields.

pthread_mutex_t is 64 bytes, pthread_cond_t is 48 bytes, and pthread_rwlock_t is 20 bytes.

Mutex

The mutex locking functions do not seem to spin in user space before calling into the underlying layer to wait for the mutex. They do atomic compare-exchange operations to avoid these calls on an uncontended lock, but do not spin for some time to wait for the lock to be released.

Similar to all other implementations, it also avoids the wake/unlock syscall on unlock when there is no contention.

Condition variables

Neither of the implementations of condition variables seem to requeue waiters onto the queue of the mutex, but instead wake up all threads at once. I can't see what __psynch_cvbroad ends up doing in the kernel, but it does not get a pointer to the mutex, so requeueing seems out of the question.

The struct does store a pointer to the mutex used by the wait operation, but it is only used for checking correct usage, and cleaning up when a thread gets cancelled.

Futex-based Condition variable

The most important part of the futex-based condition variable is the 64-bit state, containing a 32-bit sequence/notification counter, a 16-bit waiter counter, and a 16-bit counter for unconsumed signals. From a brief inspection of the code, it seems that it suffers from the same issue as the trivial condition variable: if it misses exactly 4294967296 notifications while unlocking the mutex, it can go to sleep when it shouldn't. That's nearly impossible in practice though.

The two 16-bit counters make it possible for pthread_cond_signal and pthread_cond_broadcast to skip the wakeup syscall in case no thread is waiting, and also in case all waiting threads have already been signalled but are still waking up.

Reader-writer lock

The default reader-writer lock prefers writers in the sense that new readers on a read-locked lock are blocked if there are any writers waiting. However, those new readers do seem to get priority over any writers that come later. That is, all readers and writers seem to be part of the same queue.

m-ou-se commented 2 years ago

And here's some details of FreeBSD's pthread implementation:

All of the pthread locking types are 'boxed' on FreeBSD. They are just a single pointer, pointing to the heap-allocated implementation. (This means that today, std::sync::Mutex on FreeBSD is effectively a Box<Box<ActualMutex>>. :( )

The kernel directly supports native syscalls for mutexes, condition variables, and read writer locks. These types are not futex-based, so of less interest to us.

Locking optionally spins with a spin loop hint, and then optionally spins with thread::yield(), before calling the locking syscall. Neither is enabled for the default mutex type.

Mutexes contain a bunch of pointers to make them part of several intrusive (doubly) linked lists, which are used to keep track of all the currently locked (and condition-variable-waited-for) mutexes used by a process, which is used for handling terminated threads/processes, and to patch up ownership when forking.

Broadcast notifications on condition variables do not result in the kernel requeueing the waiters to the mutex, but simply wake up all threads.

m-ou-se commented 2 years ago

Based on everything I've seen in the other implementations, I think these are some of the relevant design questions for futex-based locks:

1. Prevent unnecsesary wake syscall when unlocking without contention?

Univerally the answer in all implementations is yes. Without this, even an uncondented lock would require syscalls.

However, there is still a trade-off to be made about how accurately this should be done. Most implementations only keep a single 'contended' bit, which means the last unlock in a series of contended unlocks cannot know it was the last one, and will do an unnecessary syscall. This is true for at least: Glibc, Glib, Musl, and Wine.

Some implementations track more information so they know exactly if there are any waiting threads, removing an additional unnecssary syscall in the contended case. How much this is worth is debatable. This is done by at least: Microsoft, and parking_lot.

(Answers: No (none) / Yes, but not fully accurate (most) / Yes, fully accurate (some))

2. Do we keep the wait queue(s) in user-space?

Microsoft's SRW locks and condition variables, musl's condition variable, and parking_lot all keep their own waiting queue in user space, giving them full control over it and making it possible to avoid wake syscalls if there are no threads waiting.

Glibc, Wine, Glib, Apple, FreeBSD and musl (except for musl's condition variables) all do not construct any queues in user space, and leave it to the kernel to handle that. (Though Apple and FreeBSD might not be very relevant when designing for Linux, as the kernel most likely behaves differently.)

(Answers: Queues in user-space (some) / Leave it to the kernel (most))

3. Spin before the syscall when locking?

Musl, Microsoft, and parking_lot spin for a while when attempting to lock an already locked mutex before resorting to a syscall.

GLibc (by default), Glib, Wine, Apple, FreeBSD (by default) do not do this.

(Answers: Yes (some) / No (most))

4. Prefer writers over readers to avoid writer starvation?

Microsoft, Wine, Boost, parking_lot, and Apple all block new readers when there's a writer waiting. Glibc (by default) and Musl do not do this.

(Answers: Yes (most) / No (some))

5. Requeue on notify-all/broadcast?

Musl and parking_lot only wake up one waiter on notify_all (aka broadcast), and requeue the other waiters to the wait queue of the mutex, only to be woken up when the first thread unlocks the mutex.

Glibc, Glib, Wine, Microsoft, Apple, and FreeBSD simply wake up all threads and let them race for the lock.

(Answers: Yes, requeue (some) / No, wake up all (most))

6. Suppress spurious wake-ups?

Glibc and parking_lot avoid spurious wake-ups when waiting on a condition variable. If the wait syscall spuriously wakes up, they will repeat the call without returning.

Glib, Musl, and Wine do not do this. (I'm not sure if and when spurious wake-ups happen here on MacOs or FreeBSD, but they do not seem to avoid them in user space.)

(Answers: Yes (some) / No (most))

7. Prevent unnecsesary wake syscall when notifying a condition variable without waiters?

Trivial implementations like Wine and Glib do not do this. notify_one and notify_all always do a syscall.

Microsoft, Musl, and parking_lot avoid it through their user-space queue. This has the downside that notify_all does one syscall per waiting thread rather than one syscall in total.

Glibc avoids it while still using only one syscall, at the cost of higher complexity. (Leading to the time traveling signal bug in the first implementation, as mentioned above.)

(Answers: Always do a syscall on notify / notify_all does one syscall per waiting thread (so, zero if there's none) / notify_all does one syscall in total if there's any waiting thread (so, zero if there's none))

joboet commented 2 years ago

I would really like to see benchmarks for most of these choices (1, 2, 3, 5, 7), as they trade off simplicity for performance, so they'll be easier to make when you know how much performance you lose.

  1. I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving. We should try to avoid the other kind of UB (unexpected behaviour) wherever possible.
RalfJung commented 2 years ago

I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving. We should try to avoid the other kind of UB (unexpected behaviour) wherever possible.

As mentioned before, this is not clear-cut: that approach would mean that if a thread that holds a read lock acquires another read lock (entirely harmless by Rust standards, just two shared references in the same thread), this can lead to a dead-lock if there is a writer waiting. This will only happen rarely though, making it a pain to debug.

In other words, there is unexpected behavior either way. Avoiding unexpected behavior might be possible by making it so that when a writer is waiting, no new threads will be granted read locks, but threads already holding a read lock will be granted further read locks -- but implementing that will probably cost too much performance, I would assume.

m-ou-se commented 2 years ago

I would really like to see benchmarks for most of these choices (1, 2, 3, 5, 7), as they trade off simplicity for performance, so they'll be easier to make when you know how much performance you lose.

It's very tricky to make realistic benchmarks, or even decide what 'realistic' really is. There are many different usage patterns of locks, and optimizing a lock implementation for a specific benchmark is easy but not that useful.

Many of the choices cannot be made independently from each other, and have consequences for the complexity (and size) of the lock. There is also value in keeping them simple and small, next to keeping them performant.

I think it is very much in the spirit of Rust's focus on correct concurrency to prevent writers from starving.

Yes, Pretty much every implementation does this, except the ones that strictly follow the Posix requirements. The main downside is that recursive read locking might result in a deadlock, as Ralf explained.

m-ou-se commented 2 years ago

We discussed the design of several possible implementations in the library team meeting yesterday. Here's a summary of our most important conclusions:

m-ou-se commented 2 years ago

Started the Linux implementation for Mutex and Condvar here: https://github.com/rust-lang/rust/pull/95035

This (draft) implementation does not (yet) spin before locking.

m-ou-se commented 2 years ago

Windows' SRW lock's user-space queue is very similar to parking_lot's internal WordLock, and an interesting alternative for parking_lot as it works on all possible platforms. It doesn't need a futex; it can use anything that can be used as a thread parker. The downside is that the locks become a pointer in size rather than just a byte (or two bits), but the advantage is that avoids the complexity of a global parking lot hashmap.

I should note that it is also a very reasonable option to use this kind of implementation on all platforms, including those where futexes are available.

I have a preference for using the simpler futex-based locks on platforms where that is possible, but I believe @Amanieu has a slight preference for using the SRW/WordLock-style user-space queue based locks on all platforms.

We should consider the trade offs between these before merging any changes.

m-ou-se commented 2 years ago

For MacOS, I've heard some concerns about not using the system's locks directly. The standard locks interact well with the QoS system and handle things like priority inversion. If we implement our own locks on MacOS, we'll lose that part, which might be problematic.

We might be able to use os_unfair_lock rather than pthread_mutex_t, though.

it does not have timeouts for os_unfair_lock_lock/os_unfair_lock_trylock, so it is obviously out.

We only support timeouts for waiting on condition variables, not for locking mutexes, so that could be fine.

Also it doesn't support Condvar, which is required for Rust locks.

The Condvar implementation could be independent from the Mutex implementation, such as the trivial one in my experiment. (It'll be a bit more complicated if we can't use futex wait/wake, though.)

bjorn3 commented 2 years ago

Another argument in favor of futexes is that futexes allow priority inheritance using variants of the wait and wake commands. They also allow the kernel to favor higher priority threads I think.

m-ou-se commented 2 years ago

@bjorn3 Yes. But the priority inheritance version of those operations also expect us to store the thread id in the AtomicI32, which is not something we'd need otherwise. Getting the thread ID means accessing a thread local on every lock operation (even when uncontended), which can be significant in some situations. Maybe it can be an optional feature, or a separate type.

joboet commented 2 years ago

There is a very simple, const implementation of ReentrantMutex in lock_api which just uses a regular mutex along with two extra usize for storing the thread id and a counter, so std can eliminate the dependence on the pthread locks entirely. Also, it might be nice to make that mutex public.

kprotty commented 2 years ago

@m-ou-se For the last bullet point, the global map could be an intrusive tree to avoid dynamic heap allocation. Regardless, wouldn't this approach need to be considered for at least Condvar for simulating futex on systems that dont have OS provided futexes and aren't windows (e.g. netbsd if not relying on latest version, darwin if not willing to use __ulock)?

nbdd0121 commented 2 years ago
  • works on all possible platforms. It doesn't need a futex; it can use anything that can be used as a thread parker. The downside is that the locks become a pointer in size rather than just a byte (or two bits), but the advantage is that avoids the complexity of a global parking lot hashmap.

I want to point out that byte sized mutexes are not ideal performance-wise on all platforms. For example, MIPS and RISC-V only have 32/64 bit atomics instruction and 8-bit/16-bit atomics need to be mimicked with CAS. On these platforms the "downside" is less significant, comparing a pointer with a u32.

Also given that we currently use a boxed pthread mutex, changing it to a pointer is a gain from status quo anyway.

m-ou-se commented 2 years ago

Related, but for a performant, futex-based, RwLock there's also this algorithm.

@kprotty Thanks! Will take a look at it.

Great work on https://github.com/kprotty/usync by the way. :)

scottlamb commented 2 years ago

Worth also looking at IMHO: Google's absl::Mutex type (design notes, code, guide). Some interesting things about it:

bjorn3 commented 2 years ago

Basically this library is a fancy thread pool that uses real kernel-level threads but is similar to green threads in that it uses mostly user-level scheduling (having only O(cpus) threads available to the kernel scheduler) to improve efficiency and latency.

How much of those efficiency gains are only existing in case of relatively homogeneous workloads on big servers with lots of ram and fast io vs on an average desktop system with rapidly changing workloads and relatively little ram and slower io? The kernel scheduler should be able to adapt better for the later use case due to having knowledge about the whole system, right? While google's fiber library only knows about the local (or cooperating) processes albeit with more details.

scottlamb commented 2 years ago

AFAIK, the only scenario Google's fiber schedulers have been used is where a particular process is handed some more-or-less fixed slice of a machine's CPU to use (some fractional number of cores/hyperthreads) via Linux container groups. The efficiency advantages only come up when there are lots of busy threads within that process. The goal is to allow most of the efficiency of async with an easier programming model. I'm happy to discuss more, but maybe somewhere else? I don't want to derail this issue.

m-ou-se commented 2 years ago

Worth also looking at IMHO: Google's absl::Mutex type

It's definitely an interesting mutex. I worked at Google ~8 years ago, so I'm familiar with at least some parts of its design, since absl's version is similar in usage to the mutex that was around internally back then. Tracking the wait conditions inside of it is definitely very ergonomic in many situations, but it makes it a bit of a different beast than our std::sync::Mutex. So that's something for a community crate rather than the standard library, at least for now.

scottlamb commented 2 years ago

I worked at Google ~8 years ago, so I'm familiar with at least some parts of its design, since absl's version is similar in usage to the mutex that was around internally back then.

Ahh, yeah, it's just a new name. absl::Mutex is written everywhere in the internal codebase now, and I think the open source version matches it almost exactly. (There might be a few code blocks marked Google internal that get skipped on export.)

Tracking the wait conditions inside of it is [...] something for a community crate rather than the standard library, at least for now.

Makes sense.

The wait conditions are convenient, but the biggest reason I brought up absl::Mutex is that I hope to have something like its instrumentation some day.