
Tracking issue for improving std::sync::{Mutex, RwLock, Condvar} #93740

Closed (m-ou-se closed this issue 1 year ago)

m-ou-se commented 2 years ago

A few weeks ago, on January 12th, the @rust-lang/libs team discussed a plan to improve std::sync::Mutex, std::sync::Condvar, and std::sync::RwLock.

On most platforms, these structures are currently wrappers around their pthread equivalent, such as pthread_mutex_t. These types are not movable, however, forcing us to wrap them in a Box, resulting in an allocation and indirection for our lock types. This also gets in the way of a const constructor for these types, which makes static locks more complicated than necessary.

@Amanieu presented his library, parking_lot, which implements the 'parking lot' structure from WebKit. Rather than letting the operating system handle any waiting queues, it manages the queues in user space, in a global hash table indexed by the address of the lock. This allows for flexible 1-byte mutexes with almost no restrictions.

There has been an attempt at merging parking_lot into std, as a replacement for the implementation of std's locking primitives. However, many different blockers came up and that PR did not get merged. (See also my RustConf talk for some of this history.) Many of these blockers have been resolved since.

One of the problems with replacing std's lock implementations with parking_lot is that parking_lot allocates memory for its global hash table. A Rust program can define its own custom allocator, and such a custom allocator will likely use the standard library's locks, creating a cyclic dependency problem: you can't allocate memory without locking, but you can't lock without first allocating the hash table.

A possible solution is to use a static table with a fixed number of entries, which does not scale with the number of threads. This could work well in many situations, but can become problematic in highly parallel programs and systems. Another possible solution is to skip the global allocator and directly allocate this hash table with a native system API such as mmap on Unix. (See this thread for some discussion.)
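As an illustration of that second workaround, here is a minimal sketch using the libc crate on a Unix target (not actual std code; error handling is reduced to an assert):

```rust
use std::ptr;

// Reserve zeroed, page-backed memory for the hash table directly from the
// kernel, bypassing the Rust global allocator (and any locks it might take).
unsafe fn alloc_table(bytes: usize) -> *mut u8 {
    let addr = libc::mmap(
        ptr::null_mut(),                    // let the kernel pick the address
        bytes,
        libc::PROT_READ | libc::PROT_WRITE,
        libc::MAP_PRIVATE | libc::MAP_ANON, // anonymous, process-private mapping
        -1,                                 // no backing file descriptor
        0,
    );
    assert!(addr != libc::MAP_FAILED, "mmap failed");
    addr.cast::<u8>()
}
```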

In our meeting, we discussed what's natively available on various operating systems/platforms:

After some discussion, the consensus was to provide the locks as the 'thinnest possible wrapper' around the native lock APIs, as long as those are still small, efficient, and const-constructible. This means SRW locks on Windows, and futex-based locks on Linux, some BSDs, and Wasm; for other platforms without suitable native APIs, a parking_lot-based implementation using one of the possible workarounds for the allocation problem.

This means that on platforms like Linux and Windows, the operating system will be responsible for managing the waiting queues of the locks, such that any kernel improvements and features like debugging facilities in this area are directly available for Rust programs.

It also means that porting the standard library to a new platform will be easier, and maintenance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.
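To make the futex-based half of this concrete, here is a rough sketch of a three-state futex mutex on Linux, written against the libc crate. It illustrates the technique; it is not the actual std implementation:

```rust
use std::sync::atomic::{AtomicU32, Ordering::{Acquire, Release}};

// State: 0 = unlocked, 1 = locked (no waiters), 2 = locked (maybe waiters).
pub struct FutexMutex {
    state: AtomicU32,
}

impl FutexMutex {
    // No allocation and no OS object to initialize, so `new` can be `const`.
    pub const fn new() -> Self {
        Self { state: AtomicU32::new(0) }
    }

    pub fn lock(&self) {
        // Fast path: uncontended 0 -> 1 transition.
        if self.state.compare_exchange(0, 1, Acquire, Acquire).is_err() {
            self.lock_contended();
        }
    }

    fn lock_contended(&self) {
        // Mark the lock as contended and sleep until woken, then retry.
        while self.state.swap(2, Acquire) != 0 {
            futex_wait(&self.state, 2);
        }
    }

    pub fn unlock(&self) {
        if self.state.swap(0, Release) == 2 {
            // Someone may be sleeping in the kernel: wake one waiter.
            futex_wake_one(&self.state);
        }
    }
}

fn futex_wait(futex: &AtomicU32, expected: u32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAIT | libc::FUTEX_PRIVATE_FLAG,
            expected,
            std::ptr::null::<libc::timespec>(), // no timeout
        );
    }
}

fn futex_wake_one(futex: &AtomicU32) {
    unsafe {
        libc::syscall(
            libc::SYS_futex,
            futex as *const AtomicU32,
            libc::FUTEX_WAKE | libc::FUTEX_PRIVATE_FLAG,
            1, // wake at most one thread
        );
    }
}
```

The whole lock is a single AtomicU32; the kernel only gets involved on the contended path, and the waiting queue lives entirely in the kernel.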

Update: We've decided not to directly use parking_lot's one-byte (two-bit) locks, but instead to use the equivalent of its internal WordLock or Windows' SRW locks, which are one pointer in size and require no global state in the program. That solves the allocation problem.

Update 2: To maintain the priority inheritance of the native locks (e.g. on macOS), we've kept using pthread locks on non-futex Unix platforms, at least for now. With https://github.com/rust-lang/rust/pull/97647, we've made it possible for these locks to have a const fn new.


Primary goals:

Possible later goals:


To do:

m-ou-se commented 2 years ago

For *BSD:

OpenBSD has futex(), supporting FUTEX_WAKE, FUTEX_WAIT, and FUTEX_REQUEUE.

NetBSD has SYS___futex, supporting FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAIT_BITSET, and FUTEX_WAKE_OP.

FreeBSD supports Linux's futex() in its Linux emulation layer, but natively only provides similar functionality through _umtx_op(UMTX_OP_WAKE) and _umtx_op(UMTX_OP_WAIT_UINT). Those are almost identical, except that the wake operation does not return the number of threads it woke up, which is something our current futex-based RwLock uses.

Similarly, DragonFlyBSD has umtx_sleep() and umtx_wakeup(), where the wake function does not return the number of awoken threads.

m-ou-se commented 2 years ago

the wake operation does not return the number of threads it woke up, which is something our current futex-based RwLock uses.

This doesn't break the implementation; it just means that when there are both readers and writers waiting, unlocking will wake up one writer and all readers, instead of only a writer. I believe Windows' SRW lock does the same.

thomcc commented 2 years ago

NetBSD has SYS___futex, supporting FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAIT_BITSET, and FUTEX_WAKE_OP.

My understanding is that on NetBSD (like most BSDs), the syscall interface is not considered stable; is this not the case?

pitdicker commented 2 years ago

For *BSD:

OpenBSD has futex(), supporting FUTEX_WAKE, FUTEX_WAIT, and FUTEX_REQUEUE.

NetBSD has SYS___futex, supporting FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAIT_BITSET, and FUTEX_WAKE_OP.

FreeBSD supports Linux's futex() in its Linux emulation layer, but natively only provides similar functionality through _umtx_op(UMTX_OP_WAKE) and _umtx_op(UMTX_OP_WAIT_UINT). Those are almost identical, except that the wake operation does not return the number of threads it woke up, which is something our current futex-based RwLock uses.

Similarly, DragonFlyBSD has umtx_sleep() and umtx_wakeup(), where the wake function does not return the number of awoken threads.

@m-ou-se If it helps: at some point I started playing with the futex implementations on various platforms. To be honest, I have forgotten the details, but maybe you can find something useful in it: https://github.com/pitdicker/valet_parking

m-ou-se commented 2 years ago

@pitdicker Thanks! That's a great overview of what's available on all the different platforms!

LifeIsStrange commented 2 years ago

@m-ou-se @Amanieu

So, have there been benchmarks of parking_lot's WordLock vs. futexes vs. SRW locks? (I have already seen the surprisingly good results from microbenchmarking the two-bit lock against the futex lock, although I feel more diverse usage patterns would be great.) An interesting data point is that parking_lot's non-word lock can be twice as fast when tested with <= 6 threads (https://github.com/Amanieu/parking_lot/issues/33), which is the case for most multithreaded apps. Another unknown is the performance of the stdlib's Windows SRWLock vs. this SRW-lock crate: https://github.com/kprotty/usync

Also, note that futex2 has been a thing since Linux 5.16 (https://www.phoronix.com/scan.php?page=search&q=FUTEX2). I wonder if you could leverage it, where supported, for improved performance.

Also, I think there is at least one optimization in parking_lot worth backporting: transactional memory (https://en.m.wikipedia.org/wiki/Transactional_Synchronization_Extensions).

It allows a 40% performance gain on memory access by using mutexes only as a fallback, and it offers a backward-compatible API so that one doesn't have to write two code paths (one for modern CPUs and one for legacy ones).

parking_lot uses TSX hardware lock elision for RwLock here: https://github.com/Amanieu/parking_lot/blob/master/src/elision.rs. As you can see, the code is very concise.

bjorn3 commented 2 years ago

TSX has been plagued with issues since its inception. There have been timing bugs (which can crash your system), and it has been used in side-channel attacks to e.g. break KASLR. Microcode updates have been provided that disable it, and newer CPUs are entirely missing hardware lock elision support, I believe. Because of this I don't really think there are advantages to using it anymore. It also has the disadvantage of making it impossible to step into a lock operation with a debugger, and it may break rr (not sure about the latter though).

slanterns commented 2 years ago

Also, note that futex V2 is a thing since linux 5.16

Kernel 5.16 only brings futex_waitv, which might not be so useful here. Variable-sized (u8 / u16 / u32 / u64) futexes (and NUMA awareness) have not been implemented yet.

kprotty commented 2 years ago

So, has there been benchmarks over parking_lot word lock vs futexes vs SRW locks?

usync includes throughput benchmarks from parking_lot's repo with the listed locks for comparison (usync vs stdlib vs parking_lot vs SRWLOCK-or-pthread).

An interesting data point is that parking_lot non-word lock can be twice faster when tested with <= 6 threads

The PR that was linked seems to be unrelated to the claim. Was the issue number off?

Parking_lot use TSX hardware lock elision for RwLock here:

HLE in parking_lot is disabled by default and hidden under a feature flag (hardware-lock-elision). I also tried implementing it in usync at one point, but it didn't seem to affect throughput in RwLock benchmarks on the x86 machines HLE was benched on, so it was removed.

m-ou-se commented 2 years ago

There is a very simple, const implementation of ReentrantMutex in lock_api which just uses a regular mutex along with two extra usize for storing the thread id and a counter, so std can eliminate the dependence on the pthread locks entirely.

@joboet I forgot to respond, but this happened in https://github.com/rust-lang/rust/pull/96042

Also, it might be nice to make that mutex public.

That's certainly a possibility, but that requires a (small) RFC that includes the motivation and full API. One interesting question would be what to do with poisoning for such a ReentrantMutex: std's other locks have poisoning, but I'm not sure if any new locks should have that, although consistency is also important. (But that's not a discussion we should have here on this issue! That's for the RFC or a separate PR/issue/Zulip/internals thread.)
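For illustration, the design quoted above (a regular, non-reentrant lock plus an owner id and a counter) can be sketched as follows. The spin-based RawMutex stand-in and the thread-local-address id are simplifications assumed for this sketch; they are not what lock_api or std actually use:

```rust
use std::cell::UnsafeCell;
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering::{Acquire, Relaxed, Release}};

// Stand-in for a non-reentrant, const-constructible lock; a real
// implementation would block on the platform mutex instead of spinning.
struct RawMutex(AtomicBool);

impl RawMutex {
    const fn new() -> Self { Self(AtomicBool::new(false)) }
    fn lock(&self) {
        while self.0.swap(true, Acquire) { spin_loop(); }
    }
    fn unlock(&self) { self.0.store(false, Release); }
}

fn thread_id() -> usize {
    // Cheap per-thread identifier: the address of a thread-local (sketch only).
    thread_local!(static ID: u8 = 0);
    ID.with(|id| id as *const u8 as usize)
}

pub struct ReentrantMutex<T> {
    raw: RawMutex,
    owner: AtomicUsize,       // thread_id() of the owner, 0 if unowned
    depth: UnsafeCell<usize>, // recursion count, only touched by the owner
    data: T,
}

// Only the owning thread touches `depth` and `data` at any given time.
unsafe impl<T: Send> Sync for ReentrantMutex<T> {}

impl<T> ReentrantMutex<T> {
    pub const fn new(data: T) -> Self {
        Self {
            raw: RawMutex::new(),
            owner: AtomicUsize::new(0),
            depth: UnsafeCell::new(0),
            data,
        }
    }

    /// Lock and run `f`; recursive calls from the owning thread do not block.
    /// Panic safety (a guard type) is omitted for brevity.
    pub fn with_lock<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        let me = thread_id();
        if self.owner.load(Relaxed) == me {
            // Already held by this thread: just bump the recursion count.
            unsafe { *self.depth.get() += 1 };
        } else {
            self.raw.lock();
            self.owner.store(me, Relaxed);
            unsafe { *self.depth.get() = 1 };
        }
        let result = f(&self.data);
        unsafe {
            *self.depth.get() -= 1;
            if *self.depth.get() == 0 {
                self.owner.store(0, Relaxed);
                self.raw.unlock();
            }
        }
        result
    }
}
```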

m-ou-se commented 2 years ago

To unblock const Mutex::new, RwLock::new and Condvar::new, https://github.com/rust-lang/rust/pull/97647 changes these functions to no longer allocate on any platform, by instead using lazy allocation on the platforms that still use pthread.
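The core of that lazy-allocation trick, as a standalone sketch (an illustration of the idea, not the code from the PR): the address-stable inner lock is boxed on first use, so the outer new can be const and allocation-free.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering::{AcqRel, Acquire}};

// A const-constructible "box that is allocated on first use".
pub struct LazyBox<T> {
    ptr: AtomicPtr<T>,
}

impl<T> LazyBox<T> {
    pub const fn new() -> Self {
        Self { ptr: AtomicPtr::new(ptr::null_mut()) }
    }

    pub fn get_or_init(&self, init: impl FnOnce() -> T) -> &T {
        let mut p = self.ptr.load(Acquire);
        if p.is_null() {
            let boxed = Box::into_raw(Box::new(init()));
            p = match self.ptr.compare_exchange(ptr::null_mut(), boxed, AcqRel, Acquire) {
                Ok(_) => boxed,
                Err(existing) => {
                    // Lost the race: drop our allocation, use the winner's.
                    unsafe { drop(Box::from_raw(boxed)) };
                    existing
                }
            };
        }
        unsafe { &*p }
    }
}

impl<T> Drop for LazyBox<T> {
    fn drop(&mut self) {
        let p = *self.ptr.get_mut();
        if !p.is_null() {
            unsafe { drop(Box::from_raw(p)) };
        }
    }
}
```

A pthread-based lock can then live behind such a lazily allocated box: its address stays stable once created, while the outer type remains movable and const-constructible.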

nbdd0121 commented 2 years ago

I've noticed that the description has changed from "use a thread parker" to "use lazy allocation" (which has just been merged). I agree that lazy allocation is the best short-term option, but it should be noted that #85434 will not be fixed with lazy allocation; to fix it we still need a thread-parker-based implementation (although this is no longer a blocker for const).

joboet commented 2 years ago

I've noticed that the description has changed from "use a thread parker" to "use lazy allocation" (which has just been merged). I agree that lazy allocation is the best short-term option, but it should be noted that #85434 will not be fixed with lazy allocation; to fix it we still need a thread-parker-based implementation (although this is no longer a blocker for const).

That would make it impossible to use priority inheritance, however, as we could not reliably reset the priority of an inheriting thread: it could have been changed while the lock was held.

For macOS, all problems would be solved if we could just use ulock: it is fast, lightweight, supports priority inheritance and even fancy things like spinning while the lock owner is active on another core. Does anyone know any Apple employees to contact about stabilizing this API? As the symbols are defined from 10.12 onwards, we would only need a fallback on older macOS versions, which seems justifiable to me (as long as Rust works OK on these systems, I think we can sacrifice a bit of performance in order to boost newer versions).

If that is not possible, we can use os_unfair_lock for Mutex (benefitting from all the nice things), falling back to a lazily allocated pthread_mutex on older versions (if the symbols are not present). We can then fit our implementations of condition variables around this. For RwLock, priority inheritance is not relevant anyway, so the userspace implementation can be used unconditionally. os_unfair_lock is, strictly speaking, not guaranteed to be movable when unlocked, but its API implies that it is (there are no destroy functions and it is initialized with a constant).
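For reference, a Darwin-only sketch of what the os_unfair_lock half could look like; the FFI declarations mirror <os/lock.h>, but the wrapper itself is only an illustration (the condition variable and the older-macOS fallback are out of scope here):

```rust
use std::cell::UnsafeCell;

// FFI for Darwin's os_unfair_lock (a real libSystem API); the Rust layout
// below mirrors its single-u32 struct, and OS_UNFAIR_LOCK_INIT is all zeroes.
#[repr(C)]
struct OsUnfairLock(UnsafeCell<u32>);

extern "C" {
    fn os_unfair_lock_lock(lock: *mut OsUnfairLock);
    fn os_unfair_lock_unlock(lock: *mut OsUnfairLock);
}

pub struct UnfairMutex {
    lock: OsUnfairLock,
}

// Interior mutability is confined to the FFI calls below.
unsafe impl Sync for UnfairMutex {}

impl UnfairMutex {
    pub const fn new() -> Self {
        Self { lock: OsUnfairLock(UnsafeCell::new(0)) }
    }

    pub fn lock(&self) {
        unsafe { os_unfair_lock_lock(&self.lock as *const _ as *mut _) }
    }

    /// Must be called by the thread that locked; os_unfair_lock enforces ownership.
    pub fn unlock(&self) {
        unsafe { os_unfair_lock_unlock(&self.lock as *const _ as *mut _) }
    }
}
```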

m-ou-se commented 2 years ago

I agree that lazy allocation is the best short-term option, but it should be noted that https://github.com/rust-lang/rust/issues/85434 will not be fixed with lazy allocation; to fix it we still need a thread-parker based implementation (although this is no longer a blocker for const).

We could also still do what you suggested and try_lock the lock before dropping or leaking it.

Does anyone know any Apple employees to contact about stabilizing this API?

I've tried through various routes, but it seems like we will not be able to use those APIs any time soon, unfortunately.

If that is not possible, we can use os_unfair_lock for Mutex […]

Yes, using os_unfair_lock together with a queue- or thread-parker-based condition variable would be a good improvement. I suppose technically the os_unfair_lock documentation doesn't guarantee it's movable, but it's just a u32 with no constructor or destructor functions, so it's nearly impossible for it to not be movable when not locked. However, we can only guarantee it's not moved while not borrowed, which doesn't necessarily mean that it's not locked, as the MutexGuard might have been leaked. It might be nice to have some confirmation that it's okay for the memory of a permanently locked os_unfair_lock to be re-used.

nbdd0121 commented 2 years ago

That would make it impossible to use priority inheritance however, as we could not reliably reset the priority of an inheriting thread as it could have been changed while the lock was held.

We don't have priority inheritance for our futex implementation either. The docs deliberately say that the locking behaviour is unspecified, so we can be flexible about the implementation.

For most users, a small and lightweight mutex would outweigh the lack of priority inheritance. (There's no way to set thread priority with just std anyway, so I'd argue that users shouldn't expect anything about priority inheritance.)

m-ou-se commented 2 years ago

How important priority inheritance is depends on the platform. Both macOS+iOS and Fuchsia folks have told me it's very important on their platforms.

kprotty commented 2 years ago

Is the main goal of priority inheritance to avoid priority inversion? If so, priority inversion feels inherently like an anti-pattern, since it requires threads of different priorities to contend on the same lock.

joboet commented 2 years ago

Is the main goal of priority inheritance to avoid priority inversion? If so, priority inversion feels inherently like an anti-pattern, since it requires threads of different priorities to contend on the same lock.

If priority inheritance is implemented, the pattern can be quite useful. For example, in an audio player, the decoding of the next song could be run in a background thread while holding a mutex. If the song needs to be played but it has not finished decoding, the high-priority player thread can lock the mutex, thereby raising the priority of the background thread until it finishes. This pattern is really hard to implement with channels or other synchronization primitives, but PI-mutexes make it easy.

Of course, these mutexes could be implemented in a crate and most users don't actually use priorities. However, as far as I understand (having not worked on any macOS specific projects), priorities are much more prevalent in idiomatic macOS programming, for example with the Dispatch framework, so respecting the platform default is definitely a good choice.

kprotty commented 2 years ago

One could argue that you shouldn't be using mutexes for such scenarios; instead, boost the background thread's priority directly and then wait for it to complete, or have the audio thread take over the decoding work itself via work stealing. PI mutexes in general (excluding os_unfair_lock) have too high a throughput cost for general use, so a crate, as you've mentioned, is a good alternative.

Darwin's Dispatch is a fairly inefficient platform to program with (concurrent queues are poorly scheduled, dispatch_async requires dynamic heap allocation, dispatch_io_read/write defers to poorly scalable fs thread pools, etc.). The recommended approach for Dispatch usage is to avoid blocking locks (which avoids priority-inversion situations) and instead have a small, fixed number of serial queues / "actors" at the necessary priorities, treating multi-priority locking as an anti-pattern here as well.

tmandry commented 2 years ago

I feel like priority inheritance is being dismissed too aggressively.

Instead, boost the background thread's priority directly and then wait for it to complete, or have the audio thread take over the decoding work itself via work stealing.

In a simple example like this, you could manage scheduling manually if the OS supports it. But can you explain why that is better than having a mutex do it for you?

Work stealing implies adopting a very different program structure to avoid a scenario that could easily be solved by your mutex implementation.

PI mutexes in general (excluding os_unfair_lock) have too high a throughput cost for general use

I do not know about the general landscape, but as you note, there are exceptions. Fuchsia builds priority inheritance directly into its futexes so it's not very expensive at all.

Darwin's Dispatch is a fairly inefficient platform to program with (concurrent queues are poorly scheduled, dispatch_async requires dynamic heap allocation, dispatch_io_read/write defers to poorly scalable fs thread pools, etc.).

Sure, but Dispatch doesn't exist to provide high-throughput concurrency primitives for general compute. Its goal is to enable concurrency/parallelism, without subtle bugs, in apps that prioritize user responsiveness and efficient use of system resources.

The recommended approach for Dispatch usage is to avoid blocking locks (which avoids priority-inversion situations) and instead have a small, fixed number of serial queues / "actors" at the necessary priorities, treating multi-priority locking as an anti-pattern here as well.

Unless I'm misunderstanding something in this thread, locks on Darwin don't cause priority inversions, because they inherit priority. The recommended practice of using queues over locks is generally useful because it can simplify code, reduce bugs, and reduce the need for context switching, but I strongly suspect there are many use cases where you do need a lock.

The example that @joboet gave is one. The situation is that you have set things up to make efficient use of system resources by not prioritizing urgent things, until they actually do become urgent. Then you want the system to be able to respond effectively. I don't know of a way to change the priority/QoS of a queue that affects already running tasks. That makes priority inheritance important for responsiveness.

You can say that the application developer needs to rewrite their code, but that doesn't help the user whose iPhone catches on fire every time they hit play. So the platform prioritizes fixing these situations just in time whenever possible.


I think the broader point I want to make is two-fold:

nbdd0121 commented 2 years ago

I do not know about the general landscape, but as you note, there are exceptions. Fuchsia builds priority inheritance directly into its futexes so it's not very expensive at all.

For platforms that do have PI-futexes, it might not be so expensive. But macOS doesn't provide PI-futexes.

The example that @joboet gave is one. The situation is that you have set things up to make efficient use of system resources by not prioritizing urgent things, until they actually do become urgent. Then you want the system to be able to respond effectively. I don't know of a way to change the priority/QoS of a queue that affects already running tasks. That makes priority inheritance important for responsiveness.

That's a reasonable use case for a PI mutex. But does it have to be the mutex provided by std? We currently don't have any guarantee that std mutex will do PI, and will certainly not guarantee it because not all platforms have PI. We could guarantee PI for a specific platform, but that'll make programs depending on PI less portable.

  • Removing priority inheritance on platforms that use it privileges one class of use cases over another. Microbenchmarks are great if what you care about is high-throughput compute. If you care about user responsiveness under 99th-percentile system load, they aren't very helpful. Furthermore, I suspect that priority inheritance exists on these systems for good reason.

It can also be said that requiring PI privileges one class of use cases over another.

I would also point out that we certainly don't use PTHREAD_PRIO_INHERIT currently, so it is incorrect to say that we are "removing priority inheritance" from a platform, because we don't have it in the first place.

My apologies if pthread mutex on macOS enables priority inheritance by default.


I don't have a problem with people using PI; I just think that it shouldn't be the default, especially if doing so would disadvantage users not requiring PI. It could be a third-party crate or a platform-specific extension.

joshtriplett commented 2 years ago

I think there are two independent questions here:

tmandry commented 2 years ago

I would also point out that we certainly don't use PTHREAD_PRIO_INHERIT currently, so it is incorrect to say that we are "removing priority inheritance" from a platform, because we don't have it in the first place.

My apologies if pthread mutex on macOS enables priority inheritance by default.

Ah, I probably overgeneralized! pthread mutex does enable PI by default on Fuchsia; there was a PR removing it, and I projected that onto the macOS discussion in this thread. Sorry.

  • Should we enable priority inheritance by default? Only if it has zero overhead.

Generally speaking I think we should follow the default or preferred method for each platform. If using non-PI locks is considered a hazard for the platform's use case and scheduler, then we should use PI by default on that platform. If it isn't clear what the preference is, then going with the least overhead approach makes sense.

comex commented 2 years ago

My apologies if pthread mutex on macOS enables priority inheritance by default.

It does.

m-ou-se commented 2 years ago

If the people working on macOS and Fuchsia say that priority inheritance in locks on their platforms is important, I see no reason to dismiss that. As of https://github.com/rust-lang/rust/pull/97647, changes to Rust's locks on those platforms are no longer blockers for const fn new, so there's no need to push for changes beyond what the maintainers of those platforms are comfortable with.

kprotty commented 2 years ago

But can you explain why that is better than having a mutex do it for you?

@tmandry A priority-inheritance mutex, for platforms that don't natively support it, requires creating a custom mutex around the thread id and atomically viewing/updating its priority for waiters. This involves external mutual exclusion and degrades the throughput of lock waiting under contention. Platforms which do support PI mutexes (FreeBSD's _umtx_op mutex operations and Linux's futex(FUTEX_LOCK_PI)), excluding things like Darwin's os_unfair_lock and Fuchsia's futex ownership, have even worse base throughput than their non-PI, futex-based equivalents. It penalizes the common case of not needing to be PI-aware.

Work stealing implies adopting a very different program structure to avoid a scenario that could easily be solved by your mutex implementation.

You could just as easily solve the situation by boosting all threads in the equation and using normal mutexes. For the case where ease of use is a priority, this may even be better. The argument for work stealing is that it's better latency-wise for a thread already running at high priority to take over the work it needs done than to boost another thread's priority, wait for it to finish, and wake it up. This is through the lens of scheduling efficiency, which I assume was the reason for using high-priority threads in the first place.

Its goal is to enable concurrency/parallelism, without subtle bugs, in apps that prioritize user responsiveness and efficient use of system resources.

While I can't claim any better on "subtle bugs" (excluding its infamous thread explosion from global queues and dispatch_sync) or "user responsiveness", libdispatch does not use system resources efficiently compared to other/custom scheduling policies, as shown by its needless heap allocation and poor concurrent-queue scaling. I may also be using a different baseline for "efficient".

That makes priority inheritance important for responsiveness.

Priority inheritance, in the general sense, is important, but it doesn't have to be achieved via the kernel. It can be achieved in userspace (e.g. work stealing, manual boosting), since it is the "task" that is of scheduling importance, not the threads themselves. Paradoxically, being offered by the OS may or may not be better than userspace solutions.

Knowing who's waiting on who can help it do its job more effectively

While this makes sense theoretically, it appears true in practice for some platforms (e.g. Darwin, maybe FreeBSD or Fuchsia) and less so for others (e.g. Windows, Linux, NetBSD), which is unfortunate.

Hiding this information and attempting to make those decisions locally within a process means every programmer is implementing their own scheduler using limited information, and this can easily lead to bad outcomes.

They effectively have to in order to achieve high throughput or low latency. OS-wide scheduling decisions (UMS, sleeping to yield, priority-based scheduling to some degree) don't scale for individual tasks. What makes programs fast or responsive are local decisions like cache access, userspace task scheduling, manual IO/compute batching, etc., as they avoid going through the OS for optimization decisions. Even more extreme methods of program optimization involve assuming that the program owns the entire machine / bypasses the OS (cache prefetching, thread pinning, io_uring SQ thread, etc.).

If you care about user responsiveness under 99th-percentile system load, they aren't very helpful

Benchmarks can record things like lock acquisition time statistics with latency percentiles to measure responsiveness. This is helpful for assessing properties like lock fairness and wait queue overhead under configurable low and high contention.
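A rough sketch of that kind of measurement (purely illustrative; a real harness would control contention, pin threads, warm up, and so on):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let lock = Arc::new(Mutex::new(0u64));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let lock = Arc::clone(&lock);
        handles.push(thread::spawn(move || {
            let mut samples = Vec::with_capacity(10_000);
            for _ in 0..10_000 {
                let start = Instant::now();
                let mut guard = lock.lock().unwrap();
                samples.push(start.elapsed()); // acquisition latency
                *guard += 1;
                drop(guard);
            }
            samples
        }));
    }
    let mut all: Vec<Duration> =
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    all.sort();
    println!("p99 acquisition latency: {:?}", all[all.len() * 99 / 100]);
}
```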


Regarding the Mutex provided by the Rust stdlib, I don't mean to advocate for not using PI-aware blocking/locking when it's advised by the platform or practically free in execution cost. However, I've hopefully explained why it shouldn't be a requirement, as an implementation detail, for Mutex on all or any specific platform. I do advocate for higher-throughput solutions even in the presence of PI-aware alternatives, although that doesn't seem to be a priority here (no pun intended).

RalfJung commented 2 years ago

I have a question about the high-level plan here: if I understood correctly, we now have (or will have) a generic Mutex/RwLock implementation on top of thread parking:

It also means that porting the standard library to a new platform will be easier, and maintenance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.

Is that still the plan? If it is, I assume the "generic" thread parker implementation will disappear, because instead we will have "generic" Mutex/RwLock implementations? (I didn't see such a work item in the amazingly detailed tracking at the top of this issue, hence the question.)

joboet commented 2 years ago

I have a question about the high-level plan here: if I understood correctly, we now have (or will have) a generic Mutex/RwLock implementation on top of thread parking:

It also means that porting the standard library to a new platform will be easier, and maintenance of the std implementation for the less common platforms will become easier. These platforms will now only have to implement a thread parker and can rely on a performant and correct implementation of the standard locks on top of that.

Is that still the plan? If it is, I assume the "generic" thread parker implementation will disappear, because instead we will have "generic" Mutex/RwLock implementations? (I didn't see such a work item in the amazingly detailed tracking at the top of this issue, hence the question.)

I've now submitted PRs replacing it on all currently supported platforms (#97140, #98391, #98534), so the generic parker can be removed, although it doesn't need to be.

RalfJung commented 2 years ago

Thanks!

So where can I find those generic parker-based Mutex/RwLock implementations? The one in sys_common seems to just wrap some platform-specific implementation.

joboet commented 2 years ago

Thanks!

So where can I find those generic parker-based Mutex/RwLock implementations? The one in sys_common seems to just wrap some platform-specific implementation.

Nowhere, because they are not implemented yet :wink:. The simplest way to do that would be to merge @kprotty's usync.

RalfJung commented 2 years ago

Ah, that explains my confusion. :) Thanks!

joboet commented 2 years ago

There is a slight issue with os_unfair_lock: if it is locked recursively, it raises SIGTRAP. This is allowed by the API and does not result in unsoundness, but it is really unidiomatic and hard to debug. While it is possible to guard against this (e.g. by installing a deadlocking signal handler and resetting it if the lock was acquired), that could negate the performance gain over the current mutex.

Edit: actually, it calls __builtin_trap (like core::intrinsics::abort), which does not have any specified behaviour. However, mach_thread_self() can be used to obtain a non-null thread id, making it possible to guard against re-entrant locking with only two extra (relaxed) atomic operations and 4 extra bytes of storage. Incidentally, the same id is used in the lock itself, but Apple's documentation explicitly forbids relying on the implementation of the lock.

m-ou-se commented 2 years ago

For Darwin, I think it's quite unlikely we could use __ulock_{wait,wake} for a futex API directly without risking... many problems. [..]

That said, there's... a pretty cursed option for a workaround on these targets. Too cursed to use? Not sure... possibly.

That is: libc++'s implementation of std::atomic<u32>::wait, notify_one, and notify_all will ultimately bottom out to calls to __ulock_wait/__ulock_wake, which... I guess they get to use because libc++ gets shipped and updated with the OS on Darwin, fair enough. Anyway, in theory we could just call that somehow. And there's a sense in which this would be pretty nice, since it's a system-provided function that implements (something similar to) the API we want... But it's certainly not ideal for a few reasons: [..]

@thomcc I implemented that approach here: https://github.com/m-ou-se/atomic-wait/blob/main/src/macos.rs

Unfortunately, it doesn't have any timeout functionality. :(

kprotty commented 2 years ago

Could we implement futex for Darwin using a global hash table + pthread primitives, like parking_lot does?

thomcc commented 2 years ago

Unfortunately, it doesn't have any timeout functionality. :(

Ah, yeah, that's unfortunate... FWIW I did convince @complexspaces to ask Apple engineers about stabilizing __ulock* when they met with them in a Lab at the most recent WWDC. They were apparently positive on the idea, but no movement has happened since then, AFAIK.

Could we implement futex for Darwin using a global hash table + pthread primitives, like parking_lot does?

Might be fine on Darwin given that WebKit does it, but it may need different tuning than the existing impls, since spinning tends to do worse on their machines.

bjorn3 commented 2 years ago

A global hash table doesn't scale well to many threads.

kprotty commented 2 years ago

A global hash table doesn't scale well to many threads.

@bjorn3 This is how futex is implemented in most OSes. Scalability is addressed by having enough buckets so that collisions are rare, growing the buckets for the same reason, or using trees to make operations O(log n) in the number of threads in the process (which won't be many).

bjorn3 commented 2 years ago

Linux uses roundup_pow_of_two(256 * num_possible_cpus()) to calculate the hash table size. num_possible_cpus() is not something we can get in userspace, due to CPU hotplugging. In addition, Linux has a single roundup_pow_of_two(256 * num_possible_cpus())-sized table for the entire system, while doing this in user space would duplicate such a table for each process. Say you have a 64-core system: then you would get a 256 * 64 * 40 / 1024 = 640 kB table for every single process (every pthread_mutex_t is 40 bytes on Linux).

growing the buckets for the same reason

That requires blocking on all threads that currently have a mutex locked and having all contended mutexes block on the next lock until the hash table has been grown. Additionally, it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.

thomcc commented 2 years ago

FWIW, I don't think anybody is suggesting we switch to this on Linux. It would only be used as the fallback on platforms where there's no stable futex-like API. I think Darwin is the main case here, although there are probably a few tier2 platforms that would get it as well.

Linux uses roundup_pow_of_two(256 * num_possible_cpus()) to calculate the hash table size. num_possible_cpus() is not something we can get in userspace, due to CPU hotplugging

I don't think Darwin supports CPU hotplugging to increase the number of CPUs beyond what you boot with, so I believe this is sysctlbyname("hw.ncpus", ...) there at least (note that while hw.ncpus is technically deprecated, it's actually the one we want, since it's the boot-time value of hw.logicalcpu_max).
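For illustration, that lookup via the libc crate might look like this (Darwin-only sketch, minimal error handling):

```rust
// Read the boot-time CPU count ("hw.ncpus") with sysctlbyname.
fn ncpus() -> usize {
    let mut n: libc::c_int = 0;
    let mut len = std::mem::size_of::<libc::c_int>();
    let name = b"hw.ncpus\0";
    let ret = unsafe {
        libc::sysctlbyname(
            name.as_ptr().cast(),                // NUL-terminated sysctl name
            (&mut n as *mut libc::c_int).cast(), // output buffer
            &mut len,                            // in/out: buffer size
            std::ptr::null_mut(),                // no new value
            0,
        )
    };
    assert_eq!(ret, 0, "sysctlbyname failed");
    n as usize
}
```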

I agree about most of the other downsides -- it's not free. I do think it's worth considering though (although likely with different tuning than the Linux impl): even on Darwin, the benefits led WebKit (which is quite careful to avoid things that cause performance issues on Darwin platforms) to use it, despite the per-process overhead and invisibility to the OS scheduler.

Admittedly, they don't use it for 100% of their locks.

Additionally it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.

This is already the case on the platforms where this would be used :(.

kprotty commented 2 years ago

num_possible_cpus() is not something we can get in userspace, due to CPU hotplugging

@bjorn3 A userspace implementation doesn't have to copy Linux verbatim. The equivalent would be std::thread::available_parallelism(), but Linux's approach in general is still sub-par for userspace, as it requires dynamic heap allocation (the allocator may require the Mutex, so it's a circular dependency, as you note).

while doing this in user space would duplicate such a table for each process

Yes, the table doesn't have to be large and it's fine for it to be per-process. This is what Zig, Windows, parking_lot, Go, and OpenBSD do under the hood.

Additionally it makes allocation less predictable and may even deadlock if the allocator needs to acquire a lock.

Yes. I believe this was one of the main/current blockers for getting parking_lot into the stdlib.

What I suggest as a futex implementation isn't one that dynamically heap-allocates, but something more like Zig, NetBSD, or Go, where there's a static, fixed-size array of buckets, each bucket holding a self-balancing search tree of intrusive nodes containing a list of all threads waiting on the same address.

This avoids heap allocation (the nodes for each thread can live on their stack for the duration of the futex call) and handles scalability by making waiting/waking take at worst O(log n) lookup/deletion (where n = concurrent addresses being waited on that all hash-collide to the same bucket) and O(1) insertion/removal. Here's an example userspace implementation.
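A much-simplified sketch of that shape: a fixed array of const-constructible buckets hashed by address. std's Mutex/Condvar stand in for per-bucket pthread primitives, and the per-address intrusive tree is replaced by a simple re-check loop; this is an illustration of the idea, not the implementation referred to above:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};
use std::sync::{Condvar, Mutex};

const NUM_BUCKETS: usize = 64; // fixed at compile time: no heap allocation

struct Bucket {
    lock: Mutex<()>,
    cond: Condvar,
}

// const-constructible std locks keep the whole table in static memory.
const EMPTY: Bucket = Bucket { lock: Mutex::new(()), cond: Condvar::new() };
static BUCKETS: [Bucket; NUM_BUCKETS] = [EMPTY; NUM_BUCKETS];

fn bucket_for(addr: *const AtomicU32) -> &'static Bucket {
    // Hash purely by address (shift off alignment bits).
    &BUCKETS[(addr as usize >> 4) % NUM_BUCKETS]
}

/// Block while `*futex == expected`; spurious wakeups are allowed, like a real futex.
pub fn futex_wait(futex: &AtomicU32, expected: u32) {
    let bucket = bucket_for(futex);
    let mut guard = bucket.lock.lock().unwrap();
    // The bucket mutex orders this check against wakers, closing the race
    // between "value changed" and "went to sleep".
    while futex.load(Relaxed) == expected {
        guard = bucket.cond.wait(guard).unwrap();
    }
}

/// Wake all potential waiters on `futex`; callers change the value first.
pub fn futex_wake_all(futex: &AtomicU32) {
    let bucket = bucket_for(futex);
    drop(bucket.lock.lock().unwrap()); // synchronize with in-progress waits
    bucket.cond.notify_all();
}
```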

compilerintrinsics commented 2 years ago

Is it still possible to voice concerns, or is the implementation a done deal? I would like to see RwLock's read portion be recursive so that it is symmetric with Rust's references. Otherwise, more abstracted code might have to hand-roll its own checks to avoid deadlock, or change its design in an awkward manner.

Amanieu commented 2 years ago

This was an intentional design decision. Recursive read locks effectively require that readers always acquire the lock without blocking if there is already an existing reader. The problem is that this could lead to writer starvation, where readers are always overlapping and a writer never gets a chance to acquire a write lock. This is important because many RwLock workloads are read-heavy with relatively rare writes.
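To make the hazard concrete, here is a hypothetical program that is expected to hang on a writer-preferring implementation; whether it actually deadlocks is implementation-dependent, which is exactly the point being discussed:

```rust
use std::sync::RwLock;
use std::thread;
use std::time::Duration;

static LOCK: RwLock<i32> = RwLock::new(0);

fn main() {
    let r1 = LOCK.read().unwrap(); // first read lock on the main thread

    let writer = thread::spawn(|| {
        let mut w = LOCK.write().unwrap(); // blocks: a reader is active
        *w += 1;
    });

    // Give the writer time to start waiting on the lock.
    thread::sleep(Duration::from_millis(100));

    // Recursive read on the same thread: a writer-preferring lock queues this
    // read behind the waiting writer, which itself waits for `r1` -> deadlock.
    let r2 = LOCK.read().unwrap();

    println!("{} {}", *r1, *r2); // never reached if the lock is writer-preferring
    drop((r1, r2));
    writer.join().unwrap();
}
```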

RalfJung commented 2 years ago

Oh, that is quite surprising. It seems pretty hard to ensure in a codebase that there is no reader already active on the current thread -- that kind of reasoning is not usually required in Rust, and is hard to do in a compositional/modular way without compiler support. (Everybody knows one needs to be aware of that when writers are involved, but since writers are usually rare, that is a lot easier to do.) It is quite common to explain RwLock as basically a thread-safe RefCell, so subtle extra panic conditions like not allowing recursive readers are footguns waiting to go off.

I worry this might lead to hard to diagnose random panics for RwLock users. This is even more the case if recursive reading only panics on some but not all platforms, or only panics when there is a writer waiting.

Isn't it possible to let in new readers from threads that already have a read lock, but block readers from other threads?

Amanieu commented 2 years ago

AFAIK it doesn't panic on recursive reads, it just deadlocks if there happens to be another writer waiting for the lock.

Isn't it possible to let in new readers from threads that already have a read lock, but block readers from other threads?

That makes the lock much more complicated, since it would need to track the thread IDs of all threads that own a read lock (of which there can be arbitrarily many).

RalfJung commented 2 years ago

AFAIK it doesn't panic on recursive reads, it just deadlocks if there happens to be another writer waiting for the lock.

That seems worse than a panic.^^ And the "if there is a writer waiting" makes it near impossible to test for this scenario.

From what I could see, the documentation only mentions the possibility of panics on recursive reads, not deadlocks.

That makes the lock much more complicated since it would need to track the thread IDs for all thread that own a read lock (of which there can be arbitrarily many).

Yeah I know. But the current situation makes using the lock in large codebases much more complicated.

thomcc commented 2 years ago

I disagree that this kind of reasoning isn't usually required -- you already need to keep an eye on what locks are currently held on your thread in order to avoid deadlocks, otherwise it seems likely that you'll hit lock ordering issues.

Amanieu commented 2 years ago

In parking_lot I added a read_recursive method which explicitly allows recursive read locks, but then you lose the no-writer-starvation guarantee.

RalfJung commented 2 years ago

I disagree that this kind of reasoning isn't usually required -- you already need to keep an eye on what locks are currently held on your thread in order to avoid deadlocks, otherwise it seems likely that you'll hit lock ordering issues.

You need to look out for conflicts between writers and other lock users, sure. Same as with RefCell.

But not usually for mutual conflicts between all lock users. (For RwLocks.) This is a departure from RefCell.

kprotty commented 2 years ago

This is a departure from RefCell.

IMO, RwLock shouldn't be treated as a drop-in for RefCell. The latter only tracks state (readers/writers), while the former synchronizes state (by blocking the caller until the desired state is achievable). An atomic RefCell more closely matches the use of try_read()/try_write() methods on an RwLock instead.
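A tiny illustration of that distinction: while a read guard is held, try_write reports contention instead of blocking, which is the RefCell-like behaviour:

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(5);
    let _read = lock.read().unwrap();

    // A blocking `write()` here would wait (and, on one thread, deadlock);
    // `try_write()` simply reports that the lock is unavailable.
    assert!(lock.try_write().is_err());
}
```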

thomcc commented 2 years ago

You need to look out for conflicts between writers and other lock users, sure. Same as with RefCell.

No, you need to look out for every other lock currently held, in addition to all other uses of the same lock. It's a potential deadlock if you acquire A then B in one place, and B then A somewhere else. This can be true even for read acquires. It doesn't feel that different to me, but is wholly unlike anything you must consider for RefCell.