Closed: rozaliev closed this issue 4 years ago.
There was also some discussion around similar problems on the WG-net
gitter.
What exactly is the concern with making `Waker` `Send + Sync` and storing it in an `Arc`? Are you concerned about performance, or does this limit the way you'd design the rest of the system in some way?
> performance, or does this limit the way you'd design the rest of the system
Both. Some kinds of systems are not designed to support cross-thread awakening of a task, some of them don't even have threads, and some can't afford it because of latency restrictions. Having `Waker` be `Send + Sync` forces them to have synchronization that wasn't there before and is actually not required at all.
This is a nice example: http://seastar.io/shared-nothing/
Generally I think users should not pay for what they don't use.
I'm familiar with seastar's design, and it's totally possible to write an executor like that which locks a task or set of tasks to a particular thread. However, you're right that such a system would still incur some amount of synchronization overhead from having atomic reference counting instead of non-atomic reference counting.
When writing a `Future` implementation, you need to know that you can register a `Waker` with another object, possibly on a different thread. Many primitives for building multithreaded-compatible async futures and IO objects depend on this ability. It also makes it possible to run some synchronous work (such as file IO) on a threadpool and wake up a corresponding task on another thread.
If you need none of these things (no task or IO object will ever cross a thread boundary or interact with another task on a different thread), then I suppose we could enable this use case via runtime checks by getting rid of the `Clone`, `Send`, and `Sync` impls for `Waker` and instead offering `clone_local`, `clone_send`, and `as_send` methods which would give you a `Waker`, a `SyncWaker`, or a `&SyncWaker` respectively (returning an error or panicking if the underlying implementation is not threadsafe). Similarly, the `Executor` trait could also offer `spawn_local` methods in addition to the threadsafe ones. I think this would be an ergonomic annoyance, but if it allowed significant performance improvements it might be worth it. I'm doubtful, though, that the synchronization overhead of `Arc` is really a significant bottleneck for multithreaded async programs.
The biggest concern I have with this approach is that all libraries would still have to choose whether they used the threadsafe or the non-threadsafe versions of `Waker`s and `Executor`s. Choosing the non-threadsafe versions may allow for a small reduction in the number of uncontested atomic operations when running in single-threaded mode, but it makes it hard or impossible to integrate these libraries with multithreaded applications. I think it's better to slightly pessimize single-threaded / thread-locked applications than it is to create an ecosystem split that makes multithreaded applications impossible to build.
If you know of any benchmarks showing significant performance loss in seastar-style applications due to the overhead of atomic reference counting or synchronization in wakeup notifications, I'd love to dig in and read about why that is and what the issues are. Benchmarking I've done on my own applications hasn't shown this to be an issue. I believe @carllerche and @alexcrichton also did similar benchmarks when they decided to make `tokio::Handle`s `Send`able.
This is a great summary; I totally agree on the pros/cons you described. I'd add that the thing that troubles me most about `Waker` is not that it's an `Arc`, but that `.wake` has to be synchronized and then the executor has to be synchronized.
I'm not aware of any benchmarks in this area; I'd love to see them too. I will google around, and probably write a few myself, though proper benchmarks are hard. If you have any ideas on the approach, I'd be happy to hear them. A ring of tasks comes to mind.
@rozaliev
> the thing that troubles me most about Waker is not that it's Arc, but that .wake has to be synchronized and then the executor has to be synchronized
Yup, I agree. It means that you need to use something like a lock-free queue of tasks, or a mutex over a slab of tasks, rather than a plain old `Rc<RefCell<VecDeque<...>>>`. I think the perf difference is shadowed by other operations, but it's definitely real.
> Probably write a few myself, though proper benchmarks are hard.
That would be awesome! I'd love to see what you find out, especially if it pushes us in a different direction that allows us to get noticeable performance gains.
I know it's not as good as a benchmark inside a real-world computation, but here are some latencies for atomics measured in this paper. These all assume the memory is in cache of one of the processors. Fetch-and-add had the same latency as compare-and-swap.
| Atomic operation | Latency (3.4 GHz Haswell) | Latency (2.1 GHz Bulldozer) |
|---|---|---|
| uncontested fetch-add | 10 ns | 30 ns |
| contested fetch-add (shared L2) | 30 ns | 50 ns |
| contested fetch-add (shared L3) | - | 90 ns |
| contested fetch-add (different socket) | - | 150 ns |
One thing I love about Rust is that its fine-grained abstractions map very well to the fundamental constraints of the underlying problem space (i.e. concurrent programming on general-purpose processors). I expect most of these fundamental constraints to stay the same for a long time.
So far, I'm not convinced that an atomic operation being shadowed by other computations is fundamental to the problem space. That may be saying more about the types of problems people solve with such abstractions today as anything else.
Now to stir the pot a bit, let's take a possibly-controversial example from Seastar's homepage: processing packets at 10 GB/s. I'm going to use a 10 GB/s data rate after packet overhead, since I know Infiniband networks can handle this pretty easily.
| Packet size | Time to process |
|---|---|
| 100 kB | 9500 ns |
| 10 kB | 950 ns |
| 4096 B | 380 ns |
| 1024 B | 95 ns |
If we're handling 1024-byte packets at 10 GB/s, a single uncontested atomic operation could be 10-30% of our processing time, even in the best case. Of course, this might not be a reasonable thing to do with an async abstraction. But is there a fundamental reason why not?
The relative speeds of RAM, cache, CPU, and networks are always changing, and with them the types of problems that we solve. Plus, adding ergonomic async/await syntax to a high-performance language with zero-cost abstractions may bring out some unforeseen use cases!
It would be great if someone more knowledgeable than me could offer concrete examples of problems that might benefit from such an abstraction. I know in the world of dataflow computations, a heat diffusion simulation is a popular example. The linked code uses futures, but I don't know much about its performance characteristics other than that.
In reply to @cramertj, I don't believe any serious benchmarks were made. Not on my part at least.
I do know that with the current implementation, the `Send` aspect of the reactor's `Handle` adds ~25% overhead to things like hyper's benchmarks. That overhead can probably be reduced some, but there is no doubt that cross-thread synchronization is going to incur some cost.
That said, I plan on heavily leveraging thread locals to have fast paths for cases that result in work happening on the same thread, so in those cases, most synchronization operations can be avoided.
> That said, I plan on heavily leveraging thread locals to have fast paths for cases that result in work happening on the same thread, so in those cases, most synchronization operations can be avoided.
Small plug for the thread_local crate: https://docs.rs/thread_local/0.3.5/thread_local/
The regex crate uses it precisely because it optimizes for the same thread use case.
(Apologies if my suggestion is out in left field. I'm not following the broader context too closely.)
I did some experiments last weekend: https://github.com/rozaliev/futures-sync-bench
The benchmark includes a simple local task scheduler with a swappable wake queue. Two queues are implemented: one with `RefCell<VecDeque>` and the other using a crossbeam channel.
Atm there is only one test case implemented: a ring of tasks. We spawn a task; it spawns one more and yields; the next one does the same, until we reach the Nth task. Now we have N tasks: the Nth can wake the (N-1)th, and so on, until the root task is awoken; the root one then wakes the Nth. It's a ring. This test case is clearly synthetic and measures only the direct impact of synchronization.
These are some preliminary results I got on my MacBook, so they should be taken with a grain of salt. (Format: `chain_$(number_of_tasks)_$(iterations)`.)
```
test cases::ring::benches::crossbeam_unbound::chain_1000_1000 ... bench: 127,900,222 ns/iter (+/- 5,888,378)
test cases::ring::benches::crossbeam_unbound::chain_100_1 ... bench: 63,614 ns/iter (+/- 13,183)
test cases::ring::benches::crossbeam_unbound::chain_100_100 ... bench: 1,351,176 ns/iter (+/- 225,975)
test cases::ring::benches::crossbeam_unbound::chain_10_1 ... bench: 7,353 ns/iter (+/- 1,100)
test cases::ring::benches::crossbeam_unbound::chain_1_1 ... bench: 2,220 ns/iter (+/- 416)
test cases::ring::benches::crossbeam_unbound::chain_1_10 ... bench: 4,804 ns/iter (+/- 579)
test cases::ring::benches::crossbeam_unbound::chain_1_100 ... bench: 28,372 ns/iter (+/- 4,748)
test cases::ring::benches::tls_unsafe::chain_1000_1000 ... bench: 77,846,215 ns/iter (+/- 3,983,351)
test cases::ring::benches::tls_unsafe::chain_100_1 ... bench: 39,538 ns/iter (+/- 7,479)
test cases::ring::benches::tls_unsafe::chain_100_100 ... bench: 782,045 ns/iter (+/- 124,286)
test cases::ring::benches::tls_unsafe::chain_10_1 ... bench: 4,634 ns/iter (+/- 519)
test cases::ring::benches::tls_unsafe::chain_1_1 ... bench: 1,057 ns/iter (+/- 312)
test cases::ring::benches::tls_unsafe::chain_1_10 ... bench: 2,427 ns/iter (+/- 403)
test cases::ring::benches::tls_unsafe::chain_1_100 ... bench: 17,500 ns/iter (+/- 2,688)
```
To make something actionable out of these benchmarks, I'm going to continue working on this as time allows. Ideas, suggestions, code reviews, and PRs are very welcome.
I've been thinking a lot about this and talking with some other folks at Google (cc @ctiller) about how different async executors work and what requirements they have around thread-safety. While I still think that `Send` tasks are the correct default, and I think that libraries should default to using only `Send`able futures, there are most definitely some specialized use cases (such as Andromeda and certain parts of the GRPC runtime) that benefit greatly from having one or zero atomic operations on their fast path. They can accomplish this by pinning certain tasks to a specific thread (much like seastar).
My initial plan for how to support these use cases was to make the default waker in `task::Context` be a `LocalWaker` (non-`Send` and non-`Sync`), but have it support a `clone` operation which would give you a `Waker`. `task::Context` would also offer both `spawn`, for `Send` tasks, and `spawn_local` for non-`Send` tasks. This would allow futures to be `Send` or not-`Send` on a per-task basis, allowing for zero atomic operations in the case where you're in a hot loop over exclusively non-`Send` tasks that have been spawned on the same thread.
Unfortunately, actually implementing `spawn_local` on many executors is hard, and in many cases impossible without serious performance tradeoffs. In the ideal world where a local task has been awoken and is ready to run, it's simply a matter of pulling a task out of a `RefCell<VecDeque>`. However, once that queue is exhausted, the system returns to polling the cross-thread queue for IO notifications (usually an epoll-like abstraction). In order for a non-local `Waker` to wake back up the local task, there needs to be a way to tell the underlying system's `epoll`-like thing to wake up a specific thread, rather than just dispatching a notification to an arbitrary thread. As far as I can tell, there isn't a well-supported cross-platform way of doing this. On some versions of Linux, it's possible to implement this by using a separate epoll set per thread and waiting for non-local task notifications across all of them with `EPOLLEXCLUSIVE`. This would allow you to wake up specific epoll sets for thread-locked wakeups as well as wake up a single arbitrary thread for the more common case of running a non-thread-bound task. I wasn't able to find a similar feature or a performant substitute on Windows, Fuchsia, or older versions of Linux, so it's not something that could be supported in a cross-platform way today.
@aturon and I discussed a solution which would allow us to future-proof `task::Context` for these types of optimizations without forcing executors to support `spawn_local` today. I propose that we add the `LocalWaker` and `Waker` split, but not the `spawn_local` method. Currently, this means that everyone will have to use the `clone_send` method rather than `clone_local`, and all currently-existing tasks must be `Send`. However, eventually executors can add support for `spawn_local` via a special executor stored in TLS. Tasks spawned onto this executor can be non-`Send` and can use `LocalWaker`, allowing for zero atomic operations so long as the code knows that it's running on an executor that supplies these primitives.
What do y'all think? It's not the most satisfying solution in that it doesn't give us immediate access to non-`Send` tasks, but it does allow for someday using the `futures` and `async`/`await` abstractions with these types of specialized thread-bound tasks, and doesn't require any significant changes to existing executors.
@cramertj I think that proposal makes sense. Essentially you would add a new method `local_executor` or similar to `task::Context`?
~~One question I have is whether `await` would ever be able to work with non-default executors like this. I imagined that spawning a task manually on an executor would then cause all sub-tasks it spawned to spawn on that executor, since it would be the default in the context of that task. However, in this case, it's not possible to carry the information that a local executor is being used to `await`.~~

~~Admittedly, I haven't reviewed the async/await RFC lately, so it's possible I'm confused :)~~
edit: spawning is orthogonal to async/await... must have been tired.
Having `LocalWaker` seems like a nice middle ground; it opens the door to experimentation. We probably have some time for prototyping and future-proofing before async/await reaches stable.
Overview
Today's futures design seems to be focused more on M:N scheduling. It's indeed the most popular and generic use case, but I think we should also discuss future-proofing futures for shared-nothing and `no_std` use cases. Some examples:

To summarize: there's usually either only one thread or many, but in the latter case every one is pinned to a CPU core. No synchronization is allowed, including atomics. For `no_std`, task wakening is usually custom-built using platform-specific tools. And `std` systems wake tasks only on the current thread; cross-thread wakening is very specific to the higher-level system design. Do concurrency, leave parallelism to the user.

For implementation this means that none of the standard operations like spawning a task, polling, or wakening should use mutexes or atomics. There should also be no `Sync` and `Send` bounds.
Today it feels very awkward to implement something like this with futures.
Details
`Wake`, for example, has to be inside an `Arc` and has to be `Send + Sync`. For an implementer this means one of these options:

- make `Wake::wake` a no-op dummy and use a different waking strategy

The `Executor` trait also requires every task to be `Send`, making every `Future` that wants to use a default executor incompatible with a shared-nothing system. As far as I can tell after a quick look, this PR seems to be dealing with similar issues.
Ideas?
There might already be a way to solve all these problems that I just don't know about; in that case we should just write some docs describing how to approach this problem.
If there is no clear way, then off the top of my head I'd say: make `Wake` thread-local, and add an `upgrade` method that will return `Option<SyncWake>`. For `Executor`, add `spawn_local`. This leaves the question, though: how should an intermediate `Future` know which API version it should use, local or sync?
What do you think?