rustasync / team

A point of coordination for all things Rust and Async
https://rustasync.github.io/team/
MIT License

Enabling shared-nothing executors #13

Closed rozaliev closed 4 years ago

rozaliev commented 6 years ago

Overview

Today's Futures design seems focused mostly on M:N scheduling. That is indeed the most popular and most generic use case, but I think we should also discuss future-proofing Futures for shared-nothing and no_std use cases.

Some examples:

To summarize: there is usually either a single thread or many threads, but in the latter case each one is pinned to a CPU core. No synchronization is allowed, not even atomics. In no_std environments, task waking is usually custom-built on platform-specific tools. std-based systems wake tasks only on the current thread; cross-thread waking is very specific to the higher-level system design. Do concurrency, leave parallelism to the user.

For the implementation this means that none of the standard operations (spawning a task, polling, waking) should use mutexes or atomics. There should also be no Sync or Send bounds.

Today it feels very awkward to implement something like this with futures.
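For concreteness, here is a minimal sketch of the kind of run queue a shared-nothing executor wants. All names are made up, and it deliberately ignores the real Future/Waker plumbing, modeling a task as a closure that may re-enqueue work:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::rc::Rc;

// Hypothetical names throughout; a "task" here is just a closure that
// receives the queue so it can spawn follow-up work.
type Task = Box<dyn FnOnce(&LocalQueue)>;

#[derive(Clone)]
struct LocalQueue {
    // Rc + RefCell is deliberately !Send and !Sync: the compiler itself
    // enforces the shared-nothing property, and no atomics are involved.
    tasks: Rc<RefCell<VecDeque<Task>>>,
}

impl LocalQueue {
    fn new() -> Self {
        LocalQueue { tasks: Rc::new(RefCell::new(VecDeque::new())) }
    }

    fn spawn(&self, task: Task) {
        self.tasks.borrow_mut().push_back(task);
    }

    // Run tasks until the queue drains; returns how many ran.
    fn run(&self) -> usize {
        let mut ran = 0;
        loop {
            // Pop while holding the borrow, then release it before running
            // the task so the task itself can spawn more work.
            let next = self.tasks.borrow_mut().pop_front();
            match next {
                Some(task) => { task(self); ran += 1; }
                None => return ran,
            }
        }
    }
}

fn main() {
    let q = LocalQueue::new();
    q.spawn(Box::new(|q| {
        q.spawn(Box::new(|_| println!("second")));
        println!("first");
    }));
    assert_eq!(q.run(), 2);
}
```

The point of the sketch is that the thread-safety requirements in today's Waker and Executor traits rule this shape out, even though nothing here needs synchronization.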

Details

Wake, for example, has to live inside an Arc and has to be Send + Sync. For the implementer this means one of these options:

The Executor trait also requires every task to be Send, making every Future that wants to use the default executor incompatible with a shared-nothing system.

As far as I can tell after a quick look, this PR seems to be dealing with similar issues.

Ideas?

There might already be a way to solve all these problems that I just don't know about; in that case we should just write some docs describing how to approach the problem.

If there is no clear way, then off the top of my head I'd say: make Wake thread-local and add an upgrade method that returns Option<SyncWake>. For Executor, add spawn_local.
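As a sketch of what that proposal could look like (all names are hypothetical, not taken from the futures API, and plain trait objects stand in for real wake vtables):

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::Arc;

// A thread-safe wake handle, as Waker works today.
trait SyncWake: Send + Sync {
    fn wake(&self);
}

// The proposed thread-local handle: waking needs no synchronization,
// and upgrading to a cross-thread handle is fallible.
trait LocalWake {
    fn wake_local(&self);
    /// Returns a thread-safe handle if this implementation supports
    /// cross-thread wakeups, or None for a shared-nothing executor.
    fn upgrade(&self) -> Option<Arc<dyn SyncWake>>;
}

// A shared-nothing waker: wakes by flipping a non-atomic flag and
// refuses to upgrade.
struct FlagWaker(Rc<Cell<bool>>);

impl LocalWake for FlagWaker {
    fn wake_local(&self) {
        self.0.set(true);
    }
    fn upgrade(&self) -> Option<Arc<dyn SyncWake>> {
        None // no cross-thread wakeups in a shared-nothing system
    }
}

fn main() {
    let flag = Rc::new(Cell::new(false));
    let waker = FlagWaker(flag.clone());
    waker.wake_local();
    assert!(flag.get());
    assert!(waker.upgrade().is_none());
}
```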

This leaves the question, though, of how an intermediate Future should know which API variant to use: local or sync.

What do you think?

jonhoo commented 6 years ago

There was also some discussion around similar problems on the WG-net gitter.

cramertj commented 6 years ago

What exactly is the concern with making Waker Send + Sync and storing it in an Arc? Are you concerned about performance, or does this limit the way you'd design the rest of the system in some way?

rozaliev commented 6 years ago

performance, or does this limit the way you'd design the rest of the system

Both. Some kinds of systems are not designed to support cross-thread waking of a task, some of them don't even have threads, and some can't afford it because of latency restrictions. Making Waker Send + Sync forces them to add synchronization that wasn't there before and actually isn't required at all.

This is a nice example: http://seastar.io/shared-nothing/

Generally I think users should not pay for what they don't use.

cramertj commented 6 years ago

I'm familiar with seastar's design, and it's totally possible to write an executor like that, one that locks a task or set of tasks to a particular thread. However, you're right that such a system would still incur some amount of synchronization overhead from having atomic reference counting instead of non-atomic reference counting.

When writing a Future implementation, you need to know that you can register a Waker with another object, possibly on a different thread. Many primitives for building multithreaded-compatible async futures and IO objects depend on this ability. It also makes it possible to run some synchronous work (such as file IO) on a threadpool and wake up a corresponding task on another thread.

If you need none of these things (no task or IO object will ever cross a thread boundary or interact with another task on a different thread), then I suppose we could enable this use case via runtime checks by getting rid of the Clone, Send, and Sync impls for Waker and instead offering clone_local, clone_send, and as_send methods, which would give you a Waker, a SyncWaker, or a &SyncWaker respectively (returning an error or panicking if the underlying implementation is not thread-safe). Similarly, the Executor trait could offer spawn_local methods in addition to the threadsafe ones. I think this would be an ergonomic annoyance, but if it allowed significant performance improvements it might be worth it. I'm doubtful, though, that the synchronization overhead of Arc is really a significant bottleneck for multithreaded async programs.
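A rough model of that runtime-checked design might look like the following. Every name here is hypothetical, and plain closures stand in for real wake implementations:

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::sync::Arc;

// One waker type, two possible backings: a single-threaded callback or
// a thread-safe one. Conversions to the thread-safe form can fail.
#[derive(Clone)]
enum WakerRepr {
    Local(Rc<dyn Fn()>),                  // !Send, no atomics
    Shared(Arc<dyn Fn() + Send + Sync>),  // threadsafe
}

struct Waker(WakerRepr);
struct SyncWaker(Arc<dyn Fn() + Send + Sync>);

impl SyncWaker {
    fn wake(&self) { (self.0)() }
}

impl Waker {
    fn wake(&self) {
        match &self.0 {
            WakerRepr::Local(f) => f(),
            WakerRepr::Shared(f) => f(),
        }
    }
    fn clone_local(&self) -> Waker {
        Waker(self.0.clone())
    }
    /// The runtime check: fails if the backing is not thread-safe.
    fn clone_send(&self) -> Result<SyncWaker, ()> {
        match &self.0 {
            WakerRepr::Local(_) => Err(()),
            WakerRepr::Shared(f) => Ok(SyncWaker(f.clone())),
        }
    }
}

fn main() {
    let hits = Rc::new(Cell::new(0));
    let h = hits.clone();
    let local = Waker(WakerRepr::Local(Rc::new(move || h.set(h.get() + 1))));
    local.wake();
    local.clone_local().wake();
    assert_eq!(hits.get(), 2);
    // A local-only waker cannot cross threads; the check fires at runtime.
    assert!(local.clone_send().is_err());
}
```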

The biggest concern I have with this approach is that all libraries would still have to choose whether or not they used the threadsafe or the non-thread-safe versions of Wakers and Executors. Choosing to use the non-thread-safe versions may allow for a small reduction in the number of uncontested atomic operations when running in single-threaded mode, but it makes it hard or impossible to integrate these libraries with multithreaded applications. I think it's better to slightly pessimize single-threaded / thread-locked applications than it is to create an ecosystem split that makes multithreaded applications impossible to build.

If you know of any benchmarks showing significant performance loss in seastar-style applications due to the overhead of atomic reference counting or synchronization in wakeup notifications, I'd love to dig in and read about why that is and what the issues are. Benchmarking I've done on my own applications hasn't shown this to be an issue. I believe @carllerche and @alexcrichton also did similar benchmarks when they decided to make tokio::Handles Sendable.

rozaliev commented 6 years ago

This is a great summary; I totally agree on the pros/cons you described. I'd add that the thing that troubles me most about Waker is not that it's in an Arc, but that .wake has to be synchronized, and then the executor has to be synchronized.

I'm not aware of any benchmarks in this area; I'd love to see them too. I will google around, and probably write a few myself, though proper benchmarks are hard. If you have any ideas on the approach, I'd be happy to hear them. A ring of tasks comes to mind.

cramertj commented 6 years ago

@rozaliev

the thing that troubles me most about Waker is not that it's in an Arc, but that .wake has to be synchronized, and then the executor has to be synchronized

Yup, I agree. It means that you need to use something like a lock-free queue of tasks, or a mutex over a slab of tasks, rather than a plain old Rc<RefCell<VecDeque<...>>>. I think the perf difference is shadowed by other operations, but it's definitely real.

Probably write a few myself, though proper benchmarks are hard.

That would be awesome! I'd love to see what you find out, especially if it pushes us in a different direction that allows us to get noticeable performance gains.

tmandry commented 6 years ago

I know it's not as good as a benchmark inside a real-world computation, but here are some latencies for atomics measured in this paper. These all assume the memory is in cache of one of the processors. Fetch-and-add had the same latency as compare-and-swap.

| Atomic operation | Latency (3.4 GHz Haswell) | Latency (2.1 GHz Bulldozer) |
|---|---|---|
| uncontested fetch-add | 10 ns | 30 ns |
| contested fetch-add (shared L2) | 30 ns | 50 ns |
| contested fetch-add (shared L3) | - | 90 ns |
| contested fetch-add (different socket) | - | 150 ns |
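For anyone who wants to eyeball the uncontested case on their own machine, a crude loop like this gives a rough per-operation cost. It is not a rigorous benchmark: the optimizer can collapse the plain-add loop in release builds, and results vary by CPU.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;

    // Time N uncontested atomic fetch_adds.
    let atomic = AtomicU64::new(0);
    let t = Instant::now();
    for _ in 0..N {
        atomic.fetch_add(1, Ordering::SeqCst);
    }
    let atomic_ns = t.elapsed().as_nanos() as f64 / N as f64;

    // Time N plain adds for comparison. Note: in release mode the
    // compiler may fold this loop away entirely; std::hint::black_box
    // would make the comparison fairer.
    let mut plain = 0u64;
    let t = Instant::now();
    for _ in 0..N {
        plain = plain.wrapping_add(1);
    }
    let plain_ns = t.elapsed().as_nanos() as f64 / N as f64;

    assert_eq!(atomic.load(Ordering::SeqCst), N);
    assert_eq!(plain, N);
    println!("uncontested fetch_add: ~{atomic_ns:.1} ns/op, plain add: ~{plain_ns:.1} ns/op");
}
```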

One thing I love about Rust is that its fine-grained abstractions map very well to the fundamental constraints of the underlying problem space (i.e. concurrent programming on general-purpose processors). I expect most of these fundamental constraints to stay the same for a long time.

So far, I'm not convinced that an atomic operation being shadowed by other computations is fundamental to the problem space. That may say more about the types of problems people solve with such abstractions today than anything else.

Now, to stir the pot a bit, let's take a possibly controversial example from Seastar's homepage: processing packets at 10 GB/s. I'm going to use a 10 GB/s data rate after packet overhead, since I know InfiniBand networks can handle this pretty easily.

| Packet size | Time to process |
|---|---|
| 100kb | 9500 ns |
| 10kb | 950 ns |
| 4096b | 380 ns |
| 1024b | 95 ns |

If we're handling 1024-byte packets at 10 GB/s, a single uncontested atomic operation could be 10-30% of our processing time, even in the best case. Of course, this might not be a reasonable thing to do with an async abstraction. But is there a fundamental reason why not?
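As a back-of-envelope check of those numbers (using the raw 10 GB/s rate with no overhead adjustment, which lands in the same ballpark as the table's 95 ns figure):

```rust
// Nanoseconds of budget per packet at a given data rate, assuming the
// packet size is in bytes and the rate is in gigabytes per second.
fn ns_per_packet(packet_bytes: f64, gigabytes_per_sec: f64) -> f64 {
    packet_bytes / (gigabytes_per_sec * 1e9) * 1e9
}

fn main() {
    // 1024-byte packets leave roughly 102 ns each, so a 10-30 ns atomic
    // op really is a sizable fraction of the per-packet budget.
    let t = ns_per_packet(1024.0, 10.0);
    assert!((t - 102.4).abs() < 0.1);
    println!("1024 B @ 10 GB/s = {t:.1} ns per packet");
}
```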

The relative speeds of RAM, cache, CPU, and networks are always changing, and with them the types of problems that we solve. Plus, adding ergonomic async/await syntax to a high-performance language with zero-cost abstractions may bring out some unforeseen use cases!

It would be great if someone more knowledgeable than me could offer concrete examples of problems that might benefit from such an abstraction. I know in the world of dataflow computations, a heat diffusion simulation is a popular example. The linked code uses futures, but I don't know much about its performance characteristics other than that.

carllerche commented 6 years ago

In reply to @cramertj, I don't believe any serious benchmarks were made. Not on my part at least.

I do know that with the current implementation, the Send aspect of the reactor's Handle adds ~25% overhead to things like hyper's benchmarks. That overhead can probably be reduced some, but there is no doubt that cross-thread synchronization is going to incur some cost.

That said, I plan on heavily leveraging thread locals to provide fast paths for cases where work stays on the same thread, so in those cases most synchronization operations can be avoided.

BurntSushi commented 6 years ago

That said, I plan on heavily leveraging thread locals to provide fast paths for cases where work stays on the same thread, so in those cases most synchronization operations can be avoided.

Small plug for the thread_local crate: https://docs.rs/thread_local/0.3.5/thread_local/

The regex crate uses it precisely because it optimizes for the same thread use case.
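For contrast, std's built-in thread_local! macro (per-thread statics, as opposed to the crate's per-object storage) shows the same atomics-free fast path on access:

```rust
use std::cell::Cell;

// Each thread sees its own independent copy of this counter; reads and
// writes go through Cell, with no atomics on the access path.
thread_local! {
    static WAKE_COUNT: Cell<u64> = Cell::new(0);
}

fn record_wake() {
    WAKE_COUNT.with(|c| c.set(c.get() + 1));
}

fn main() {
    record_wake();
    record_wake();
    assert_eq!(WAKE_COUNT.with(|c| c.get()), 2);

    // A new thread gets its own zeroed copy, untouched by the main thread.
    std::thread::spawn(|| {
        assert_eq!(WAKE_COUNT.with(|c| c.get()), 0);
    })
    .join()
    .unwrap();
}
```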

(Apologies if my suggestion is out in left field. I'm not following the broader context too closely.)

rozaliev commented 6 years ago

I did some experiments last weekend: https://github.com/rozaliev/futures-sync-bench

The benchmark includes a simple local task scheduler with a swappable wake queue. Two queues are implemented: one uses RefCell<VecDeque>, the other uses a crossbeam channel.

At the moment only one test case is implemented: a ring of tasks. We spawn a task; it spawns one more and yields; the next one does the same, until we reach the Nth task. Now we have N tasks: the Nth can wake the (N-1)th, and so on, until the root task is woken; the root then wakes the Nth. It's a ring. This test case is clearly synthetic and measures only the direct impact of synchronization.
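Stripped of the futures machinery, the ring's wakeup pattern can be modeled with a plain queue of task ids. This is a toy model of the test case, not the benchmark code itself:

```rust
use std::collections::VecDeque;

// Task k "wakes" task k-1 by pushing its id onto the wake queue, and
// task 0 wakes task n-1, closing the ring. One lap counts n wakeups.
fn run_ring(n: usize, laps: usize) -> usize {
    let mut queue: VecDeque<usize> = VecDeque::new();
    queue.push_back(n - 1); // the Nth task fires first
    let mut wakeups = 0;
    while wakeups < n * laps {
        let task = queue.pop_front().expect("ring never empties");
        wakeups += 1;
        let next = if task == 0 { n - 1 } else { task - 1 };
        queue.push_back(next);
    }
    wakeups
}

fn main() {
    assert_eq!(run_ring(100, 1), 100);
    assert_eq!(run_ring(10, 5), 50);
}
```

In the real benchmark the interesting variable is what the queue is made of (RefCell<VecDeque> vs. a synchronized channel), which is exactly the cost being measured.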

These are some preliminary results I got on my MacBook, so they should be taken with a grain of salt. (Format: chain_$(number_of_tasks)_$(iterations).)

test cases::ring::benches::crossbeam_unbound::chain_1000_1000 ... bench: 127,900,222 ns/iter (+/- 5,888,378)
test cases::ring::benches::crossbeam_unbound::chain_100_1     ... bench:      63,614 ns/iter (+/- 13,183)
test cases::ring::benches::crossbeam_unbound::chain_100_100   ... bench:   1,351,176 ns/iter (+/- 225,975)
test cases::ring::benches::crossbeam_unbound::chain_10_1      ... bench:       7,353 ns/iter (+/- 1,100)
test cases::ring::benches::crossbeam_unbound::chain_1_1       ... bench:       2,220 ns/iter (+/- 416)
test cases::ring::benches::crossbeam_unbound::chain_1_10      ... bench:       4,804 ns/iter (+/- 579)
test cases::ring::benches::crossbeam_unbound::chain_1_100     ... bench:      28,372 ns/iter (+/- 4,748)
test cases::ring::benches::tls_unsafe::chain_1000_1000        ... bench:  77,846,215 ns/iter (+/- 3,983,351)
test cases::ring::benches::tls_unsafe::chain_100_1            ... bench:      39,538 ns/iter (+/- 7,479)
test cases::ring::benches::tls_unsafe::chain_100_100          ... bench:     782,045 ns/iter (+/- 124,286)
test cases::ring::benches::tls_unsafe::chain_10_1             ... bench:       4,634 ns/iter (+/- 519)
test cases::ring::benches::tls_unsafe::chain_1_1              ... bench:       1,057 ns/iter (+/- 312)
test cases::ring::benches::tls_unsafe::chain_1_10             ... bench:       2,427 ns/iter (+/- 403)
test cases::ring::benches::tls_unsafe::chain_1_100            ... bench:      17,500 ns/iter (+/- 2,688)

To make something actionable out of benchmarks:

I'm going to continue working on this as time allows. Ideas, suggestions, code reviews and PRs are very welcome.

cramertj commented 6 years ago

I've been thinking a lot about this and talking with some other folks at Google (cc @ctiller) about how different async executors work and what requirements they have around thread-safety. While I still think that Send tasks are the correct default, and I think that libraries should default to using only Sendable futures, there are most definitely some specialized use cases (such as Andromeda and certain parts of the GRPC runtime) that benefit greatly from having one or zero atomic operations on their fast-path. They can accomplish this by pinning certain tasks to a specific thread (much like seastar).

My initial plan for how to support these use cases was to make the default Waker in task::Context be a LocalWaker (non-Send and non-Sync), but have it support a clone operation which would give you a Waker. task::Context would also offer both spawn, for Send tasks, and spawn_local for non-Send tasks. This would allow futures to be Send or not-Send on a per-task basis, allowing for zero atomic operations in the case where you're in a hot loop over exclusively non-Send tasks that have been spawned on the same thread.

Unfortunately, actually implementing spawn_local on many executors is hard, and in many cases impossible without serious performance tradeoffs. In the ideal world where a local task has been awoken and is ready to run, it's simply a matter of pulling a task out of a RefCell<VecDeque>. However, once that queue is exhausted, the system returns to polling the cross-thread queue for IO notifications (usually an epoll-like abstraction). In order for a non-local Waker to wake the local task back up, there needs to be a way to tell the underlying system's epoll-like thing to wake up a specific thread, rather than just dispatching a notification to an arbitrary thread. As far as I can tell, there isn't a well-supported cross-platform way of doing this. On some versions of Linux, it's possible to implement this by using a separate epoll set per thread and waiting for non-local task notifications across all of them with EPOLLEXCLUSIVE. This would allow you to wake up specific epoll sets for thread-locked wakeups as well as wake up a single arbitrary thread for the more common case of running a non-thread-bound task. I wasn't able to find a similar feature or a performant substitute on Windows, Fuchsia, or older versions of Linux, so it's not something that could be supported in a cross-platform way today.

@aturon and I discussed a solution which would allow us to future-proof task::Context for these types of optimizations without forcing executors to support spawn_local today. I propose that we add the LocalWaker and Waker split, but not the spawn_local method. Currently, this means that everyone will have to use the clone_send method rather than clone_local, and all currently-existing tasks must be Send. However, eventually executors can add support for spawn_local via a special executor stored in TLS. Tasks spawned to this executor can be non-Send and can use LocalWaker, allowing for zero atomic operations so long as the code knows that it's running on an executor that supplies these primitives.
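A sketch of the "special executor stored in TLS" idea, with entirely hypothetical names: spawn_local fails on threads where no local executor has been installed, while the threadsafe Executor trait is left untouched.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Stand-in for a !Send future.
type LocalTask = Box<dyn FnOnce()>;

// The local executor lives in thread-local storage; None means this
// thread provides no non-Send spawning.
thread_local! {
    static LOCAL_EXECUTOR: RefCell<Option<VecDeque<LocalTask>>> = RefCell::new(None);
}

fn install_local_executor() {
    LOCAL_EXECUTOR.with(|e| *e.borrow_mut() = Some(VecDeque::new()));
}

// Returns the task back to the caller if no local executor is installed,
// so the caller can fall back to a threadsafe spawn or report an error.
fn spawn_local(task: LocalTask) -> Result<(), LocalTask> {
    LOCAL_EXECUTOR.with(|e| match e.borrow_mut().as_mut() {
        Some(queue) => { queue.push_back(task); Ok(()) }
        None => Err(task),
    })
}

fn run_local() -> usize {
    let mut ran = 0;
    loop {
        // Release the TLS borrow before running the task, so the task can
        // itself call spawn_local.
        let next = LOCAL_EXECUTOR.with(|e| e.borrow_mut().as_mut().and_then(|q| q.pop_front()));
        match next {
            Some(task) => { task(); ran += 1; }
            None => return ran,
        }
    }
}

fn main() {
    assert!(spawn_local(Box::new(|| {})).is_err()); // nothing installed yet
    install_local_executor();
    let spawned = spawn_local(Box::new(|| {
        let _ = spawn_local(Box::new(|| {})); // local tasks can spawn more
    }));
    assert!(spawned.is_ok());
    assert_eq!(run_local(), 2);
}
```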

What do y'all think? It's not the most satisfying solution in that it doesn't give us immediate access to non-Send tasks, but it does allow for someday using the futures and async/await abstractions with these types of specialized thread-bound tasks, and doesn't require any significant changes to existing executors.

tmandry commented 6 years ago

@cramertj I think that proposal makes sense. Essentially you would add a new method local_executor or similar to task::Context?

~~One question I have is whether await would ever be able to work with non-default executors like this. I imagined that spawning a task manually on an executor would then cause all sub-tasks it spawned to spawn on that executor, since it would be the default in the context of that task. However, in this case, it's not possible to carry the information that a local executor is being used to await.~~

~~Admittedly, I haven't reviewed the async/await RFC lately so it's possible I'm confused :)~~

edit: spawning is orthogonal to async/await... must have been tired.

rozaliev commented 6 years ago

Having LocalWaker seems like a nice middle ground; it opens the door to experimentation. We probably have some time for prototyping and future-proofing before async/await reaches stable.