rust-lang / wg-async

Working group dedicated to improving the foundations of Async I/O in Rust
https://rust-lang.github.io/wg-async/
Apache License 2.0

per-thread executors #87

Open nikomatsakis opened 3 years ago

nikomatsakis commented 3 years ago

Brief summary

I've heard a number of people cite per-thread executors as a good model for high efficiency. There are issues, though, like Context implementing Send+Sync. Are people doing this? Does it work? :)

robjtede commented 3 years ago

Hey, @actix lead here.

The runtime strategy used in Actix Web fits this description. From the actix-rt docs:

In most parts of the Actix ecosystem, it has been chosen to use !Send futures. For this reason, a single-threaded runtime is appropriate since it is guaranteed that futures will not be moved between threads. This can result in small performance improvements over cases where atomics would otherwise be needed.

To achieve similar performance to multi-threaded, work-stealing runtimes, applications using actix-rt will create multiple, mostly disconnected, single-threaded runtimes. This approach has good performance characteristics for workloads where the majority of tasks have similar runtime expense.

In short, we spin up worker threads and start a single threaded Tokio runtime for each. Then spawn the server future (and therefore most user code) onto a LocalSet where there is no Send bound, unlike tokio::spawn.

This works well as our strategy (and worked especially in the early days when Tokio's multi-threaded story was not as mature as it is now).

The disadvantage is that idle threads will not steal work from very busy, stuck or otherwise backlogged threads. Tasks that are disproportionately expensive should be offloaded to the blocking task thread-pool using task::spawn_blocking.

Since work is not stolen, it is even more important not to block the thread inside request handlers than it is in hyper or warp. Thankfully, the solution is a blocking task thread pool, which Tokio also provides. It could be argued that increasing the pressure to use it for expensive, blocking work actually produces more carefully considered, explicitly delimited code, rather than inadvertently using work stealing as a crutch.

Anecdotally, the benefits of supporting !Send futures go beyond the theoretical performance increase from reducing atomic synchronisation and into developer experience, too, since no concerns about handler Send-ness arise. In general, atomics and synchronisation are opt-in, and we believe this is a good approach.

A long term goal of the project may be to support a strictly single threaded mode; in which case Send futures are useless.

It should be noted that this strategy extends beyond web. The original Actix actor framework is still maintained, uses some of the same runtime primitives, and is a general solution to concurrent task problems.

nikomatsakis commented 3 years ago

Thanks!

Kobzol commented 3 years ago

So far, I have mostly been using one single-threaded executor/runtime per core, for example in a traffic simulation that modelled a large number of individual cars (as async processes), or in a distributed task system that handled scheduling of tasks, data movement, worker registration, etc. all on a single thread. For me, async is first a way to elegantly cooperate between multiple independent "sub-processes" (let's say actors) inside a single process, and only second a way to get potentially more performance.

That being said, I feel that having multiple single-threaded executors can sometimes even improve performance, but that's not the main reason why I think this use case should be supported on a first-class basis. If I don't need multi-threaded work stealing (and so far I have almost never needed it), having a multi-threaded runtime is just a nuisance because of the Send requirements. Sure, I could wrap all of my data structures in an Arc<Mutex> or something like that, but why would I do that if I know that I'm operating on a single thread? In my opinion, Send requirements should be opt-in, as that significantly improves developer ergonomics.

Thanks for the async design work btw! It's great to open up a dialogue about these things.

Matthias247 commented 3 years ago

Are people doing this? Does it work? :)

Kind of every single successful async framework since the inception of eventloops :-)

libevent, libev, libuv, boost asio, GTK, QT, seastar, nginx, javascript + node.js, dpdk, netty, grizzly, dart, etc.

Besides Rust, the main frameworks that tried to move tasks between executors are Go and C#'s ThreadPool executor (although I think the ASP.NET default executor might have fixed threads).

Therefore the state of the world is actually that Rust would need to prove its approach of defaulting to thread-safe is a viable alternative, rather than questioning the effectiveness of single-threaded event loops. Their effectiveness in reducing context switches and guaranteeing good cache hit rates is more or less what drove people towards the model in the first place, despite the ergonomic challenges of using callbacks.

glommer commented 3 years ago

Hi

I am the author of glommio, a thread-per-core runtime for rust that recently got some attention.

Prior to rust I was involved in the Scylla C++ database, which also uses a thread-per-core model.

Glommio is now being used at my employer to hopefully move some older systems from languages like Go to Rust, and it's been working quite well. My main complaint at the moment is that even crates like hyper, which technically allow you to replace the executor, will at some point assume that the executor is Send.

The thinking behind a thread-per-core executor is that atomic operations are expensive; if you can get rid of them altogether, you win when running on large instances where contention can easily become an issue. See for instance the following third-party comparison: https://github.com/fujita/greeter

pickfire commented 3 years ago

@uazu's Stakker (https://github.com/uazu/stakker) is also per-thread, which is similar to Actix.

uazu commented 3 years ago

Yes, Stakker is a runtime per thread. I found that I had to implement my own wake-up mechanism instead of using Waker through Context, to avoid synchronization costs for same-thread wakeups. I can see the point of having access to a cross-thread Waker, though, so that futures that need to offload work can start threads or send inter-thread messages and then pass data back -- just that that wouldn't be the most common case. I'd expect most wakeups to be local as async/await code interacts with actor code.

I didn't consider that Context is Send+Sync. Does it really need to be? That could be an obstacle to adding things later on.

stephanemagnenat commented 3 years ago

I'm not sure whether this is the right place for this comment, but another use I see (and have) for per-thread executors is in games. For example, one would like to be able to define tutorials and maybe even unit behaviours (e.g. in an RTS) using async syntax, but one might not want multi-threading there (for the sake of predictable behaviour). Also, if the game targets WASM, one might not be able to use multi-threading without significant hacks (at least for now). If the game code has to run on servers (for network games), one might want multiple independent instances of async executors with no global variables associated with them. I do not know how large the performance gain would be if such use cases could be written with explicit single-threaded constraints, but it would certainly make them easier to write and possibly reduce the amount of user-written unsafe needed.

pythonesque commented 3 years ago

Removing Send+Sync would considerably improve the pollster implementation. Currently it needs to allocate an Arc and use thread-safe operations to update its readiness state (maybe it doesn't need the full Mutex + Condvar it has now, but certainly something more expensive than the Cell it should need), and if it were used improperly from other threads it would actually be an error. I believe this would be completely fixed by removing the Send+Sync requirement--no Arc, and could use Cell for its interior mutable state.
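
For reference, a minimal pollster-style block_on in std-only Rust looks roughly like this (a sketch, not pollster's actual code); the Arc plus Mutex/Condvar is the thread-safe machinery the current Send+Sync contract forces, even when only one thread ever polls. With a !Send+!Sync Context, the readiness flag could be a plain Cell:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Condvar, Mutex};
use std::task::{Context, Poll, Wake, Waker};

// Thread-safe readiness flag: Arc'd and guarded by a Mutex/Condvar
// because Waker's contract requires Send + Sync.
struct Signal {
    ready: Mutex<bool>,
    cond: Condvar,
}

impl Wake for Signal {
    fn wake(self: Arc<Self>) {
        *self.ready.lock().unwrap() = true;
        self.cond.notify_one();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let signal = Arc::new(Signal {
        ready: Mutex::new(false),
        cond: Condvar::new(),
    });
    let waker = Waker::from(signal.clone());
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        // Park until woken. With a !Send + !Sync Context this whole
        // dance could be a Cell<bool> check on the current thread.
        let mut ready = signal.ready.lock().unwrap();
        while !*ready {
            ready = signal.cond.wait(ready).unwrap();
        }
        *ready = false;
    }
}

fn main() {
    println!("{}", block_on(async { 21 * 2 })); // 42
}
```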

I understand that removing Send is more controversial than removing Sync; however, I'm pretty sure all the arguments for having Sync but not Send don't work currently, due to Waker implementing Clone and requiring in its contract that wakeup wake the original task (BTW, can someone explain to me why the lack of a lifetime on Waker combined with the existence of Clone don't make any lifetime-bounded Waker data invalid as well and implicitly add a 'static lifetime?). So this basically means any interior mutable state used for wakeup has to be shared when Clone is called, and if that state is !Sync then sharing it is !Send, hence the RawWaker would already be !Send in any case people care about.

Without the Clone instance, or with a looser requirement on wakeup (e.g. it might fail to wake up any task, but if it does wake up something it must be the original task), I think none of this would be true. Does anyone actually rely on all of the "must always wake up the original task" behavior, the Clone instance, and expect it to work for a generic Waker, and do all of this for safety rather than liveness? It seems pretty infeasible to me that this would be the case, since what waking up a task does is mostly unspecified for an arbitrary Waker.

Also, isn't the description of wake combined with Clone already strong enough that most Waker implementations don't actually satisfy it? It's certainly not true that if you clone a waker and call wake() on it, the original task's resources necessarily get dropped, but that's what seems to be implied by these two lines:

For clone:

This function will be called when wake is called on the Waker. It must wake up the task associated with this RawWaker.

For wake:

The implementation of this function must make sure to release any resources that are associated with this instance of a RawWaker and associated task.

A similar issue affects drop.

My basic thoughts here are: while changing some of this language is probably a good idea (safety requirements should be as conservative as possible until proven to actually provide optimizations) and would technically "fix" the problem in some cases, especially in combination with !Sync, it would still not really solve the spirit of the problem. Arc could become Box<Cell<_>>, but we (1) still have an extra allocation, and (2) it's there to make sure Clone stays safe and the thing is still sendable; yet this not only "solves" a case we don't want to happen (notifying from another thread for a single-threaded executor), it also makes cloning the waker not even work properly on the original thread! This hardly seems like the desired outcome to me: what we want is &Cell<Flag> or something like that, or maybe Rc<Flag> if the static lifetime is truly non-negotiable.

(As an aside, could adding a second lifetime to the Context object--the lifetime of the Waker itself--together with a lifetime on Waker, and defaulting both to 'static, help here? I realize this would be a much more significant change since it would actually break type signatures, and a backwards-compatible change would require adding a lifetime to Future since explicit implementations would need to only work for Future<'static>, but it could maybe be a target for an edition upgrade if people felt strongly about it.)

(Removing Clone would be marginally less invasive in terms of types, but of course many places actually call clone directly on a Waker--presumably most/all of them are built together with an executor and know the task was spawned by the right one, but I don't know for sure).

Either way, IMO, the current implementation is just so specifically written to assume Arc that it is virtually impossible to satisfy if implemented in any other way, short of using some sort of sentinel value and doing the actual wakeup work in a thread local or static (which can be hard to do safely and efficiently, and introduces runtime panics for no real reason). I definitely understand why a Send Context can be desirable in some cases, and even more why Clone / 'static lifetime are, but I think in practice direct calls to poll with executors complex enough to move tasks around, clone wakers, etc., are closely tied to the particular executor, which should be able to wrap the Context in a Send newtype (and similarly could demand a Clone + 'static Waker, etc., if it came down to it...). And given that the current design is so inflexible that all the workarounds for these issues involve basically ignoring the Context object, and the fact that the Future trait is integrated into the compiler to the point that we can't just roll our own, I do think it's at least worth considering ways to solve them!

pythonesque commented 3 years ago

I'd add one more thing--I don't fully understand how the original LocalWaker proposal was expected to work, but would it have been necessary to basically implement a separate Future type for Send and !Send versions of the same async executor? That seems suboptimal...

It seems to me that the main problem with removing Send is that in essentially all cases, when you store a cloned Waker (regardless of whether you're familiar with the executor or not), you want your (necessarily hand-rolled) Future to be polymorphic in the Send-ness of the Waker you're called with. Unfortunately you have absolutely no idea about that because this information isn't carried by anything--Waker isn't polymorphic in an executor type (which is in a deep sense the root of most problems with Waker design) so there's nothing to carry that information. Moreover, there's no legal way to store something like a reference to a Waker in your type, not even with the helpful 'a lifetime the Context gives you (which could otherwise be pretty useful!), because poll is generic in that lifetime and Output is not. Therefore, if you want to use async/await syntax, your whole closure will end up !Send if the handrolled future you call ever has to store it...

I'm sure this is covered in the original RFC, but why not parameterize Futures by their Waker (and/or some phantom type) rather than do everything via virtual dispatch? Not for performance reasons (Waker does virtual dispatch itself), but for type-carrying reasons. This wouldn't hurt the composition of Future at all in practice because literally everyone who could be would be totally parametric over the Waker type... it would just allow futures that store wakers to be parametric in things like Send, Sync, Clone, and lifetime, rather than the current case where we're restricted to the union of what all executor implementations need. And it feels like it would be something that could be done in a Rust Edition by moving all uses of Future to be Future<Waker> instead (and making Waker implement Wake itself to encourage people to switch to W: Wake).
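
As a toy illustration of that parameterization (Wake2, Future2, Ready, and NoopWaker are made-up names for this sketch, not a concrete proposal): the future is generic over its waker type, so bounds like Send, Sync, or Clone flow through only where a waker is actually stored:

```rust
// Toy sketch: the future trait is generic over its waker type,
// so Send/Sync/Clone requirements are carried by the type system
// instead of being fixed once and for all by a concrete Waker.
trait Wake2 {
    fn wake(&self);
}

trait Future2<W: Wake2> {
    type Output;
    // Option<Output> stands in for Poll just to keep the sketch small.
    fn poll(&mut self, waker: &W) -> Option<Self::Output>;
}

// A combinator stays fully parametric over W, so a !Send waker type
// would propagate a !Send bound only to futures that store it.
struct Ready<T>(Option<T>);

impl<T, W: Wake2> Future2<W> for Ready<T> {
    type Output = T;
    fn poll(&mut self, _waker: &W) -> Option<T> {
        self.0.take()
    }
}

struct NoopWaker;
impl Wake2 for NoopWaker {
    fn wake(&self) {}
}

fn main() {
    let mut f = Ready(Some(7));
    println!("{:?}", f.poll(&NoopWaker)); // Some(7)
}
```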

I guess the reason the above approach was not taken is due to then having to infer the Waker generic parameter for async/await, and worrying about incompatibility if there were different Waker types? I think in practice--for almost all purposes--the type for every Future would either be something parametric in some combination of Wake+Send+Sync+Clone, something like dyn Wake, or would just be Waker which implements all of those itself (this would be due to the Rust edition migration), so there would indeed be one most restrictive generic parameter we could use. I further think there's pretty good evidence (based on the insane async/await errors people get) that people are not really writing down concrete Future types very much in the first place except when implementing poll or executor stuff, but I could be very wrong on that...

In any case, I think giving an error in the situation where you use two different Futures in the same async/await block that for some reason require totally different, explicit Wake implementations, would be pretty reasonable, and also that encouraging people to just be parametric in the Waker would have essentially no downsides in most cases. But this is a very long discussion so I'd like to hear people's thoughts, as it seems pretty likely I've missed something here.

withoutboats commented 2 years ago

(NOT A CONTRIBUTION)

Contrary to my opinions in 2018, nowadays I do think it would be preferable to support single threaded wakers if possible, for users who have strong certainty that their workloads benefit from a shared-nothing thread-per-executor architecture. The fact that Context implements Send and Sync, making this impossible, is a glaring mistake that we did not think hard enough about. I am in favor of at least testing the waters on a breaking change to fix this (i.e. cratering and issuing a lint now to determine who would be broken).

The fundamental idea as I understand it is that Context would get a from_local_waker constructor and a local_waker method. A Context constructed from a LocalWaker would panic when the waker method is called. Because a Waker is inherently valid as a LocalWaker, the local_waker method would always work. Fundamentally this is not that different from how the pre-1.0 API worked (both panic if you try to get a Waker from a LocalWaker), but it's less confusing because users do not have to "pass through" LocalWaker to get a Waker.
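
In API terms, that proposal might look something like this (a hypothetical sketch of the signatures only; LocalWaker and these methods are not real stable std items, and bodies are omitted):

```rust
// Hypothetical sketch only, not real std API.
impl<'a> Context<'a> {
    // Existing constructor: a thread-safe Waker works everywhere.
    pub fn from_waker(waker: &'a Waker) -> Context<'a> { /* ... */ }

    // Proposed: build a Context from a !Send + !Sync LocalWaker.
    pub fn from_local_waker(local_waker: &'a LocalWaker) -> Context<'a> { /* ... */ }

    // Proposed: always succeeds, since every Waker is valid as a LocalWaker.
    pub fn local_waker(&self) -> &'a LocalWaker { /* ... */ }

    // Existing method: would panic if this Context was built from a LocalWaker.
    pub fn waker(&self) -> &'a Waker { /* ... */ }
}
```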

I think as time goes on, it becomes more and more likely that this breakage will be too much to stomach. I think the async team should take this seriously and move quickly to assess whether a breakage is possible. I expressed this opinion privately to a few people more than a year ago. Since this has such a strong time component, I'm disappointed there hasn't been faster movement on it.

withoutboats commented 2 years ago

(NOT A CONTRIBUTION)

I'm sure this is covered in the original RFC, but why not parameterize Futures by their Waker (and/or some phantom type) rather than do everything via virtual dispatch?

This was also a fair question. To be honest, this choice was inherited from the pre-async/await version of Future. Back then, I think the main benefit was that it was anticipated there would be a lot of people writing -> impl Future functions, and having to explicitly parameterize those functions by W: Wake definitely seemed not worth it for what at the time was imagined as just monomorphizing the wake call. I'm not sure the distinction between Send and !Send wakers was recognized yet; when that decision was made I think tokio was also an executor-per-thread runtime and there wasn't such a thing as a Send Waker. Even getting the waker passed in as an argument was quite a fight, with some members of the community passionately in favor of using a global/thread-local to store the waker.

Nowadays, -> impl Future doesn't seem as relevant. But with stream, -> impl Stream still is; even if generators are some day implemented, it seems plausible that sometimes -> impl Iterator and -> impl Stream will still make sense. Even if you don't mind making it a pain to write -> impl Stream by adding a parameter to the trait, I think adding a Wake parameter is not at all worth the churn here now that Future has been stabilized. It's possible to support !Send wakers without changing the definition of Future at the expense of a runtime check for trying to send a !Send waker.