Open carllerche opened 4 years ago
Re-introduce a static runtime using a feature flag [..] The primary danger here is silently ending up with the "split application". Since feature flags are additive, a library or component may include the rt-global feature flag and the application does not know it is accidentally using the global runtime.
So one mitigation here is to address the "silent" aspect (emphasis added).
The global runtime could announce its creation (via log, stderr, etc - details important, but for later) unless suppressed explicitly by the developer because they want the behaviour. Suppression markers could include:
Cargo.toml,
rather than a dependency library (if it's possible to tell, somehow?)#[tokio::main]
So, basically, squeal rather than panic!
Finally, general observability and debugging aids (so I can see what runtimes are instantiated, and which tasks are using them). Again this might be addressed by simply changing some log levels and documentation/guidance on things to check in logs for your debug builds and test runs.
- that the feature was defined in the top-level binary crate
Cargo.toml,
rather than a dependency library (if it's possible to tell, somehow?)
I don't think there's really any way to determine this, unfortunately.
From these options, I (as someone who has definitely hit this a few times and remembering being pretty confused) feel that the current situation is the most attractive if the error message was sufficiently good. Maybe we could even include a link to a documentation page that lays out (a) ways to fix it and (b) background why this is the case (as laid out here -- thanks for writing it up so clearly!).
I think a good error message combined with good documentation can alleviate this problem. Looking at the docs for the runtime module, I feel it's not as clear as to what is going on under the hood as reading this issue.
enter
as showed above doesn't appear on that page either.!Send
tasks and doesn't mention LocalSet
, nor how it can work together with it. Runtime
uses AssertUnwindSafe
on user's code, but doesn't mention Unwind safety in the docs.Handle
will silently drop tasks if the Runtime
is no longer running, but that isn't mentioned. The general docs for Handle
are one phrase, and don't really mention why it exists and when (not) to use it.All in all I think the OP makes it clear there is no magical solution for this. The robust type system way is to pass a Handle
around, but it's understandable that is not necessarily convenient and comes at a perf cost. Thus I think solid docs and error messages are paramount.
Alternatively a guide level documentation on tokio.rs and a link in the API docs that really walks through the reasoning behind the Runtime design might be a good solution.
Hi there, as a Tokio user I don't have a particular positive preference about which strategy to use, but I have a very strong negative preference against adding a feature flag, because of the "spooky action at a distance" disadvantage mentioned above.
Tokio's features are already very complicated and it's easy to end up with things only working accidentally (your code uses some part of Tokio that's feature-gated without setting the feature, but it works anyways because some Tokio-using library in your workspace set the feature for you). Although I am sympathetic to the concerns that led to this design, as a user I find it extremely confusing and I think that adding new features will make the problem worse.
@hdevalence
Tokio's features are already very complicated and it's easy to end up with things only working accidentally (your code uses some part of Tokio that's feature-gated without setting the feature, but it works anyways because some Tokio-using library in your workspace set the feature for you).
One option would be to enable full
in your app and just assume everything is present.
Regarding adding a feature flag, what if the feature flag didn't enable automatic starting of the static runtime. Instead, a user that wanted the static runtime would need to specifically start one (maybe w/ a runtime::Builder
option). This would reduce the chance of something happening accidentally.
Instead, a user that wanted the static runtime would need to specifically start one (maybe w/ a
runtime::Builder
option).
How is this noticeably different than #[tokio::main]
?
My understanding is that the general source of confusion we're trying to solve here is "panics occur when users don't take explicit action to create a runtime". Providing a static runtime to be started via a builder option or something could be useful, but I don't it really solves that problem?
My guess, for #[tokio::main]
is that it could be an option:
#[tokio::main(static)]
async fn main() {
}
or something...
In general, I think we can agree that feature flags should not change behavior (I am aware I failed some on this front in 0.2).
A few questions and opinions on this topic, not especially organized:
The panic message should be more detailed if it's kept in. It can happen a lot of different ways, which may make it difficult to explain well without confusing the reader with irrelevant information:
tokio
runtimeRuntime
, but trying to do some setup work outside the runtime first (e.g. as seen in seanmonstar/reqwest#778)Runtime
, but left it - I'm not sure exactly which of spawn_blocking
, block_in_place
, etc. "leave" which parts of the runtime. Clearly thread::spawn
does.What is the consequence of polling a runtime-bound resource from a different runtime after it's already been created?
Having both implicit and explicit constructors available, or multiple equivalent code paths that do the same thing, leads to confusion.
TcpListener
should either take an explicit Handle
or use the thread-local - but not have one constructor for each option.Handle
is not a method parameter, it is not clear which methods or modules require a tokio runtime and which do not unless you try one and it panics. For this reason I have a slight preference toward explicit Handle
, but I also agree it's probably too verbose for most use cases and everyone will just use Handle::current()
most of the time.tokio
should not start a background runtime automatically. This seems like too big of a footgun, such as a library or application working fine in isolation but breaking or performing badly when combined with other applications/libraries when multiple runtimes compete against each other for CPU time.
Hope it's not to off topic, but I just remembered other gotcha's I ran into:
block_on
within a task spawned on a runtime panicsIs there a way to prevent tasks or resources cross runtime boundary? E.g, resource is owned by this runtime, if you need to await it on another runtime, then an explicit bridge must be created. Or any sharing between runtime must through some runtime(or even crate) independent channel.
I will try to answer some questions that have been asked here.
What is the consequence of polling a runtime-bound resource from a different runtime after it's already been created?
The main issue is not that it is polled from a different runtime — rather the issue is that you now have two runtimes. A Tokio runtime spawns a thread for every cpu core, so if you spawn two runtimes, you have more threads than you have cpu cores, which can lead to inefficiencies.
There's also the fact that it guarantees that your IO wont be handled on the same thread as where the future is polled. It is more efficient not to cross a thread boundary. Of course, this doesn't always happen with a single runtime either unless you use the single-threaded scheduler, but it does happen for some of the tasks.
Is there a way to prevent tasks or resources cross runtime boundary?
Not really.
I'm not sure exactly which of
spawn_blocking
,block_in_place
, etc. "leave" which parts of the runtime. Clearlythread::spawn
does.
All threads managed by Tokio are inside the runtime context, so this includes spawn_blocking
and block_in_place
. Calling tokio::spawn
and friends from inside them should work.
Threads spawned by other means are not inside the context unless they explicitly enter it with enter
.
Hope it's not to off topic, but I just remembered other gotcha's I ran into:
block_on
within a task spawned on a runtime panics- runtimes shouldn't be dropped from async context.
Yeah well this is just the sort of thing you fundamentally have to avoid in async/await, because tasks must regularly yield control back to the executor to allow other tasks to run, and block_on
doesn't do that. There is block_in_place
, but it's a pretty big footgun.
I added a new option:
tokio-global-runtime
crateThe specific name of the crate would need to be massaged. The idea is to have a separate crate define the static variable. In this case, users who wish to use a statically defined runtime would depend on this crate.
The main downside as I see it to this is that it makes the global runtime less discoverable.
@tokio-rs/maintainers I think we should move forward w/ this by adding a rt-global
feature flag that is included w/ full
. Thoughts?
I support option #1 to avoid changing behavior/APIs and instead improve the error message where possible. This feels more aligned with the general behavior of the Rust ecosystem -- explicit and informative error messages -- instead of trying to pave the road in real-time for people who don't know they're going the "wrong" way.
I feel like there are a bunch of somewhat independent issues being addressed together here, which isn't really helping. For the issue of beginners (or experienced users who are just forgetful -- this has definitely happened to me), just making the error for trying to initialize resources that require a runtime more explicit or clear will get a lot of mileage.
I'm not a fan of the feature flag approach, since additive nature of feature flags mean that it is pretty easy to get a runtime going without being aware of it.
It's a little unclear to me whether having a single static runtime would still need to depend on thread locals, or whether there would perhaps be some level of performance benefit to having a static runtime. If that is the case, I think we should definitely offer that as an option. I don't find the "lack of discoverability" argument against separate crate all that convincing -- there are plenty of ways we could advertise that option in documentation.
I quite like the explicit of &Handle
arguments, though it's a little unclear to me how pervasive (and thus potentially unergonomic) these would be in practice.
I quite like the explicit of &Handle arguments, though it's a little unclear to me how pervasive (and thus potentially unergonomic) these would be in practice.
I would tend to agree with this being the most rigorous option. Maybe you only need to have the dependent types require a handle in their constructor, not in any of the other methods.
It is however a fundamental design change for tokio. A lot is build around the global executor that is available with tokio::spawn
rather than passing around an executor/runtime. Further more the Runtime
type is a catch all type, which makes it impossible to specify at the type system level that one needs a single vs multithreaded executor, a reactor or not, a timer thread or not, etc.
If you wanted this really this to be a consistent design, you'd need types like:
CurrentThreadExec
ThreadPoolExec
Reactor
Timer
with users being explicit about what they need. That makes it clear that tokio is a combination of providers (Runtime
) as well as consumers (eg. TcpStream
). Which makes one wonder why these aren't separated, and why we can't swap out providers for other implementations (eg. take impl Spawn
as param). I have gone this way with async_executors
as an external library rather than hoping to change libs like tokio and async-std to this vision. It works all right for executors, but because of the tight integration of the reactor and timers, it's rather hard to do those from the outside.
This poll seems to indicate that in theory most people are in favor of explicitly passing things around, but in practice, the two big executors are global, and the log
crate sees much more use than slog
even though slog is an excellent library... so in practice convenience seems to win over design purity I suppose.
ps: which brings us to the point that tokio used to have an Executor
trait, but it was let go.
the log crate sees much more use than slog even though slog is an excellent library...
As someone who moved from slog
to tracing
, I think this is a dubious point of reference; I log messages way more often than I spawn tasks.
I agree with https://github.com/tokio-rs/tokio/issues/2435#issuecomment-831471416 and https://github.com/tokio-rs/tokio/issues/2435#issuecomment-831345349 that keeping the current behavior but improving the error message would get us most of the way there. I see that as the option with the fewest downsides.
@najamelan but also remember the idea of something like tokio or async-std is to attempt to hide those details and just have it work. If we expose them we make the system much more complex and harder to learn. You will lose most users in the first hour than those that need the more advanced features.
@davidpdrsn you want to take on error messages? That should happen either way.
@carllerche sure thing 😊 I'll look into it.
This poll seems to indicate that in theory most people are in favor of explicitly passing things around, but in practice, the two big executors are global, and the
log
crate sees much more use thanslog
even though slog is an excellent library... so in practice convenience seems to win over design purity I suppose.
Another point regarding thread-locals vs explicitly passing handles/contexts that I'd like to briefly point out: a lot of people think that explicitly passing arguments is "faster" than using TLS, because there's significant overhead from accessing thread-local storage relative to an argument that was passed into a function. It's correct that accessing the TLS var has a performance impact, but the thing that this perspective overlooks is that in many cases, the argument has to be threaded through several layers of function calls where it's not accessed before it reaches the function where it is accessed. I've actually heard people report that they've seen noticeable performance improvements switching from slog
, where logging contexts are passed as arguments, to tracing
, where spans are stored in TLS, because their programs are spending a lot of time copying the explicit argument on the stack. I think something like this may have an even bigger performance impact for task spawning. The typical program probably logs significantly more frequently than it spawns tasks, so in the case of passing around a handle argument for spawning tasks, I'd guess that the ratio of context-propagating function calls to context access for task-spawning handles is even more in the favor of TLS than in the case of tracing
vs slog
.
@hawkw Interesting. I admit I have not looked into the perf difference at all. I rarely write code so sensitive that passing a function param becomes an issue. Just wondering though aren't the first few params passed in registers instead of being copied on the stack? Haven't looked into calling conventions for a while.
@hawkw Interesting. I admit I have not looked into the perf difference at all. I rarely write code so sensitive that passing a function param becomes an issue.
This is actually not really the kind of performance issue that shows up in extremely performance sensitive hot code. Instead, it's a very small overhead that accumulates in large programs. From the conversations I've had with people who replaced slog
with tracing
, it's less that they noticed parameter passing for logging contexts was a major source of overhead in performance-sensitive code, and more that they were just pleasantly surprised that switching logging libraries 'magically' made their application a bit faster. :)
Just wondering though aren't the first few params passed in registers instead of being copied on the stack? Haven't looked into calling conventions for a while.
Parameters are passed in registers, if they fit in a register (e.g. they are one word or less). So, integers and pointers will likely be passed in registers; structs which contain more than a word of data generally won't be (although the optimizer may try to pass them in xmm
simd registers on x86). This is why it's often good to pass large structs by reference, when possible --- the reference will fit in a register even if the struct itself is much too large to. But, passing parameters in registers introduces overhead, too, although it's less than on the stack --- it may increase the number of mov
s in the function.
Anyway, this isn't directly relevant to this conversation, I just thought it was an interesting point to bring up.
TL;DR: As the most backwards compatible, yet improved version, I think implicit global runtime + ability to create explicit scope would be a good compromise, similarly to rayon
's API.
I would probably prefer to see &Handle
in the future, though I didn't use Rust/tokio when that API was current and not sure how cumbersome API will really become. The reason is that I like how most things in Rust work if they compile and tokio
is the opposite of this. I have hit missing runtime in various threads many times in the past and while I got used to it with time and created wrappers that for example add tokio context to rayon's threads, it is still one more annoying thing to worry about all the time and is a major source of surprising behavior. I bet it is asignificant challenge for less experienced Rust engineers contributing to large multithreaded projects.
tracing
vs slog
was mentioned above and I have nearly the same experience with them: with slog
I always know the context of things, with tracing
I regularly forget to instrument futures and getting missing/unexpected spans as the result, which is an issue that fundamentally doesn't exist with slog
.
Global runtime is actually nice compromise from user experience point of view and feels similar to what rayon
has. In rayon
one can always create custom thread pools and explicitly run code under it if desired. This results in some challenges like occasional stack overflows when things are hitting various edge-cases of rayon
(calling thread_poll.install()
from thread_pool.install()
is problematic for example), but for the most part works reasonably well. With selection between global and explicit I think the behavior is the most predictable.
To further decrease surprises global runtime creation may require explicit creation, such that one wouldn't accidentally use two runtime if they didn't mean to.
I agree with comments above that behavior should not change depending on features because feature unification is also a major source of surprising behavior in Rust ecosystem resuling in things sometimes working and sometimes not (especially in monorepos).
Method variants if introduced alongside regular methods will not fix the issue for most users since Tokio ecosystem is large and we often rely on indirect dependencies utilizing tokio-dependant crates. If there are two ways to call things, it is safe to assume there will be plenty of cases where variant with explicit &Handle
will simply not be exposed by the crate and there will be no way around it other than forking if maintainer is not particularly responsive.
Context
Tokio's resource types (
TcpStream
,time::delay
, ...) require a runtime to function. Users interact with the resource types by awaiting on results. The runtime receives events from the operating system related to these resources and dispatches the events to the appropriate waiting task.Tokio does not implicitly spawn runtimes. It is the user's responsibility to ensure a runtime is running. This is usually done using
#[tokio::main]
but creating a runtime manually is also possible.Multiple runtime flavors are provided. There is a multi-threaded, work-stealing, runtime. This runtime spawns multiple threads. It is the recommended runtime for a number of cases, including network server applications. There is also a single-threaded "in-place" runtime. This runtime spawns no threads and is useful for cases like implementing a blocking interface for an async client. This is how the
reqwest
crate's blocking module works.Additionally, there are cases in which it is useful to have a single process start multiple runtimes of the same flavor. For example,
linkerd
currently uses a pair of single-threaded runtimes, one for data-plane forwarding work and the other for control-plane tasks (serving metrics, driving service discovery lookups, and so on). This helps keep the two workloads isolated so neither can starve the other, and allows the use of!Sync
futures that avoid the overhead of synchronization, without making the entire application single-threaded. The ability to run multiple runtimes in the same process is an important feature for many use-cases.The problem
When using a
TcpStream
, the resource type must reference a runtime in order to function. Given that Tokio may have any number of runtimes running in the current process, the resource type must have some strategy by which it can select the correct runtime.Currently, this is done by using a thread-local to track the current
Runtime
. In many cases, a process only includes a single runtime. A problem arises when attempting to use a resource from outside of a runtime. In this case, it is unclear which runtime the resource type should use and Tokio willpanic!
.The strategy for fixing this is to enter a runtime context using a
runtime::Handle
.This panic is the source of confusion for users who are not aware of how Tokio searches for the runtime.
History
In the very early days,
tokio-core
always required an explicitHandle
. There was no context, thread-local or global, that stored the "current reactor". This resulted inHandle
being a field on virtually every single type or an argument to every single function. This was tedious given that, in most cases, there was only ever a single tokio-core reactor in the process. In some cases, it resulted in measurable performance degradation as theHandle
field increased struct size.Because of this,
tokio-core
started providing a static runtime. Resource types would default to using this static runtime. Resource types also included method variants that took an explicit&Handle
, allowing the user to specify a custom runtime.The primary problem with a static runtime is that it cannot easily be configured. However, all users who ended up configuring their runtime were forced to use the more verbose APIs with an explicit
&Handle
argument. Additionally, some libraries did not provide methods with an explicit&Handle
argument, preventing them from being used with custom runtimes.To solve these problems, Tokio added a thread-local tracking the "current" runtime. Now, resources would first check the thread-local and if it was not set, it would use the global runtime. This introduced a new problem. Users that intended to use a custom runtime would accidentally use their resource types from outside of their custom runtime, which would start the global runtime and shift their parts application to the global runtime. The worst part of this is everything "seemed to work" but was not doing what the user intended. Half the application ran on a static runtime with default configuration and the other half on the configured runtime. Usually, nothing was noticed until poor performance was noticed in production.
The final iteration, resulting in the Tokio of today, was to remove the concept of the global runtime in favor of the thread-local context. This prevents users from accidentally being shifted to the global runtime and things appear to work, but are in a degraded state. The consequence of this change is that attempting to use Tokio's resource types from outside of a runtime results in a panic.
Options
There are a few ways forward from here. These options are not mutually exclusive. This issue is to discuss ways forward. Feel free to propose alternate strategies as well.
Change no behavior and improve the panic message
The thread-local context logic can remain unchanged. Instead, the panic message is improved to include more context about the problem and some options for fixing it.
Re-introduce a static runtime using a feature flag
In this case, a static runtime is re-introduced. However, it is guarded by a feature flag:
rt-global
.rt-global
would also be included in thefull
meta feature.When
rt-global
is enabled, Tokio resources would first check the thread-local context. If it is set, the current runtime is used. If there isn't one set, then the global runtime is used.The primary danger here is silently ending up with the "split application". Since feature flags are additive, a library or component may include the
rt-global
feature flag and the application does not know it is accidentally using the global runtime.Provide a separate
tokio-global-runtime
crateThe specific name of the crate would need to be massaged. The idea is to have a separate crate define the static variable. In this case, users who wish to use a statically defined runtime would depend on this crate.
The main downside as I see it to this is that it makes the global runtime less discoverable.
Re-introduce
&Handle
method variantsIn this case, all async methods include a variant that takes an explicit
&runtime::Handle
. Users who want to ensure a runtime exists and it is the correct runtime may opt to be explicit about the runtime used by specifying it.This doesn't really solve the problem that users are confused when calling
TcpStream::connect
panics as it requires them knowing they must call the&runtime::Handle
variant. However, improvements to the panic message would include mention of this strategy.