Yes, it's possible we should change the logic for when other sleeping threads are woken up. I worry whether it's possible to fix this without increasing the number of wakeups significantly.
Alternatively, it might be nice to detect these cases and report them w/ logs / metrics. Even better would be some compiler lints.
Here's my theory for how the deadlock occurred in this example. The first thing to note is that even if multiple worker threads are idle, only one of them is waiting for IO/timer events. The remaining threads are just sleeping using park. The way the runtime decides who gets the IO driver is by putting it behind a mutex and having each thread call try_lock before going to sleep, sleeping using the driver if it succeeds. Threads that fail the try_lock call just sleep using park.
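To make the mechanism above concrete, here is an illustrative sketch (not Tokio's actual source; the types are stand-ins) of the parking decision:

use std::sync::Mutex;

// Stand-in for the real IO/timer driver.
struct IoDriver;

impl IoDriver {
    fn park_on_events(&mut self) {
        // In the real runtime this blocks on epoll/kqueue/IOCP until an event fires.
    }
}

// Only the worker that wins the try_lock drives IO/timers;
// every other idle worker just goes to sleep via park().
fn park_worker(driver: &Mutex<IoDriver>) {
    match driver.try_lock() {
        // This worker becomes the one waiting for IO/timer events.
        Ok(mut io) => io.park_on_events(),
        // The driver is taken, so this worker sleeps without watching IO at all.
        Err(_) => std::thread::park(),
    }
}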
Now, given the above, here's what I think happened: the worker holding the IO driver ended up running the task that blocks, and every other idle worker was simply sleeping in park, so no thread was left to poll IO/timer events. The first task is still idle. As far as I can tell, there is no bug in tokio. The behavior in the example code could have happened before; it was just far less likely due to erroneous thread wake-ups that were removed in 1.16.
Fundamentally, the issue is that when one accidentally blocks the runtime, it would be nice to know that something went wrong and to tolerate it gracefully. Currently, Tokio does not offer a reliable way to do this; it just "happened" to do so in some cases earlier.
The only reliable way to tolerate accidental blocking that I can think of is to have a dedicated thread monitoring the state of each worker thread. If a worker gets stuck for too long, this monitor thread could warn and move all tasks currently on the stuck thread to a new thread.
This almost feels like a limitation of the polling model. I don't see a good way to do this without a future uring-based driver-per-worker runtime flavor.
Well, one challenge with per-worker solutions is that if each resource is tied to a specific core, then that resource will not receive wakeups if that core is blocked.
One thought: We don't necessarily need a separate thread for monitoring this. We could make sure that whenever at least one thread is idle, one of the idle threads without the IO driver has a monitor timeout on its park call to allow it to steal the IO driver if the other thread is blocked.
Also, for anyone who wants a solution now, you can spawn your own monitor thread. See here for an example.
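A rough sketch of what such a monitor thread could look like (this is not the linked example; the names and the interval are arbitrary, and it relies on the assumption that spawning a task from outside the runtime notifies a parked worker, which can then take over the IO driver):

use std::time::Duration;

// A plain OS thread, outside the runtime, that periodically spawns a no-op task.
fn spawn_monitor(handle: tokio::runtime::Handle) {
    std::thread::spawn(move || loop {
        std::thread::sleep(Duration::from_millis(100));
        // Spawning from a non-runtime thread wakes a parked worker, giving it a
        // chance to pick up the IO driver if the current holder is blocked.
        handle.spawn(async {});
    });
}

#[tokio::main]
async fn main() {
    spawn_monitor(tokio::runtime::Handle::current());
    // ... rest of the application ...
}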
Even if you don't block the runtime forever, wouldn't this possibly affect latency when it happens? Any task that becomes ready after the thread that previously held the IO driver starts running a task needs to wait until it finishes so that it can poll the IO driver again, even if you have available cores that could be handling that task.
I think ideally there should always be something parked waiting on the IO driver whenever there are idle threads.
One thought: We don't necessarily need a separate thread for monitoring this. We could make sure that whenever at least one thread is idle, one of the idle threads without the IO driver has a monitor timeout on its park call to allow it to steal the IO driver if the other thread is blocked.
Along the lines of @Darksonn's suggestion here: if a worker is currently holding the IO driver and it's transitioning to start polling its own tasks, shouldn't it try to wake a parked worker to steal the IO driver? It seems like, even if none of the tasks on the worker that's holding the IO driver will block, there's still potentially an entire scheduler tick of latency until IO is polled again, and we could avoid that by eagerly giving the IO driver to a parked thread...
Hmm. Good point.
Re-using an existing thread would be the strategy; however, it still needs to be an option, as waking up a thread on an interval just to do a check is fairly expensive for apps that are mostly idle.
Any updates on this? I might be running into this issue currently.
There haven't been any updates.
I think it would be a good idea to try either of the solutions from @hawkw and @Darksonn and hide them behind builder flags on tokio_unstable, kind of like with the LIFO slot. We could then have a way of looking at the performance impact on other types of applications without needing to first stabilize this behavior.
I think I ran into the same issue.
I have a piece of code that does an async call (uploading a binary blob to S3). This code is tested and works correctly in a simple test program. However, when using this crate (this piece of code) in a context with some extra dependencies (also using async code), it halts the moment the future that uploads the binary blob is awaited. The upload code is completely independent of the newly added dependencies, so it cannot be affected by them.
The explanation of @Darksonn on June 2nd (no thread is watching the IO/timer events) could also explain the halted state of my program. I would welcome an option/feature to detect/analyze these kinds of mysterious halting issues in the Tokio runtime (possibly as a feature that can be turned on/off, as suggested by @Noah-Kennedy).
Generally, the cause of the halting is a task that blocks the thread. You should be able to detect those using tokio-console. Tasks that block the thread show up as "busy duration" increasing while the number of polls remains constant.
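A minimal setup for that looks roughly like the sketch below (assuming the console-subscriber crate and a build with the tokio_unstable cfg, as described in its documentation):

#[tokio::main]
async fn main() {
    // Starts the instrumentation layer that the `tokio-console` CLI connects to.
    console_subscriber::init();

    // ... rest of the application; a task that blocks its thread then shows up
    // with growing busy time while its poll count stays flat ...
}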
I did all the steps proposed by @Darksonn and analyzed using tokio-console combined with gdb to take snapshots at different points in the program's execution. This analysis shows:
Observations via tokio-console: On the resources overview I could not find an IO-driver resource and thus could not locate the owner of that resource.
Only gdb reveals the http-timeout error when using breakpoints: I have a busy loop burning cycles on .await forever when running the program directly. When running in a debugger with breakpoints, the program stops after some time with an http-timeout. However, without breakpoints, the program in the debugger also loops forever without hitting this timeout.
Same S3-upload code works well in the test program: When using tokio-console to run a test that uploads a small file, I see the same behavior of spinning up two tasks, but in the test program everything performs well and the file is uploaded (unless I introduce too much delay in the debugger, in which case I also trigger the http-timeout).
However, the test program only has four tasks, while the program above has 12 tasks (due to the Parquet dependencies used to generate/compute the output file), so I did not find a way to reproduce the issue in a small program. At this moment I can only think of two causes:
I did add some changes to the stack and heap by adding some variables, which introduces a kind of memory-layout randomization, but this had no effect, which seems to make option 2 unlikely. So that leaves me with option 1, which I do not know how to analyze, or an option 3 that I do not know yet.
Further analysis revealed my issue is option 3. I am using a mix of sync and async code. The three patterns described in https://tokio.rs/tokio/topics/bridging were difficult in my context, so I tried to do it with block_on. However, tokio::runtime::Runtime::block_on is not allowed, so I assumed that futures::executor::block_on would call into the installed async runtime (Tokio). It seems, however, that it runs its own executor and bypasses Tokio, which probably causes the issues.
My alternative is using tokio::runtime::Handle::spawn to spawn a task and a oneshot channel to wait for the result. However, I still have some work to do to please/fight the Rust borrow checker on this solution. This feels like a complex solution to a simple problem, but I currently see no other options.
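For reference, a minimal sketch of that approach (do_async_upload and upload_blocking are placeholder names, not my real code); note that the blocking wait must happen on a thread that is not a runtime worker, otherwise it runs into the same blocking problem:

use tokio::runtime::Handle;
use tokio::sync::oneshot;

async fn do_async_upload(_blob: Vec<u8>) -> Result<(), String> {
    // Placeholder for the real S3 upload future.
    Ok(())
}

// Called from synchronous code that has access to a Handle of the runtime.
fn upload_blocking(handle: &Handle, blob: Vec<u8>) -> Result<(), String> {
    let (tx, rx) = oneshot::channel();
    handle.spawn(async move {
        let _ = tx.send(do_async_upload(blob).await);
    });
    // Block the synchronous caller (not a runtime worker!) until the task finishes.
    rx.blocking_recv().expect("upload task panicked or was dropped")
}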
My issues mentioned in January/February in this thread (related to mixing synchronous and asynchronous code) were resolved by https://github.com/cvkem/async_bridge, a generic solution I developed to make mixing the two easier. This helped me resolve my issues back then and served me well during further development. I added documentation and learnings, and I hope this also helps others move forward.
if a worker is currently holding the IO driver and it's transitioning to start polling its own tasks, shouldn't it try to wake a parked worker to steal the IO driver?
I wonder why we don't create a dedicated thread to run the IO driver?
Our current approach results in very little synchronization overhead and good tail latency under load. The background thread approach is much worse on both metrics, and is ultimately far slower and less effective for production workloads.
Spent some time looking into this; sharing some findings/thoughts just in case.
Created a minimal reproducible example for the "benign case", i.e. where we see an impact on latency without a deadlock:
use std::time::Duration;
use std::time::Instant;

async fn handle_request(start_time: Instant) {
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;
        println!("background job start");
        // Adjust as needed
        for _ in 0..1_000 { vec![1; 1_000_000].sort() }
        println!("background job finished");
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async { background_job().await });
    loop {
        let start = Instant::now();
        tokio::spawn(async move {
            handle_request(start).await
        }).await.unwrap();
    }
}
Output:
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 500ms
background job start
background job finished
request took 1620ms <==== high-latency request
request took 501ms
Reproduced the issue on an ARM server and on the Rust playground.
I do not know what the Rust playground uses, but the server has 32 cores.
Handing over the driver before running the task seems to work. I'm not particularly familiar with the codebase, but after reading the comments above a couple of times and going through the code, I noticed that notify_parked_remote() seems to do what we need; i.e. it wakes up a worker, handing it the driver in passing. So I tried adding a call to notify_parked_remote() before the task runs, and it does work in the sense that "cargo test" is happy and I'm no longer able to reproduce the issue.
i.e. made the change below to worker.rs:573.
// Make the core available to the runtime context
*self.core.borrow_mut() = Some(core);
+ // Probably the wrong way to do this
+ self.worker.handle.notify_parked_remote();
// Run the task
...
As mentioned in other comments, creating a background thread that periodically launches threads does mitigate the issue (a workaround).
(not particularly well informed)
From the comments above, I think everyone agrees that "this is not a good thing"; perhaps it's a sacrifice made to get some other (good) properties in return. Without knowing how much this affects real-world workloads, it's hard to say how big of an issue this is. Overall, I think this behavior is undesirable and counter-intuitive for a work-stealing scheduler. I guess we could have a separate issue called "one benign task can halt all executor progress for a while".
I wonder if the tracing feature or tokio-metrics provides some visibility into whether this issue is affecting a given workload or not. It's very possible that I missed such a metric/span/event, but this might be a blind spot. In the past I've seen some high-latency requests that I could not explain; it would be useful to have a metric we could use to assess the concrete impact of this issue in the real world: a proxy for how much time the IO driver is both 1) ready and 2) held by a worker that is running a task.
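For what it's worth, a rough sketch (assuming the tokio-metrics TaskMonitor API) of per-task sampling, which can at least surface tasks whose polls take unusually long, even though it is not the driver-level proxy described above:

use std::time::Duration;

#[tokio::main]
async fn main() {
    let monitor = tokio_metrics::TaskMonitor::new();

    // Wrap the future we want visibility into.
    tokio::spawn(monitor.instrument(async {
        // ... the workload under suspicion ...
    }));

    // Periodically sample and print the accumulated task metrics;
    // long/slow polls hint at a task blocking its worker thread.
    for metrics in monitor.intervals() {
        println!("{:?}", metrics);
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}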
I tried the multi-threaded-alt scheduler and the issue is also present there, but I wonder if the alt scheduler would be the right place to ship a fix for this issue. The alt scheduler is only available on unstable, which means it's already behind an option. (Just an idea; I do not have much context here.)
I'll +1 this.
This is a pretty ugly performance wart. Our team has independently hit this issue twice, where some singular task takes longer than the developer expects and this results in 100ms+ of latency for API requests that we otherwise expect to complete very quickly.
The fact that one pathological task is able to harm a bunch of otherwise good API requests is alarming behavior.
Test code pasted by @santossumii using the async-std runtime:
use std::time::Duration;
use std::time::Instant;

fn main() {
    async_std::task::block_on(async {
        async_std::task::spawn(async { background_job().await });
        loop {
            let start = Instant::now();
            async_std::task::spawn(async move { handle_request(start).await }).await;
        }
    });
}

async fn handle_request(start_time: Instant) {
    async_std::task::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        async_std::task::sleep(Duration::from_secs(3)).await;
        println!("background job start");
        // Adjust as needed
        for _ in 0..1_000 {
            vec![1; 1_000_000].sort()
        }
        println!("background job finished");
    }
}
Output:
request took 500ms
request took 501ms
request took 501ms
background job start
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 500ms
background job finished
request took 501ms
request took 500ms
With async-std, the heavy compute task doesn't halt other tasks. I don't know tokio's implementation, but this seems like a big problem, as users are very likely to hit this issue.
Besides using spawn_blocking to move it to another thread, using block_in_place also works:
use std::time::Duration;
use std::time::Instant;

async fn handle_request(start_time: Instant) {
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;
        println!("background job start");
        // Adjust as needed
        tokio::task::block_in_place(|| {
            for _ in 0..1_000 {
                vec![1; 1_000_000].sort()
            }
        });
        println!("background job finished");
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async { background_job().await });
    loop {
        let start = Instant::now();
        tokio::spawn(async move { handle_request(start).await })
            .await
            .unwrap();
    }
}
Output:
request took 502ms
request took 502ms
request took 502ms
background job start
request took 500ms
background job finished
request took 502ms
request took 502ms
request took 501ms
request took 502ms
request took 502ms
request took 501ms
background job start
request took 502ms
background job finished
request took 502ms
request took 502ms
The documentation of block_in_place says:
/// Calling this function informs the
/// executor that the currently executing task is about to block the thread,
/// so the executor is able to hand off any other tasks it has to a new
/// worker thread before that happens.
If there are more block_in_place calls running than CPU cores, there will still be large latencies; however, that is impossible to avoid, as it is fundamentally caused by limited CPU resources.
Don't know how async-std is dealing with blocking tasks. For Tokio, the responsibility of dealing with blocking tasks is on the programmer.
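For completeness, a sketch of the spawn_blocking variant mentioned above; it moves the CPU-heavy loop onto Tokio's blocking thread pool rather than keeping it on a worker thread:

async fn background_job() {
    loop {
        tokio::time::sleep(std::time::Duration::from_secs(3)).await;
        println!("background job start");
        // The closure runs on the blocking pool; awaiting the handle keeps the ordering.
        tokio::task::spawn_blocking(|| {
            for _ in 0..1_000 {
                vec![1; 1_000_000].sort()
            }
        })
        .await
        .unwrap();
        println!("background job finished");
    }
}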
Good workaround, but we cannot control our third-party libs if they have bad tasks.
Is tokio going to fix this or what?
Look, it isn't so simple. All potential solutions have disadvantages to them. They hurt the performance of well-behaved programs, and also raise questions about forwards-compatibility.
If you want to help, you could investigate how other runtimes handle this problem. For example, what does (did?) Go do if the thread with the epoll instance gets blocked by a tight loop without yield points, and all other threads are asleep? (Runtimes that poll IO from a dedicated thread are not relevant. We are not going to do that.)
This feels somewhat similar to Spectre and Meltdown where Intel was "caught speeding" -- their optimizations were too aggressive for the happy-path (i.e. CPU only loaded with trusted, well-behaved programs) and that opened the door to other issues. Similarly, everything is great with tokio, until it's not.
Common in software engineering, it's not so much the performance or operation of a well-behaved program that matters, but dealing with the exceptional-cases or nefarious actors that requires so much thought. This might be where we're at here.
It would be interesting to know how much of a performance hit the potential solutions have and if it would be possible to feature-flag it. We could benchmark our own app, though it's probably not representative of other workloads, and we wouldn't be able to take this work item until the new year.
This problem is something you get "by default" if you do it the obvious way. It isn't because we have an extra optimization that is invalid.
and we wouldn't be able to take this work item until the new year.
You should not expect a fix for this over the holidays anyway. It's nowhere near that urgent. And there is even the workaround of spawning your own monitoring thread, as I mentioned in a previous comment.
As @Darksonn mentioned, there is no obvious solution to this that comes without additional overhead. Fundamentally, Tokio is a cooperative multi-tasking system, which puts some responsibility on the user. As such, I will close this issue, as there is no "bug" in Tokio.
That said, there would be value in detecting and possibly mitigating poorly behaved tasks on an opt-in basis. I opened #6315 as a tracking issue and referenced this issue. This will help reframe the conversation.
Version: Tested 1.16, 1.17, 1.18.2
Platform: Confirmed on Linux amd64 and M1 Mac
Description
If one task gets stuck in a busy loop, or doing blocking IO, it can prevent all other tasks from being polled forever, even with the multi_thread executor. All that seems to be required is for the bad task to jump threads once, which seems to happen fairly randomly. This is a serious issue because:
I tried this code:
I expected to see this happen:
With 32 threads available, you'd expect one to be blocked, but both tasks should print messages and proceed.
Instead, this happened:
good still alive
will never print again