tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

One bad task can halt all executor progress forever #4730

Closed: moparisthebest closed this issue 9 months ago

moparisthebest commented 2 years ago

Versions tested: 1.16, 1.17, 1.18.2

Platform: confirmed on Linux amd64 and M1 Mac

Description

If one task gets stuck in a busy loop or in blocking IO, it can prevent all other tasks from being polled forever, even with the multi_thread executor. All that seems to be required is for the bad task to jump threads once, which appears to happen fairly randomly.

This is a serious issue because:

  1. it can run for months without triggering, only for your whole program to freeze without it being obvious why
  2. in any non-trivial program, you'll have enough tasks and dependencies that you can't reasonably guarantee none will ever block or busy-loop

I tried this code:

use std::thread;
use std::time::Duration;

//#[tokio::main]
#[tokio::main(flavor = "multi_thread", worker_threads = 32)]
async fn main() {
    let mut handles = Vec::new();

    handles.push(tokio::spawn({
        async {
            loop {
                println!("{:?}: good still alive", thread::current().id());
                tokio::time::sleep(Duration::from_secs(10)).await;
            }
        }
    }));
    handles.push(tokio::spawn({
        async {
            let orig_thread_id = format!("{:?}", thread::current().id());
            loop {
                println!("{:?}: bad still alive", thread::current().id());
                thread::sleep(Duration::from_secs(10));
                loop {
                    // here we loop and sleep until we switch threads, once we do, we never call await again
                    // blocking all progress on all other tasks forever
                    let thread_id = format!("{:?}", thread::current().id());
                    if thread_id == orig_thread_id {
                        tokio::time::sleep(Duration::from_secs(1)).await;
                    } else {
                        break;
                    }
                }
            }
        }
    }));

    for handle in handles {
        handle.await.expect("handle await");
    }
}

I expected to see this happen:

With 32 worker threads available, you'd expect one thread to be blocked while both tasks keep printing messages and making progress.

Instead, this happened:

ThreadId(27): good still alive
ThreadId(23): bad still alive
ThreadId(2): good still alive
ThreadId(2): bad still alive
ThreadId(2): bad still alive
ThreadId(2): bad still alive
*snip*

good still alive will never print again

Darksonn commented 2 years ago

Yes, it's possible we should change the logic for when other sleeping threads are woken up. I worry whether it's possible to fix this without increasing the number of wakeups significantly.

carllerche commented 2 years ago

Alternatively, it might be nice to detect these cases and report them w/ logs / metrics. Even better would be some compiler lints.

Darksonn commented 2 years ago

Here's my theory for how the deadlock occurred in this example. The first thing to note is that even if multiple worker threads are idle, only one of them is waiting for IO/timer events. The remaining threads just sleep using park. The way the runtime decides who gets the IO driver is by putting it behind a mutex and having each thread call try_lock before going to sleep: a thread that succeeds sleeps using the driver, and threads that fail the try_lock call just sleep using park.
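
A minimal sketch of that park decision, with hypothetical type names rather than Tokio's actual internals:

use std::sync::Mutex;

// Hypothetical stand-ins for the real driver and parker types.
struct IoDriver;
impl IoDriver {
    fn park(&mut self) { /* block on epoll/kqueue until an IO or timer event fires */ }
}
struct Parker;
impl Parker {
    fn park(&self) { /* block on a condvar until another worker unparks this thread */ }
}

fn sleep_idle_worker(shared_driver: &Mutex<IoDriver>, parker: &Parker) {
    // Whoever wins the try_lock becomes the single thread watching IO/timer events;
    // every other idle worker parks and relies on being woken up explicitly.
    match shared_driver.try_lock() {
        Ok(mut driver) => driver.park(),
        Err(_) => parker.park(),
    }
}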

Now, given the above, here's what I think happened:

  1. All tasks are idle. Thread 2 has the IO driver.
  2. Thread 2 is woken by the second task's timer. The other threads continue to block on park. The first task is still idle.
  3. Thread 2 starts polling the second task.
  4. When the first timer expires, nobody notices because no thread is sleeping on the IO driver. No work stealing happens because the first task is idle, and will continue to be idle until the IO driver runs.

carllerche commented 2 years ago

As far as I can tell, there is no bug in Tokio. The behavior in the example code could have happened before 1.16 as well; it was just far less likely, due to erroneous thread wake-ups that were removed in 1.16.

The fundamental issue here is that when one accidentally blocks the runtime, it would be nice to know that something went wrong and to gracefully tolerate it. Currently, Tokio does not offer a way to do this reliably; it just "happened" to do this in some cases earlier.

The only reliable way to tolerate accidental blocking that I can think of is to have a dedicated thread to monitoring the state of each worker thread. If a thread gets stuck for too long, this monitor thread could warn & move all tasks currently on the stuck thread to a new thread.

Noah-Kennedy commented 2 years ago

This almost feels like a limitation of the polling model. I don't see a good way to do this without a future uring-based driver-per-worker runtime flavor.

Darksonn commented 2 years ago

Well, one challenge with per-worker solutions is that if the design ties each resource to a specific core, then that resource will not receive wakeups while that core is blocked.

One thought: We don't necessarily need a separate thread for monitoring this. We could make sure that whenever at least one thread is idle, one of the idle threads without the IO driver has a monitor timeout on its park call to allow it to steal the IO driver if the other thread is blocked.

Also, for anyone who wants a solution now, you can spawn your own monitor thread. See here for an example.
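
For reference, a minimal sketch of such a monitor thread (not the example linked above): an async heartbeat task bumps a counter, and a plain OS thread warns if the counter stops moving, which indicates the runtime as a whole has stalled.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let heartbeat = Arc::new(AtomicU64::new(0));

    // Async heartbeat task: bumps the counter periodically as long as the
    // runtime is still making progress.
    {
        let heartbeat = heartbeat.clone();
        tokio::spawn(async move {
            loop {
                heartbeat.fetch_add(1, Ordering::Relaxed);
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
        });
    }

    // Plain OS thread: watches the counter and warns if it stops moving.
    {
        let heartbeat = heartbeat.clone();
        std::thread::spawn(move || {
            let mut last = heartbeat.load(Ordering::Relaxed);
            loop {
                std::thread::sleep(Duration::from_secs(5));
                let now = heartbeat.load(Ordering::Relaxed);
                if now == last {
                    eprintln!("warning: runtime made no progress in the last 5s");
                }
                last = now;
            }
        });
    }

    // ... rest of the application ...
}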

gperinazzo commented 2 years ago

Even if you don't block the runtime forever, wouldn't this possibly affect latency when it happens? Any task that becomes ready after the thread that previously held the IO driver starts running a task has to wait until that task finishes before the IO driver can be polled again, even if there are idle cores that could be handling it.

I think ideally there should always be something parked waiting on the IO driver if there are idle threads.

hawkw commented 2 years ago

One thought: We don't necessarily need a separate thread for monitoring this. We could make sure that whenever at least one thread is idle, one of the idle threads without the IO driver has a monitor timeout on its park call to allow it to steal the IO driver if the other thread is blocked.

Along the lines of @Darksonn's suggestion here...if a worker is currently holding the IO driver and it's transitioning to start polling its own tasks, shouldn't it try to wake a parked worker to steal the IO driver? It seems like, even if none of the tasks on the worker that's holding the IO driver will block, there's still potentially an entire scheduler tick of latency until IO is polled again, and we could avoid that by eagerly giving the IO driver to a parked thread...

Noah-Kennedy commented 2 years ago

Hmm. Good point.

carllerche commented 2 years ago

Re-using an existing thread would be the strategy; however, it still needs to be an option, as waking up a thread on an interval just to do a check is fairly expensive for apps that are mostly idle.

aegooby commented 1 year ago

Any updates on this? I might be running into this issue currently.

Darksonn commented 1 year ago

There haven't been any updates.

Noah-Kennedy commented 1 year ago

I think it would be a good idea to try either of the solutions from @hawkw and @Darksonn and hide them behind builder flags on tokio_unstable kinda like with the LIFO slot. We could then have a way of looking at the performance impact on other types of applications without needing to first stabilize this behavior.

cvkem commented 1 year ago

I run into the same issue I think.

I have a piece of code that does an async call (uploading a binary blob to S3). This code is tested and works correctly in a simple test program. However, when using this crate (this piece of code) in a context with some extra dependencies (which also use async code), it halts at the moment the future that uploads the binary blob is awaited. The upload code is completely independent of the newly added dependencies, so it cannot be affected by them directly.

The explanation of @Darksonn on June 2nd (no thread is watching the IO/timer events) could also explain the halted state of my program. I would welcome an option/feature to detect/analyse these kinds of mysterious halting issues in the Tokio runtime (possibly as a feature that can be turned on/off, as suggested by @Noah-Kennedy).

Darksonn commented 1 year ago

Generally, the cause of the halting is a task that blocks the thread. You should be able to detect those using tokio-console. Tasks that block the thread show up as "busy duration" increasing while the number of polls remains constant.
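
A minimal setup sketch for hooking up tokio-console, assuming the console-subscriber crate and a build with --cfg tokio_unstable (the crate version shown is an assumption):

// Cargo.toml (assumed): console-subscriber = "0.4"
// Build with: RUSTFLAGS="--cfg tokio_unstable" cargo run
// Then attach with the `tokio-console` CLI from another terminal.

#[tokio::main]
async fn main() {
    // Starts the instrumentation layer that tokio-console connects to
    // (it listens on 127.0.0.1:6669 by default).
    console_subscriber::init();

    // ... the rest of the application ...
}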

cvkem commented 1 year ago

I did all the steps proposed by @Darksonn and analyzed the program using tokio-console combined with gdb, taking snapshots at different points of execution. This analysis shows:

Observations via tokio-console

On the resources-overview I could not find an IO-driver resource and thus could not locate the owner of that resource.

Only gdb reveals the http-timeout error, and only when using breakpoints. When running the program directly, I have a busy loop burning cycles on .await forever. When running in a debugger with breakpoints, the program stops after some time with an http-timeout. However, without breakpoints, the program also loops forever in the debugger without hitting this timeout.

The same S3-upload code works well in a test program. When using tokio-console to run a test that uploads a small file, I see the same behavior of spinning up two tasks, but in the test program everything performs well and the file is uploaded (unless I introduce too much delay in the debugger, in which case I also trigger the http-timeout).

However, the test program only has four tasks, while the program above has 12 tasks (due to the Parquet dependencies used to generate/compute the output file). So I did not find a way to reproduce the issue in a small program, and at this moment I can only think of the following causes:

  1. The IO-driver resource is held by one of the blocked tasks (as mentioned in your June 2nd comment).
  2. The Parquet dependency has some unsafe code, which corrupts Tokio's data.
  3. ....

I did add some changes to the stack and heap by adding some variables, which adds a kind of memory-layout randomization, but this had no effect, which seems to make option 2 unlikely. So that leaves me with option 1, which I do not know how to analyze, or an option 3 that I have not identified yet.

cvkem commented 1 year ago

Further analysis revealed that my issue is option 3. I am using a mix of sync and async code. The three patterns described in https://tokio.rs/tokio/topics/bridging were difficult to apply in my context, so I tried to do it with block_on. However, calling tokio::runtime::Runtime::block_on from within the runtime is not allowed, so I assumed that futures::executor::block_on would call into the installed async runtime (Tokio). It seems, however, that it uses its own executor and bypasses Tokio, which probably causes the issues.

My alternative is using tokio::runtime::Handle::spawn to spawn a task and a oneshot channel to wait for the result. However, I still have some work to do to satisfy the Rust borrow checker with this solution. This feels like a complex solution to a simple problem, but I currently see no other options.
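
A minimal sketch of that pattern, assuming the synchronous caller runs on a thread outside the runtime's workers (e.g. inside spawn_blocking or a plain OS thread); async_upload is a hypothetical stand-in for the real S3 call:

use tokio::runtime::Handle;
use tokio::sync::oneshot;

async fn async_upload(_payload: Vec<u8>) -> Result<(), String> {
    // Hypothetical stand-in for the real async S3 upload.
    Ok(())
}

// Must be called from a thread that is not driving the runtime,
// otherwise blocking_recv() will panic.
fn upload_blocking(handle: &Handle, payload: Vec<u8>) -> Result<(), String> {
    let (tx, rx) = oneshot::channel();

    // Run the async work on the Tokio runtime instead of a foreign executor.
    handle.spawn(async move {
        let _ = tx.send(async_upload(payload).await);
    });

    // Block the current (non-worker) thread until the task reports back.
    rx.blocking_recv().map_err(|_| "upload task dropped the channel".to_string())?
}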

cvkem commented 1 year ago

My issues mentioned in January/February in this thread (related to mixing synchronous and asynchronous code) were resolved by https://github.com/cvkem/async_bridge, a generic solution I developed to make mixing the two easier. It helped me resolve my issues back then and has served me well during further development. I added documentation and learnings, and I hope it also helps others move forward.

jiacai2050 commented 1 year ago

if a worker is currently holding the IO driver and it's transitioning to start polling its own tasks, shouldn't it try to wake a parked worker to steal the IO driver?

I wonder why we don't create a dedicated thread to run the IO driver?

Noah-Kennedy commented 1 year ago

if a worker is currently holding the IO driver and it's transitioning to start polling its own tasks, shouldn't it try to wake a parked worker to steal the IO driver?

I wonder why we don't create a dedicated thread to run the IO driver?

Our current approach results in very little synchronization overhead and good tail latency under load. The background thread approach is much worse on both metrics, and is ultimately far slower and less effective for production workloads.

santossumii commented 1 year ago

Spent some time looking into this; sharing some findings/thoughts just in case.

Benign case MRE

Created a minimal reproducible example for the "benign case", i.e. where we see an impact on latency without a deadlock.

use std::time::Duration;
use std::time::Instant;

async fn handle_request(start_time: Instant) {
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;

        println!("background job start");
        // Adjust as needed
        for _ in 0..1_000 { vec![1; 1_000_000].sort() }
        println!("background job finished");
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async { background_job().await });

    loop {
        let start = Instant::now();
        tokio::spawn(async move {
            handle_request(start).await
        }).await.unwrap();
    }
}

Output:

request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 500ms
background job start
background job finished
request took 1620ms        <==== high-latency request
request took 501ms

Reproduced the issue on an arm server and on the Rust Playground.
I do not know what the Rust Playground runs on, but the server has 32 cores.

Possible solution 1

Handing over the driver before running the task seems to work. I'm not particularly familiar with the codebase, but after reading the comments above a couple of times and going through the code, I noticed that notify_parked_remote() seems to do what we need; i.e. it wakes up a worker, handing it the driver in passing. So I tried adding a call to notify_parked_remote() before the task runs, and it does work in the sense that "cargo test" is happy and I'm no longer able to reproduce the issue.

i.e. I made the change below at worker.rs:573.

        // Make the core available to the runtime context
        *self.core.borrow_mut() = Some(core);

+      // Probably the wrong way to do this
+      self.worker.handle.notify_parked_remote();
        // Run the task
       ...

Possible solution 2

As mentioned in other comments, creating a background thread that periodically wakes workers does mitigate the issue (as a workaround).

Random thoughts

(not particularly well informed)

From the comments above, I think everyone agrees that "this is not a good thing"; perhaps it is a sacrifice made to get some other (good) properties in return. Without knowing how much this affects real-world workloads, it's hard to say how big of an issue it is. Overall, I think this behavior is undesirable and counter-intuitive for a work-stealing scheduler. I guess we could have a separate issue called "one benign task can halt all executor progress for a while".

I wonder if the tracing feature or tokio-metrics provides some visibility into whether this issue is affecting a given workload. It's very possible that I missed such a metric/span/event, but this might be a blind spot. In the past I've seen some high-latency requests that I could not explain; it would be useful to have a metric we could use to assess the concrete impact of this issue in the real world: a proxy for how much time the IO driver is both 1) ready and 2) held by a worker that is running a task.

I tried the multi-threaded-alt scheduler and the issue is also present there, but I wonder if the alt scheduler would be the right place to ship a fix for this issue. The alt scheduler is only available on tokio_unstable, which means it's already behind an option. (Just an idea; I don't have much context here.)

exabytes18 commented 11 months ago

I'll +1 this.

This is a pretty ugly performance wart. Our team has independently hit this issue twice, where some singular task takes longer than the developer expects and this results in 100ms+ of latency for API requests that we otherwise expect to complete very quickly.

The fact that one pathological task is able to harm a bunch of otherwise good API requests is alarming behavior.

lewiszlw commented 11 months ago

The test code pasted by @santossumii, ported to the async-std runtime:

use std::time::{Duration, Instant};

fn main() {
    async_std::task::block_on(async {
        async_std::task::spawn(async { background_job().await });

        loop {
            let start = Instant::now();
            async_std::task::spawn(async move { handle_request(start).await }).await;
        }
    });
}

async fn handle_request(start_time: Instant) {
    async_std::task::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        async_std::task::sleep(Duration::from_secs(3)).await;

        println!("background job start");
        // Adjust as needed
        for _ in 0..1_000 {
            vec![1; 1_000_000].sort()
        }
        println!("background job finished");
    }
}

output

request took 500ms
request took 501ms
request took 501ms
background job start
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 501ms
request took 500ms
background job finished
request took 501ms
request took 500ms

With async-std, the heavy compute task doesn't halt the other tasks. I don't know Tokio's implementation, but this seems like a big problem, since users are very likely to hit this issue.

Congyuwang commented 11 months ago

Besides using spawn_blocking to move the work to another thread, using block_in_place also works:

use std::time::Duration;
use std::time::Instant;

async fn handle_request(start_time: Instant) {
    tokio::time::sleep(Duration::from_millis(500)).await;
    println!("request took {}ms", start_time.elapsed().as_millis());
}

async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;

        println!("background job start");
        // Adjust as needed
        tokio::task::block_in_place(|| {
            for _ in 0..1_000 {
                vec![1; 1_000_000].sort()
            }
        });
        println!("background job finished");
    }
}

#[tokio::main]
async fn main() {
    tokio::spawn(async { background_job().await });

    loop {
        let start = Instant::now();
        tokio::spawn(async move { handle_request(start).await })
            .await
            .unwrap();
    }
}

Output:

request took 502ms
request took 502ms
request took 502ms
background job start
request took 500ms
background job finished
request took 502ms
request took 502ms
request took 501ms
request took 502ms
request took 502ms
request took 501ms
background job start
request took 502ms
background job finished
request took 502ms
request took 502ms

The documentation of block_in_place says:

    /// Calling this function informs the
    /// executor that the currently executing task is about to block the thread,
    /// so the executor is able to hand off any other tasks it has to a new
    /// worker thread before that happens.

If there are more block_in_place calls running than CPU cores, there will still be large latencies, but that is impossible to avoid, as it is fundamentally caused by limited CPU resources.

I don't know how async-std deals with blocking tasks. For Tokio, the responsibility for dealing with blocking tasks is on the programmer.
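
For comparison, a minimal sketch of the spawn_blocking variant of the same background job; the request loop stays as in the example above:

use std::time::Duration;

async fn background_job() {
    loop {
        tokio::time::sleep(Duration::from_secs(3)).await;

        println!("background job start");
        // Run the CPU-heavy work on the blocking thread pool so the async
        // worker threads stay free to poll other tasks.
        tokio::task::spawn_blocking(|| {
            for _ in 0..1_000 {
                vec![1; 1_000_000].sort()
            }
        })
        .await
        .expect("blocking task panicked");
        println!("background job finished");
    }
}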

lewiszlw commented 11 months ago

Good workaround. But we cannot control our third-party libs if they have bad tasks.

yuyang-ok commented 11 months ago

is tokio going to fix this or what?

Darksonn commented 11 months ago

Look, it isn't so simple. All potential solutions have disadvantages: they hurt the performance of well-behaved programs and also raise questions about forwards compatibility.

If you want to help, you could investigate how other runtimes handle this problem. For example, what does (did?) Go do if the thread with the epoll instance gets blocked by a tight loop without yield points, and all other threads are asleep? (Runtimes that poll IO from a dedicated thread are not relevant. We are not going to do that.)

exabytes18 commented 11 months ago

This feels somewhat similar to Spectre and Meltdown where Intel was "caught speeding" -- their optimizations were too aggressive for the happy-path (i.e. CPU only loaded with trusted, well-behaved programs) and that opened the door to other issues. Similarly, everything is great with tokio, until it's not.

As is common in software engineering, it's not so much the performance or operation of a well-behaved program that matters; it's dealing with the exceptional cases or nefarious actors that requires so much thought. That might be where we're at here.

It would be interesting to know how much of a performance hit the potential solutions have and if it would be possible to feature-flag it. We could benchmark our own app, though it's probably not representative of other workloads, and we wouldn't be able to take this work item until the new year.

Darksonn commented 11 months ago

This problem is something you get "by default" if you do it the obvious way. It isn't because we have an extra optimization that is invalid.

and we wouldn't be able to take this work item until the new year.

You should not expect a fix for this over the holidays anyway. It's nowhere near that urgent. And there is even the workaround of spawning your own monitoring thread, as I mentioned in a previous comment.

carllerche commented 9 months ago

As @Darksonn mentioned, there is no obvious solution to this that comes without additional overhead. Fundamentally, Tokio is a cooperative multi-tasking system, which puts some responsibility on the user. As such, I will close this issue, as there is no "bug" in Tokio.

That said, there would be value in detecting and possibly mitigating poorly behaved tasks on an opt-in basis. I opened #6315 as a tracking issue and referenced this issue. This will help reframe the conversation.