tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License
25.44k stars 2.3k forks source link

Internal panics in `CurrentThread::block_on()` under WSLv1 #6620

Open mqudsi opened 1 month ago

mqudsi commented 1 month ago

Version

tokio v1.38.0

Platform

WSL installation of Ubuntu 23.10 Linux 4.4.0-19041-Microsoft #4355-Microsoft Thu Apr 12 17:37:00 PST 2024 x86_64 x86_64 x86_64 GNU/Linux

Description

I have a project using tokio for async process support and it has been rock-solid and stable for years, but has recently begun to crash deep inside calls to Runtime::block_on() (using the current_thread runtime), mostly (see below) with the following error:

Message: unexpected error when polling the I/O driver: Os { code: 14, kind: Uncategorized, message: "Bad address" }

It's a large project and next to impossible to isolate this issue, but knowing there would be a suggestion this is the application itself at fault, I compiled with ASAN enabled for rust and the C dependencies, but no issues were reported. (Running under miri is impossible and out of the question). I was able to reproduce the crash with a debug ASAN build (previous errors were compiled as release), but this time the error was

Message:  unexpected error when polling the I/O driver: Os { code: 22, kind: InvalidInput, message: "Invalid argument" }

I have attached the full stack trace for both exceptions:

Darksonn commented 1 month ago

I'm thinking of #6424. Do you have records of which packages were upgraded since it became unstable? Or whether you changed rustc version?

mqudsi commented 1 month ago

Thanks for the quick reply! I check in my Cargo.lock so I can actually answer that question. I've attached the diff of a commit from June 2nd, which was also the same time I rebuilt the binary (with a newer rustc) since perhaps February 2023; and I'm pretty sure that's when I started seeing the crashes. I would have previously been using the nightly compiler from ~around that time vs rustc 1.80.0-nightly (faefc618c 2024-05-07) I'm using now.

Cargo.lock.diff.txt

Darksonn commented 1 month ago

It's very likely to be caused by either the upgrade of rustc, or the upgrade of a dependency. It would be extremely helpful if you could help us narrow it down; as you can see, other people have also reported problems, and I do not have access to any way to actually reproduce it.

To start off, would you be able to revert the change to Cargo.lock but keep the compiler upgrade and see whether that makes the problem go away?

mqudsi commented 1 month ago

Sure, that's doable. I've reverted the changes to Cargo.lock and made sure to install the binary with --locked. Note that I did have to upgrade the following packages as they no longer compile under rustc 1.80.0-nightly, but they're not dependencies of tokio so I don't think that would be a problem:

mqudsi commented 1 month ago

Ok, it crashed with SIGSEGV, which is new. I'll try again with ASAN.

mqudsi commented 1 month ago

Running under ASAN doesn't report any memory corruption our out-of-bound reads/writes until something (tokio, I think) tries to access an invalid address:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==19006==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000050 (pc 0x7f832464c596 bp 0x7fffc0710a20 sp 0x7fffc0710910 T0)
==19006==The signal is caused by a READ memory access.
==19006==Hint: address points to the zero page.
    #0 0x7f832464c596  (/home/mqudsi/.cargo/bin/dedup-rs+0x8ab596) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #1 0x7f832460ec85  (/home/mqudsi/.cargo/bin/dedup-rs+0x86dc85) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #2 0x7f83246123c3  (/home/mqudsi/.cargo/bin/dedup-rs+0x8713c3) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #3 0x7f832461e1a5  (/home/mqudsi/.cargo/bin/dedup-rs+0x87d1a5) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #4 0x7f832463fcea  (/home/mqudsi/.cargo/bin/dedup-rs+0x89ecea) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #5 0x7f832464b60d  (/home/mqudsi/.cargo/bin/dedup-rs+0x8aa60d) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #6 0x7f832463fd3b  (/home/mqudsi/.cargo/bin/dedup-rs+0x89ed3b) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #7 0x7f832465dff0  (/home/mqudsi/.cargo/bin/dedup-rs+0x8bcff0) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #8 0x7f832465d3b3  (/home/mqudsi/.cargo/bin/dedup-rs+0x8bc3b3) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #9 0x7f83243dd266  (/home/mqudsi/.cargo/bin/dedup-rs+0x63c266) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #10 0x7f83243c392a  (/home/mqudsi/.cargo/bin/dedup-rs+0x62292a) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #11 0x7f83243805cb  (/home/mqudsi/.cargo/bin/dedup-rs+0x5df5cb) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #12 0x7f83243882c3  (/home/mqudsi/.cargo/bin/dedup-rs+0x5e72c3) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #13 0x7f83243c80f0  (/home/mqudsi/.cargo/bin/dedup-rs+0x6270f0) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #14 0x7f83243bea21  (/home/mqudsi/.cargo/bin/dedup-rs+0x61da21) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #15 0x7f8324420653  (/home/mqudsi/.cargo/bin/dedup-rs+0x67f653) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #16 0x7f8324b5611c  (/home/mqudsi/.cargo/bin/dedup-rs+0xdb511c) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #17 0x7f83244205f2  (/home/mqudsi/.cargo/bin/dedup-rs+0x67f5f2) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)
    #18 0x7f8323a3814f  (/lib/x86_64-linux-gnu/libc.so.6+0x2814f) (BuildId: 6a981b07a3731293c24c10a21397416d3c3d52ed)
    #19 0x7f8323a38208  (/lib/x86_64-linux-gnu/libc.so.6+0x28208) (BuildId: 6a981b07a3731293c24c10a21397416d3c3d52ed)
    #20 0x7f83242b7484  (/home/mqudsi/.cargo/bin/dedup-rs+0x516484) (BuildId: f27f352a6f71c605da9d40af3da2c0c5ab5befa0)

The ASAN stack trace isn't symbolized (though the release profile specifies debug=2), so here's a symbolification of it:

#0 0x8ab596: src/util/slab.rs:231 starts at address 0x8ab574 tokio::util::slab::Slab<T>::get+99 and ends at 0x8ab599 tokio::util::slab::Slab<T>::get+136
#1 0x86dc85: src/runtime/io/mod.rs:196 starts at address 0x86dc7e tokio::runtime::io::Driver::turn+577 and ends at 0x86dc8c tokio::runtime::io::Driver::turn+591
#2 0x8713c3: src/runtime/io/mod.rs:227 starts at address 0x8713b7 <tokio::signal::unix::driver::Driver as tokio::park::Park>::park+4 and ends at 0x8713c7 <tokio::signal::unix::driver::Driver as tokio::park::Park>::park+20
#3 0x87d1a5: src/process/unix/driver.rs:44 starts at address 0x87d1a0 <tokio::process::imp::driver::Driver as tokio::park::Park>::park+7 and ends at 0x87d1ae <tokio::process::imp::driver::Driver as tokio::park::Park>::park+21
#4 0x89ecea: src/park/either.rs:30 starts at address 0x89ece5 <tokio::park::either::Either<A,B> as tokio::park::Park>::park+46 and ends at 0x89eceb <tokio::park::either::Either<A,B> as tokio::park::Park>::park+52
#5 0x8aa60d: src/time/driver/mod.rs:238 starts at address 0x8aa608 tokio::time::driver::Driver<P>::park_internal+541 and ends at 0x8aa619 tokio::time::driver::Driver<P>::park_internal+558
#6 0x89ed3b: src/time/driver/mod.rs:436 starts at address 0x89ed31 <tokio::park::either::Either<A,B> as tokio::park::Park>::park+49 and ends at 0x89ed4a <tokio::park::either::Either<A,B> as tokio::park::Park>::park+74
#7 0x8bcff0: src/runtime/driver.rs:198 starts at address 0x8bcfeb tokio::runtime::scheduler::current_thread::Context::enter+292 and ends at 0x8bcff1 tokio::runtime::scheduler::current_thread::Context::enter+298
#8 0x8bc3b3: src/runtime/scheduler/current_thread.rs:307 starts at address 0x8bc3a8 tokio::runtime::scheduler::current_thread::Context::park+465 and ends at 0x8bc3bc tokio::runtime::scheduler::current_thread::Context::park+485
#9 0x63c266: /home/mqudsi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.21.0/src/runtime/scheduler/current_thread.rs:554 starts at address 0x63c25e tokio::runtime::scheduler::current_thread::CurrentThread::block_on+3154 and ends at 0x63c279 tokio::runtime::scheduler::current_thread::CurrentThread::block_on+3181
#10 0x62292a: /home/mqudsi/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.21.0/src/runtime/mod.rs:490 starts at address 0x62291c tokio::runtime::Runtime::block_on+711 and ends at 0x62293c tokio::runtime::Runtime::block_on+743
#11 0x5df5cb: src/dedup.rs:105 starts at address 0x5df5ba <dedup_rs::db::FuzzyDuplicate as dedup_rs::dedup::DuplicateItem>::menu_line+1606 and ends at 0x5df5d0 <dedup_rs::db::FuzzyDuplicate as dedup_rs::dedup::DuplicateItem>::menu_line+1628
#12 0x5e72c3: src/dedup.rs:315 starts at address 0x5e72b8 dedup_rs::dedup::deduplicate_fuzzy+7656 and ends at 0x5e72cb dedup_rs::dedup::deduplicate_fuzzy+7675
#13 0x6270f0: src/main.rs:223 starts at address 0x6270e6 dedup_rs::main+6066 and ends at 0x6270fd dedup_rs::main+6089
#14 0x61da21: /rustc/faefc618cf48bd794cbc808448df1bf3f59f36af/library/core/src/ops/function.rs:250 starts at address 0x61da20 std::sys_common::backtrace::__rust_begin_short_backtrace+1 and ends at 0x61da22 std::sys_common::backtrace::__rust_begin_short_backtrace+3
#15 0x67f653: /rustc/faefc618cf48bd794cbc808448df1bf3f59f36af/library/std/src/rt.rs:159 starts at address 0x67f64f std::rt::lang_start::{{closure}}+20 and ends at 0x67f654 std::rt::lang_start::{{closure}}+25
#16 0xdb511c: library/core/src/ops/function.rs:284 starts at address 0xdb511a std::rt::lang_start_internal+1082 and ends at 0xdb511d std::rt::lang_start_internal+1085
#17 0x67f5f2: /rustc/faefc618cf48bd794cbc808448df1bf3f59f36af/library/std/src/rt.rs:158 starts at address 0x67f5e6 std::rt::lang_start+191 and ends at 0x67f5f3 std::rt::lang_start+204
Darksonn commented 1 month ago

In that case it sounds like a compiler bug. If you try the old compiler again, does the problem go away? If so, then you should be able to bisect to the nightly that introduced the problem by checking only 7 compilers.

mqudsi commented 1 month ago

Unfortunately easier said than done. It's an interactive application that requires user input and performs non-repeatable work. The crashes generally don't happen right away (though that has occurred); I'll probably need to try and script it so that I can leave it to run automated for some amount of time and see if it segfaults. I don't know if the idle "waiting for user input" state contributes to the bug, in which case it might not reproduce under automation. But we'll see.

mqudsi commented 1 month ago

I don't think it's a compiler regression; I've managed to reproduce it with the nightly-2022-09-01 toolchain. It reproduces under ASAN but not under valgrind (so probably timing dependent) but neither complain of any memory corruption. I commented out a huge chunk of the code in the process of automating and speeding up the bisection runs and it still happens. The stack trace from tokio 1.21 is attached. (2023-02-01 tokio 1.21 crash.txt)

It's either an invalid argument or bad address tokio error under 1.38 or event store missing under 1.21 (or sigsegv without a tokio message, but still in the tokio stack). Haven't seen any crashes outside the tokio stack trace, but obviously that isn't definitive proof of anything.

As mentioned, this is under WSLv1. Is there anything I could try to debug the tokio internals here?

mqudsi commented 4 weeks ago

I was able to compile the same binary on the same machine and run it against the same workload but as a Windows executable rather than as a Linux one (thanks, WSL) and run that for an extended test without any issues, so it seems to be something in the tokio's unix code that is triggering this. (My own code is ~the same between Windows and Linux). Also for what it's worth, this is a machine with ECC ram.

Darksonn commented 4 weeks ago

Are you able to find any configuration on Linux where the crash doesn't happen?