tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License
26.79k stars 2.47k forks source link

Panic in `TimerHandle::mark_pending` #5989

Open lkts opened 1 year ago

lkts commented 1 year ago

Version 1.29.1

Platform 4.14.322-244.536.amzn2.x86_64

Description We are observing a panic in StateCell::mark_pending https://github.com/tokio-rs/tokio/blob/a6be73eecbb2646549182443decdd433fb791ccf/tokio/src/runtime/time/entry.rs#L183

Stack trace is:

tokio::runtime::time::wheel::Wheel::poll

tokio::runtime::time::<impl tokio::runtime::time::handle::Handle>::process_at_time

tokio::runtime::time::Driver::park_internal

tokio::runtime::scheduler::multi_thread::park::Parker::park

tokio::runtime::scheduler::multi_thread::worker::Context::park_timeout

tokio::runtime::scheduler::multi_thread::worker::Context::run

tokio::runtime::context::scoped::Scoped<T>::set

tokio::runtime::context::runtime::enter_runtime

tokio::runtime::scheduler::multi_thread::worker::run

tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut

tokio::runtime::task::core::Core<T,S>::poll

tokio::runtime::task::harness::Harness<T,S>::poll

tokio::runtime::blocking::pool::Inner::run

It is not yet clear what impact it causes. We observe this panic infrequently - single occurrence in multiple days.

Darksonn commented 1 year ago

I can try to look at the code to see if I can find any problem, but if you have any more information, then that would be very helpful.

lkts commented 1 year ago

Unfortunately that's pretty much all i have. I know about this panic because i have a panic hook and noticed it in logs, it's hard to correlate it to any specific event.

One thing i noticed in the code is that when we set expiration of the timer we check validity of expiration timestamp but only with a debug assert. https://github.com/tokio-rs/tokio/blob/a6be73eecbb2646549182443decdd433fb791ccf/tokio/src/runtime/time/entry.rs#L243

If i understand correctly we would have more useful data from this assert rather than from current one.

Darksonn commented 1 year ago

Would you be able to change it to an assert! in your environment to try and get more information on the error?

carllerche commented 1 year ago

@Darksonn we probably should add more details in our assert and also change those debug asserts to asserts for the time being.

seeekr commented 2 weeks ago

We're observing these with high probability after an uptime of 10-14 days of our services. In our case this takes down the entire service and it needs to be restarted.

seeekr commented 2 weeks ago

And perhaps a dumb question, but: If we don't have any unsafe code in our crates, this assert can only trigger from a dependency or if there's some bug in tokio's timer handling? Trying to figure out what might be the root cause of this issue and how to resolve it. It'd be pretty great to not have to restart services every <10 days.

Darksonn commented 2 weeks ago

I can be a third-party dependency with incorrect unsafe code. For example, if futures are moved while pinned, it can lead to this error. But that can only happen if someone uses incorrect unsafe code.

Of course it can also be a Tokio bug.

What dependencies are you using?