Closed drmingdrmer closed 2 years ago
Thanks for the report. It seems like this is exposing both a problem in Tokio and a problem in h2.
shutdown here for tasks that are already on the runtime.

FuturesUnordered can introduce their own wakers, and those wakers might hold a ref-count on the future they are going to wake. If the waker being woken consumes the last ref-count, this would cause that future to get dropped, which is exactly what happened here.

Alice is currently working to create a repro for the Tokio side of the bug.
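To make the ref-count point concrete, here is a minimal std-only sketch. It is not the code from Tokio, h2, or FuturesUnordered; `FakeFuture` and `OwningWaker` are hypothetical names invented for illustration. It only shows that calling `wake()` on a custom waker can run a future's destructor on the spot when that waker holds the last reference.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

/// Hypothetical stand-in for a future; it sets a flag when it is dropped.
struct FakeFuture {
    dropped: Arc<AtomicBool>,
}

impl Drop for FakeFuture {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Hypothetical custom waker that owns the only reference to the future.
struct OwningWaker {
    slot: Mutex<Option<Arc<FakeFuture>>>,
}

impl Wake for OwningWaker {
    fn wake(self: Arc<Self>) {
        // Take the reference out of the slot; the temporary lock guard is
        // released at the end of this statement.
        let last = self.slot.lock().unwrap().take();
        // Releasing the last ref-count runs FakeFuture::drop right here,
        // inside the call to wake().
        drop(last);
    }
}

/// Returns the "dropped" flag before and after waking.
fn wake_drops_future() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let fut = Arc::new(FakeFuture { dropped: dropped.clone() });
    let waker = Waker::from(Arc::new(OwningWaker {
        slot: Mutex::new(Some(fut)),
    }));

    let before = dropped.load(Ordering::SeqCst);
    waker.wake();
    let after = dropped.load(Ordering::SeqCst);
    (before, after)
}

fn main() {
    let (before, after) = wake_drops_future();
    println!("dropped before wake: {before}, after wake: {after}");
}
```

If the caller of `wake()` is holding a lock that the destructor also needs, the destructor runs while that lock is still held.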
@drmingdrmer Are you able to post a complete code example that runs into the deadlock?
It looks like we have a repro
A fix has been posted as a PR: https://github.com/tokio-rs/tokio/pull/3870. Are you able to verify that it fixes your deadlock?
@carllerche
Confirmed that this issue is fixed, thanks.
We have also fixed it on our side with this change: https://github.com/datafuselabs/datafuse/pull/841/files
@drmingdrmer Are you able to post a complete code example that runs into the deadlock?
https://github.com/datafuselabs/datafuse/pull/839/checks?check_run_id=2836068821
This is our CI run that has this problem. Clone the repo, check out b7e6bb13772104cb2544bb200f0ce1b95f247932, and run cargo test; the deadlock shows up.
So is there anything more to do here?
@nox I may have missed it, but as far as I know the second point in https://github.com/hyperium/h2/issues/546#issuecomment-864134808 has not yet been addressed in h2.
@nox Thanks for the reminder!
tokio fixed this issue in: https://github.com/tokio-rs/tokio/pull/3870
Let's close it.
With the latest tokio 1.7.0 there is a deadlock when the tokio runtime is shutting down.
lib versions:
The problem
The problematic code is a unit test that brings up a gRPC server and a client that keeps sending RPCs to the server from another tokio task.
When the test quits (and the tokio runtime is shutting down), the task that keeps sending RPCs is still running. A deadlock then hangs the whole process, and it never exits.
The same code works fine with tokio 1.6.0. A new feature was added in 1.7.0, https://github.com/tokio-rs/tokio/pull/3752, which I believe causes this problem.
The detail
The deadlock happens when the tokio runtime is shutting down and tries to drop a stream: in src/proto/streams/streams.rs, it acquires the lock of the stream to do some cleanup.

Then, while holding the lock me, maybe_cancel() tries to wake up the task this stream belongs to.

Because the tokio runtime is closed, another round of dropping happens. Finally, in src/proto/connection.rs, it tries to acquire the same lock again to release resources, and deadlocks.

All of this happens in one thread with tokio 1.7.0.
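The same-thread re-lock described above can be sketched with the standard library alone. This is an illustration under assumptions, not h2's code: `Inner`, `Handle`, and `drop_handle` are hypothetical names, and the destructor uses `try_lock()` instead of `lock()` so that the example reports the conflict rather than hanging. With a plain `lock()` on `std::sync::Mutex`, a second lock attempt by the thread that already holds the lock never returns (it may deadlock or panic).

```rust
use std::sync::{Arc, Mutex, TryLockError};

/// Hypothetical stand-in for the shared stream state behind one lock.
struct Inner {
    released: u32,
}

/// Hypothetical handle whose destructor needs the same lock for cleanup.
struct Handle {
    inner: Arc<Mutex<Inner>>,
    /// Set to true when the destructor finds the lock already held.
    would_block: Arc<Mutex<bool>>,
}

impl Drop for Handle {
    fn drop(&mut self) {
        match self.inner.try_lock() {
            // Lock was free: cleanup can proceed.
            Ok(mut guard) => guard.released += 1,
            // Lock is held (here: by this very thread). A blocking lock()
            // at this point would never return.
            Err(TryLockError::WouldBlock) => {
                *self.would_block.lock().unwrap() = true;
            }
            Err(TryLockError::Poisoned(_)) => {}
        }
    }
}

/// Drops the handle either after releasing the lock or while holding it.
/// Returns (destructor found the lock held, number of cleanups done).
fn drop_handle(release_lock_first: bool) -> (bool, u32) {
    let inner = Arc::new(Mutex::new(Inner { released: 0 }));
    let blocked = Arc::new(Mutex::new(false));
    let handle = Handle {
        inner: inner.clone(),
        would_block: blocked.clone(),
    };

    let guard = inner.lock().unwrap();
    if release_lock_first {
        drop(guard);
        drop(handle);
    } else {
        // The destructor runs while `guard` is still alive, which mirrors
        // the wake-while-locked sequence in the report.
        drop(handle);
        drop(guard);
    }

    let was_blocked = *blocked.lock().unwrap();
    let released = inner.lock().unwrap().released;
    (was_blocked, released)
}

fn main() {
    println!("drop while locked:  {:?}", drop_handle(false));
    println!("drop after unlock: {:?}", drop_handle(true));
}
```

Releasing the guard before triggering anything that may run destructors (such as a wake) is one way to avoid the re-entrant acquisition; whether that matches the eventual fix in either library is not stated in this thread.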
Stack summary at the time of the deadlock:
The entire backtrace (the first lock acquisition is at frame 53):