tokio-rs / tokio-core

I/O primitives and event loop for async I/O in Rust
Apache License 2.0
638 stars 115 forks source link

Updating to 0.1.17 causes a hang relating to timeouts #322

Closed justinlatimer closed 6 years ago

justinlatimer commented 6 years ago

It looks like updating tokio-core to 0.1.17 causes one of the integration tests in trust-dns relating to timeouts to hang.

Steps to reproduce:

  1. Checkout https://github.com/bluejekyll/trust-dns
  2. Run scripts/run_tests.sh (this should work, tokio-core is 0.1.16 in Cargo.lock
  3. cargo update -p tokio-core
  4. Run scripts/run_tests.sh - this test hangs: https://github.com/bluejekyll/trust-dns/blob/master/integration-tests/tests/client_future_tests.rs#L897
carllerche commented 6 years ago

/cc @srijs Any chance I can convince you to take a look? 😸

bluejekyll commented 6 years ago

So the big change looks to be the timer integration. I know that trust-dns has tickled various areas of timers in the past.

The specific test that's hanging is indeed timeout related, it can be run with:

$> cargo test --manifest-path integration-tests/Cargo.toml --test client_future_tests test_timeout_query_nonet

Some of the other timeout tests are passing, so it's interesting this mocked one is failing.

bluejekyll commented 6 years ago

Ok, I tracked this down to where the hang is occuring: https://github.com/bluejekyll/trust-dns/blob/master/proto/src/xfer/dns_future.rs#L217-L224

The last time we had an issue in this area it was because the Timeout wasn't being polled, in this case we end up in an infinite loop polling and always getting NotReady. Looking through the code, I don't see any obvious places where the Timeout poll result is accidentally being dropped/ignored.

srijs commented 6 years ago

I‘m happy to take a look later today!

On Thu, Apr 12, 2018, at 12:29 PM, Benjamin Fry wrote:

Ok, I tracked this down to where the hang is occuring: https://github.com/bluejekyll/trust-dns/blob/master/proto/src/xfer/dns_future.rs#L217-L224> The last time we had an issue in this area it was because the Timeout> wasn't being polled, in this case we end up in an infinite loop polling> and always getting NotReady. Looking through the code, I don't see any> obvious places where the Timeout poll result is accidentally being dropped/ignored.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/tokio-rs/tokio-core/issues/322#issuecomment-380655019

srijs commented 6 years ago

I think I'm getting an idea of what the problem is.

Setting a breakpoint on Timer::process_queue reveals that it is never called, because the executor never actually parks the timer. Since it's not doing that, the timer never gets to actually insert the queued entry into the timer wheel.

So why is the timer never parked? I suspect it's due to the impl Stream for NeverReturnsClientStream, which calls task::current().notify() on every poll. IIUC that will schedule a notification to be processed directly after poll returns, therefore the executor will always be busy processing/polling and never park the thread.

Looking at the old implementation of the timer wheel in tokio-core, it seems that consume_timeouts was called on every turn of the event loop, and not just when parking, which would prevent this scenario where a constantly notifying task effectively starves the timeout processing.

Replicating this behaviour with the new tokio-timer might require exposing a non-blocking variant of Timer::turn, which could be used by tokio-core (simiar to how the Timer API in futures-timer works). However this would effectively render the clever Park implementation useless, imo.

The alternative I suppose would be to document the fact that a high notify volume can starve timers, but I'm not sure how desirable this is?

carllerche commented 6 years ago

Ah thanks for looking into this. They way you describe the problem makes total sense.

I believe that this PR would actually fix the issue https://github.com/tokio-rs/tokio/pull/313. It would result in Park being called each iteration which will fire timers.

srijs commented 6 years ago

Yeah, that PR looks like it could be useful here! I'll try to test it out later.

bluejekyll commented 6 years ago

I’ve been thinking about this, and I’m pretty concerned at the moment. I use the notify() solution for rescheduling futures in a lot of different areas. In most cases there is usually a poll on some io event that eventually breaks that loop, but I’ll have to review every usecase at this point.

In the specific test in question, I can’t think of a decent alternative without writing a ton more code. Perhaps the mock should instead of notify(), schedule a timer to poll on... that should fix this issue.

But I want to understand something, do we currently believe 0.1.17 to be flawed or not?

edit: I have successfully worked around this by using a Timeout in the mock object instead of the task::current().notify()

justinlatimer commented 6 years ago

I've tried to apply the patch from tokio-rs/tokio#313 using the following my Cargo.toml:

[replace]
"tokio-executor:0.1.2" = { git = "https://github.com/sdroege/tokio.git", branch="fair-turn" }

The test still hung with the patch applied, unless there was something else I needed to modify.

bluejekyll commented 6 years ago

https://github.com/bluejekyll/trust-dns/pull/402 should work, doesn't need the "fair-turn" patch.

carllerche commented 6 years ago

Ok, I think this should be resolved now? If not, please let me know.