Closed justinlatimer closed 6 years ago
/cc @srijs Any chance I can convince you to take a look? 😸
So the big change looks to be the timer integration. I know that trust-dns has tickled various areas of timers in the past.
The specific test that's hanging is indeed timeout related, it can be run with:
$> cargo test --manifest-path integration-tests/Cargo.toml --test client_future_tests test_timeout_query_nonet
Some of the other timeout tests are passing, so it's interesting this mocked one is failing.
Ok, I tracked this down to where the hang is occuring: https://github.com/bluejekyll/trust-dns/blob/master/proto/src/xfer/dns_future.rs#L217-L224
The last time we had an issue in this area it was because the Timeout wasn't being polled, in this case we end up in an infinite loop polling and always getting NotReady. Looking through the code, I don't see any obvious places where the Timeout poll result is accidentally being dropped/ignored.
I‘m happy to take a look later today!
On Thu, Apr 12, 2018, at 12:29 PM, Benjamin Fry wrote:
Ok, I tracked this down to where the hang is occuring: https://github.com/bluejekyll/trust-dns/blob/master/proto/src/xfer/dns_future.rs#L217-L224> The last time we had an issue in this area it was because the Timeout> wasn't being polled, in this case we end up in an infinite loop polling> and always getting NotReady. Looking through the code, I don't see any> obvious places where the Timeout poll result is accidentally being dropped/ignored.
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/tokio-rs/tokio-core/issues/322#issuecomment-380655019
I think I'm getting an idea of what the problem is.
Setting a breakpoint on Timer::process_queue
reveals that it is never called, because the executor never actually parks the timer. Since it's not doing that, the timer never gets to actually insert the queued entry into the timer wheel.
So why is the timer never parked? I suspect it's due to the impl Stream
for NeverReturnsClientStream
, which calls task::current().notify()
on every poll. IIUC that will schedule a notification to be processed directly after poll
returns, therefore the executor will always be busy processing/polling and never park the thread.
Looking at the old implementation of the timer wheel in tokio-core
, it seems that consume_timeouts
was called on every turn of the event loop, and not just when parking, which would prevent this scenario where a constantly notifying task effectively starves the timeout processing.
Replicating this behaviour with the new tokio-timer
might require exposing a non-blocking variant of Timer::turn
, which could be used by tokio-core
(simiar to how the Timer
API in futures-timer
works). However this would effectively render the clever Park
implementation useless, imo.
The alternative I suppose would be to document the fact that a high notify volume can starve timers, but I'm not sure how desirable this is?
Ah thanks for looking into this. They way you describe the problem makes total sense.
I believe that this PR would actually fix the issue https://github.com/tokio-rs/tokio/pull/313. It would result in Park being called each iteration which will fire timers.
Yeah, that PR looks like it could be useful here! I'll try to test it out later.
I’ve been thinking about this, and I’m pretty concerned at the moment. I use the notify() solution for rescheduling futures in a lot of different areas. In most cases there is usually a poll on some io event that eventually breaks that loop, but I’ll have to review every usecase at this point.
In the specific test in question, I can’t think of a decent alternative without writing a ton more code. Perhaps the mock should instead of notify(), schedule a timer to poll on... that should fix this issue.
But I want to understand something, do we currently believe 0.1.17 to be flawed or not?
edit: I have successfully worked around this by using a Timeout
in the mock object instead of the task::current().notify()
I've tried to apply the patch from tokio-rs/tokio#313 using the following my Cargo.toml
:
[replace]
"tokio-executor:0.1.2" = { git = "https://github.com/sdroege/tokio.git", branch="fair-turn" }
The test still hung with the patch applied, unless there was something else I needed to modify.
https://github.com/bluejekyll/trust-dns/pull/402 should work, doesn't need the "fair-turn" patch.
Ok, I think this should be resolved now? If not, please let me know.
It looks like updating
tokio-core
to 0.1.17 causes one of the integration tests in trust-dns relating to timeouts to hang.Steps to reproduce:
scripts/run_tests.sh
(this should work,tokio-core
is 0.1.16 inCargo.lock
cargo update -p tokio-core
scripts/run_tests.sh
- this test hangs: https://github.com/bluejekyll/trust-dns/blob/master/integration-tests/tests/client_future_tests.rs#L897