Open hds opened 9 months ago
Hmm, do we know why the wake operation is no longer inside the task's span? What changed upstream in tokio
that resulted in the task span being exited before the wakeup?
Edit: oh haha i can't read, i missed the first paragraph that references tokio-rs/tokio#5223
As a side note, it would be really cool if we had some way to run automated tests against new tokio
releases, in order to proactively detect issues where upstream changes break our instrumentation. I wonder if we could get some kind of webhook that could trigger a GitHub Actions workflow when a new Tokio release is published. This might be a bit of a big project, and it probably deserves its own tracking ticket, but I figured I'd mention it here.
Thinking about this...I honestly think it might be okay to just exclude task::yield_now()
from the self-wake warning, because of the changes in tokio-rs/tokio#5223.
The primary reason we want to warn about tasks self-waking is that excessive self-wakes can result in a task starving the runtime by waking itself, getting put in the LIFO slot, and immediately executing again. Since tokio-rs/tokio#5223 changes yield_now()
to wake the task to a separate queue that gets polled only after tasks woken by IO, timers, and synchronization primitives, it no longer presents as much of a risk of starving the runtime, compared to user-implemented self-wakes (e.g. a Future::poll
implementation that immediately wakes its own Waker
). Honestly, it might be fine for yield_now()
to not trigger the self-wake lint after Tokio 1.23.
@hds what do you think?
@hawkw that's a good point regarding yield_now()
not being problematic after the defer changes.
In that case it's more a question of changing the tests and documenting the limitation in self wake counting. I'll do that and then we can probably close this issue.
Thanks!
Sounds good to me, thanks for flagging this!
As a side note, it would be really cool if we had some way to run automated tests against new tokio releases, in order to proactively detect issues where upstream changes break our instrumentation. I wonder if we could get some kind of webhook that could trigger a GitHub Actions workflow when a new Tokio release is published. This might be a bit of a big project, and it probably deserves its own tracking ticket, but I figured I'd mention it here.
I missed this first comment yesterday.
This is a good idea, although I think we might want to test against tokio master
to catch these issues before the release. I was thinking a weekly action that runs against tokio master as well as always running against the latest version (I suppose this should be in addition to running against some "minimum stable tokio version") in our PRs and merges to main.
In a certain way this might be easier than the webhook - although having that webhook to catch things on the actual release would also be really nice.
(edit(2024-01-24): Updated title and description to clarfiy that it is only self wakes coming from
tokio::task::yield_now()
that are affected)What is the issue?
When observing an instrumented program built with Tokio 1.23 or later, Tokio Console isn't able to detect self wakes coming from
tokio::task::yield_now()
.This is due to tokio-rs/tokio#5223, where the self wake was changed to be deferred so that self waking tasks wouldn't starve the resource drivers.
How can the bug be reproduced?
The following code should show a task with a single self wake:
If build with Tokio 1.22.0 then we see the wake counted as a self wake in Tokio Console. We can confirm this using
ari-subscriber
where we get the output (truncated):Specifically there is a wake operation (
op="waker.wake_by_ref"
) inside theruntime.spawn
span.When updating to Tokio 1.23.0, Tokio Console no longer counts the single wake as a self wake. Again, we can confirm using
ari-subscriber
(truncated output):Here we see that the waker is cloned (
op="waker.clone"
) within theruntime.spanw
span for the task, but the wake itself happens by value after the span has exited (op="waker.wake"
).Logs, error output, etc
No response
Versions
Tokio 1.23
Possible solution
We may need to modify the instrumentation in Tokio itself to be able to detect a self wake. Since Tokio knows that it is performing a self wake (the deferred waking code is only used from
tokio::task::yield_now()
), it may be possible to explicitly include in the instrumentation when a wake is a self wake.It's not necessarily straight forward though, this code passes through the std library
Waker
object.Additional context
No response
Would you like to work on fixing this bug?
yes