SorenHolstHansen opened 1 month ago
I haven't followed the code to verify this, but a quick guess is this: when a request is started, the pool that reqwest uses will create two futures and race them: 1) ask the pool for a newly idle connection, 2) create a new connection. If the second option is started, but then the first option wins the race (especially when resolution is slow), the second future will be spawned as a background task to finish up and insert the new connection into the pool (this reduces socket thrashing).
Perhaps that means the span around resolving outlives the original request span?
Yeah, the person who answered the issue in the tracing repo thought the same thing about the span outliving the request span
Is it something we could convince you to take a look at?
It's not something I can dig into myself, but others are welcome. The relevant span is here: https://github.com/hyperium/hyper-util/blob/fcb8565bb8090d90349b04b2caa9c8128a8ae59f/src/client/legacy/connect/dns.rs#L121
I don't know if there's a mechanism to disconnect a span from its parent after creation, or some other solution.
Not after creation. Parent child relationships are forever and all parents live as long as their children.
I thought that maybe a simple tracing subscriber filter could filter out that span? At least that was my next idea
Nvm, using a subscriber filter won't work since the span is created with `.or_current()` (see here), so it can't be disabled.
Perhaps the `.or_current()` could be removed, I'm not sure why it's there, I don't usually see that when declaring spans.
Looking at the docs for `.or_current()`, it looks like it is the recommended way for that kind of situation. Perhaps it would be better if the span had no parent, but `.follows_from()` its parent instead. Not an expert though, so perhaps @mladedav has some insights
This seems like a possible solution: https://github.com/hyperium/hyper-util/pull/153
Looks good to me!
The new hyper-util v0.1.10 is out, but we're still seeing panics from tracing that get resolved when downgrading it to v0.1.7. Previously I was under the impression that this behavior was a combination of https://github.com/tokio-rs/tracing/issues/2870 and this bug (since the version it started happening with matches exactly), now I'm not so certain anymore.
@SorenHolstHansen Does the original bug still reproduce with hyper-util v0.1.10?
Hmmm, well the behavior is different now, but there still seems to be something off.
Now, when I run the repro case, I don't get cases where the `http_request` takes a short time but the `resolve` takes a long time. Instead, the `http_request` says it takes a long time as well when `resolve` does, i.e. see for instance this image
I tried to then filter out the resolve by setting an env-filter of `trace,hyper-util=off,hyper_util=off`, but then I just got this
i.e. I can't filter `resolve` away properly, which seems like a bug too, though I don't know where.
I don't know if that is related to your panics @svix-jplatte, or if it's just a simple oversight on my part?
If this is still causing panics, we can yank it out. It was meant to be helpful, not crash 🙈
Well I'm pretty sure it's not (entirely) `hyper-util`'s fault. It seems to only happen with some async tasks using a non-default tracing subscriber / dispatcher, so I think the tracing bug I linked is also involved.
Yea, I'm certain it's a problem in tracing, hyper is doing something reasonable.
But, the benefit that change was supposed to provide is not worth causing crashes. Let's revert.
We have noticed a weird situation in our production traces for our axum server: sometimes, even if an endpoint produces a response very quickly, the traces say that the endpoint took a very long time. This doesn't happen all the time, but when it does, it is always related to reqwest calls (i.e. the reqwest crate).
A trace could look something like this (this is how it looks in our prod metrics platform in Azure; you will see below that another platform shows the endpoint as shorter)
that is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. Now, the reqwest calls actually all finish very quickly. You can see that the "stuff_after" continues shortly after, and we have done other approaches to show that they actually finish fast. So something is hanging in the reqwest client that makes this show as taking a long time.
We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the users point of view, nothing is taking a long time, but for us, we can't trust our tracing.
I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug
To reproduce, I started a local OpenTelemetry platform (SigNoz or Jaeger), started the server in the repo I linked, and then ran
seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008
In SigNoz, you should then be able to see cases like these:
You can see in the bottom right that the endpoint took 0 seconds, but `resolve` took 35 seconds.
I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long lived spans.