seanmonstar / reqwest

An easy and powerful Rust HTTP Client
https://docs.rs/reqwest
Apache License 2.0

resolve tracing span sometimes lingers even after response received #2435

Open SorenHolstHansen opened 1 month ago

SorenHolstHansen commented 1 month ago

We have noticed a weird situation in the production traces for our axum server: sometimes the traces say an endpoint took a very long time, even though the endpoint actually responded very quickly. This doesn't happen all the time, but when it does, it is always related to reqwest calls (i.e. the reqwest crate).

A trace could look something like this (this is how it looks in our prod metrics platform in Azure; you will see below that another platform shows the endpoint as shorter):

endpoint:     -------------
something:    --
reqwest_call:   -----------
stuff_after:       --
response:           --

That is, the response is sent quite quickly, but because the reqwest call appears to hang, the whole endpoint shows as taking a long time. In reality, the reqwest calls all finish very quickly: you can see that "stuff_after" starts shortly afterwards, and we have used other approaches to confirm that they finish fast. So something lingers in the reqwest client that makes the call show as taking a long time.

We even added a field to the root request span (using tower tracing) that shows the endpoint finishing quickly. So from the user's point of view nothing is taking a long time, but on our side we can't trust our tracing.

I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug

To reproduce, I started a local OpenTelemetry platform (SigNoz or Jaeger), started the server in the repo I linked, and then ran seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008

In SigNoz, you should then be able to see cases like these:

[Screenshot: SigNoz trace view, 2024-09-26 at 12:18:23]

You can see in the bottom right that the endpoint took 0 seconds, but resolve took 35 seconds.

I created a similar issue earlier in the tracing crate, which led me to believe that reqwest has long-lived spans.

seanmonstar commented 1 month ago

I haven't followed the code to verify this, but a quick guess is this: when a request is started, the pool that reqwest uses will create two futures and race them: 1) ask the pool for a newly idle connection, 2) create a new connection. If the second option is started but the first option then wins the race (especially likely when resolving is slow, as it appears to be here), the second future is spawned as a background task so it can finish up and insert the new connection into the pool (this reduces socket thrashing).

Perhaps that means the span around resolving outlives the original request span?
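
To make that concrete, here is a minimal sketch of that kind of race, with made-up Pool/Conn stand-ins; this is not hyper-util's actual pool code, just the shape of the behavior described above:

```rust
use std::time::Duration;
use tokio::time::sleep;

struct Pool;
struct Conn;

impl Pool {
    // Stand-in for waiting until another request returns its connection.
    async fn checkout_idle(&self) -> Conn {
        sleep(Duration::from_millis(10)).await;
        Conn
    }
    fn insert(&self, _conn: Conn) {}
}

// Stand-in for opening a new connection; DNS resolution (the `resolve` span)
// happens in here and can be slow.
async fn connect_new() -> Conn {
    sleep(Duration::from_secs(5)).await;
    Conn
}

async fn get_connection(pool: Pool) -> Conn {
    let mut connect = Box::pin(connect_new());
    tokio::select! {
        conn = pool.checkout_idle() => {
            // The idle checkout won the race. Instead of dropping the
            // half-finished connect, spawn it so the new connection still
            // lands in the pool once it is ready. Anything that connect
            // future carries (including a tracing span) now outlives the
            // request that started it.
            tokio::spawn(async move {
                let new_conn = connect.await;
                pool.insert(new_conn);
            });
            conn
        }
        new_conn = &mut connect => new_conn,
    }
}

#[tokio::main]
async fn main() {
    let _conn = get_connection(Pool).await;
}
```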

SorenHolstHansen commented 1 month ago

Yeah, the person who answered the issue in the tracing repo had the same thought about the span outliving the request span.

SorenHolstHansen commented 1 month ago

Is it something we could convince you to take a look at?

seanmonstar commented 1 month ago

It's not something I can dig into myself, but others are welcome. The relevant span is here: https://github.com/hyperium/hyper-util/blob/fcb8565bb8090d90349b04b2caa9c8128a8ae59f/src/client/legacy/connect/dns.rs#L121

I don't know if there's a mechanism to disconnect a span from its parent after creation, or some other solution.

mladedav commented 1 month ago

Not after creation. Parent-child relationships are forever, and all parents live as long as their children.

SorenHolstHansen commented 1 month ago

I thought that maybe a simple tracing subscriber filter could filter out that span? At least that was my next idea

SorenHolstHansen commented 1 month ago

Nvm, using a subscriber filter won't work, since the span is created with .or_current() (see here), so it can't be disabled.
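
For reference, this is roughly what .or_current() does (per the tracing docs); the snippet is illustrative, not hyper-util's exact code:

```rust
use tracing::{debug_span, Span};

// If a filter disables the "resolve" span, `or_current()` falls back to
// whatever span is current at the call site (e.g. the request span), so the
// background resolver work still holds a span handle instead of being
// detached; a `hyper_util=off` directive alone doesn't make that go away.
fn resolve_span() -> Span {
    debug_span!("resolve").or_current()
}
```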

seanmonstar commented 1 month ago

Perhaps the .or_current() could be removed. I'm not sure why it's there; I don't usually see it when declaring spans.

SorenHolstHansen commented 1 month ago

Looking at the docs for .or_current(), it looks like it is the recommended approach for that kind of situation. Perhaps it would be better if the span had no parent and instead used .follows_from() on its parent. I'm not an expert though, so perhaps @mladedav has some insights.
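
As a sketch of the idea (using tracing's parent: None and Span::follows_from; this is not an actual hyper-util patch):

```rust
use tracing::{debug_span, Span};

// Create the span with no parent so it cannot keep the request span alive,
// and record the causal relationship with `follows_from` instead.
fn detached_resolve_span() -> Span {
    let span = debug_span!(parent: None, "resolve");
    span.follows_from(Span::current());
    span
}
```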

seanmonstar commented 1 month ago

This seems like a possible solution: https://github.com/hyperium/hyper-util/pull/153

SorenHolstHansen commented 1 month ago

Looks good to me!

svix-jplatte commented 2 weeks ago

The new hyper-util v0.1.10 is out, but we're still seeing panics from tracing that go away when downgrading it to v0.1.7. Previously I was under the impression that this behavior was a combination of https://github.com/tokio-rs/tracing/issues/2870 and this bug (since the version it started happening with matches exactly), but now I'm not so certain anymore.

@SorenHolstHansen Does the original bug still reproduce with hyper-util v0.1.10?

SorenHolstHansen commented 2 weeks ago

Hmmm, well the behavior is different now, but there still seems to be something off.

When I run the repro case now, I no longer get cases where http_request takes a short time but resolve takes a long time. Instead, http_request is reported as taking a long time whenever resolve does; see for instance this image:

[Screenshot: trace view, 2024-11-04 at 09:47:59]

I then tried to filter out resolve by setting an env-filter of trace,hyper-util=off,hyper_util=off, but then I just got this:

[Screenshot: trace view, 2024-11-04 at 09:46:22]

i.e. I can't filter resolve away properly, which seems like a bug too, though I don't know where.

I don't know if that is related to your panics @svix-jplatte, or if it's just a simple oversight on my part?
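
For reference, this is roughly how an env-filter like the one above would typically be installed with tracing-subscriber (an assumption about the setup, not the exact code in the repro repo):

```rust
use tracing_subscriber::{fmt, prelude::*, EnvFilter};

fn init_tracing() {
    // Target names use underscores (hence trying both spellings above);
    // `off` is intended to silence the hyper_util targets entirely.
    let filter = EnvFilter::new("trace,hyper-util=off,hyper_util=off");
    tracing_subscriber::registry()
        .with(filter)
        .with(fmt::layer())
        .init();
}
```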

seanmonstar commented 2 weeks ago

If this is still causing panics, we can yank it out. It was meant to be helpful, not crash 🙈

svix-jplatte commented 2 weeks ago

Well, I'm pretty sure it's not (entirely) hyper-util's fault. It seems to happen only with some async tasks using a non-default tracing subscriber / dispatcher, so I think the tracing bug I linked is also involved.

seanmonstar commented 1 week ago

Yeah, I'm certain it's a problem in tracing; hyper is doing something reasonable.

But the benefit that change was supposed to provide isn't worth causing crashes. Let's revert.