open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0
2.69k stars 780 forks source link

[@opentelmetry/instrumentation-fetch] Internal span setTimeout race condition prevents some metrics from being traced #4683

Open cmackenthun opened 5 months ago

cmackenthun commented 5 months ago

What happened?

Steps to Reproduce

  1. Setup application with fetch instrumentation
  2. Add a POST call (type of call shouldn't matter, but that's what ours was)
  3. Add code to navigate to a new site outside of your application after that call is successful
  4. Run code in a variety of browsers and internet speeds

Expected Result

All network calls for that endpoint are traced

Actual Result

A percentage of these network calls are not traced

Additional Details

We have seen that one of our metrics is not being reported 100% of the time and have narrowed it down to the usage of setTimeout in opentelemetry-instrumentation-fetch as the reason why. Sometimes the routing happens quick enough that the timeout function hasn't run and the span is never ended and therefore can never be flushed or collected on our side.

There is a comment in the fetch instrumentation code about the reasoning for the timeout being waiting for "observer to collect information about resources", but I think overall we would prefer to get metrics that are lacking these resources than none at all. If the setTimeout has to stick around, it would be nice if the fetch instrumentation could know to end these spans when the page is being navigated away from so we can get a higher count of spans

Some docs about setTimeout not be an overly consistent method to use for exact timing: https://developer.mozilla.org/en-US/docs/Web/API/setTimeout#reasons_for_delays_longer_than_specified

dyladan commented 4 months ago

@cmackenthun can you please provide a minimal reproduction so we can work on this?

cmackenthun commented 4 months ago

Sure thing @dyladan

Example repo: https://github.com/cmackenthun/otel--fetch-instrumentation-example

Basically I just copied the example directory from the repo and made some changes to show my usecase. As you can hopefully see from testing (or my gifs below). If I wait for the fetch call to be done and then flush and navigate away, I don't get my fetch span. But if I add an arbitrary wait between the fetch and flush, then I do get my fetch span exported. To me this is because of the setTimeout that is used to end the span inside the fetch instrumentation flow.

fetch() ➡️ flush() ➡️ redirect()

2024-05-17 11 25 28

fetch() ➡️ wait 1s ➡️ flush() ➡️ redirect()

2024-05-17 11 26 04