iand675 / hs-opentelemetry

OpenTelemetry support for the Haskell programming language
https://iankduncan.com/projects/opentelemetry
BSD 3-Clause "New" or "Revised" License

Trace Id collisions #130

Open pbrisbin opened 3 months ago

pbrisbin commented 3 months ago

Hi there-

We've gone all in on OTel and finally implemented it across a handful of services using this SDK. We're finding cases where seemingly-unrelated Spans appear in the same Trace. This smells like a collision in the Trace Id generator.

Example:

[screenshot: example trace with unrelated spans]

Here, our fancy-api service handles a request and times out after 10s, but ~15s later we see spans under the same Trace Id from another service (curricula-api).

We do propagate trace context via request headers when fancy-api calls curricula-api, but we don't expect this particular request to be doing that, and certainly not ~15s after it had already ended.

I realize this project is looking for maintainers (:wave:), but I'm wondering if anyone has any suggestions for proving it's a collision, and/or leads to look into addressing it.

iand675 commented 3 months ago

The ID generator mechanism is intended to be pluggable, so my initial instinct would be to try replacing the default one to see if it helps: https://hackage.haskell.org/package/hs-opentelemetry-api-0.0.3.3/docs/OpenTelemetry-Trace-Id-Generator.html
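
For example, something like this should work as a drop-in replacement that pulls bytes straight from the OS entropy source (via the `entropy` package) rather than a userspace PRNG. This is a sketch against the 0.0.3.3 API linked above, so the field names may differ in other versions:

```haskell
import System.Entropy (getEntropy)
import OpenTelemetry.Trace.Id.Generator (IdGenerator (..))

-- An id generator that asks the OS for fresh random bytes on every
-- call, so there is no per-process PRNG state or seed to collide on.
entropyIdGenerator :: IdGenerator
entropyIdGenerator = IdGenerator
  { generateSpanIdBytes = getEntropy 8    -- span ids are 8 bytes
  , generateTraceIdBytes = getEntropy 16  -- trace ids are 16 bytes
  }
```

You'd then hand that to the tracer provider options when initializing the SDK (the exact option name depends on the SDK version, so check the docs for whichever one you're on).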

Possibly the PRNG is starting with the same seed, or I otherwise made a mistake in the implementation.

iand675 commented 3 months ago

Also, we had some weirdness where a third party was sending their propagated trace context to our backend. Probably not related to this, but it's worth making sure you strip those headers at the boundaries of your system.
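
As a sketch, a wai middleware at the public edge could look something like this (the header names here cover W3C trace context plus B3; trim the list to whichever propagators you actually run):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Network.Wai (Middleware, requestHeaders)

-- Drop inbound trace-propagation headers before the OTel middleware
-- sees them, so external callers can't attach their spans to (or
-- collide with) traces inside your system.
stripInboundTraceContext :: Middleware
stripInboundTraceContext app req respond =
  app req { requestHeaders = filter keep (requestHeaders req) } respond
  where
    keep (name, _) = name `notElem` ["traceparent", "tracestate", "b3"]
```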

pbrisbin commented 3 months ago

> my initial instinct would be to try replacing the default one to see if it helps

Hmm, yeah, I've been looking at its code, but it seems pretty hard to get wrong. I'm wary of swapping your implementation for mine and just assuming I'll do better. I'd really like to prove this is happening, and why, before going that route.
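
One thing I might try instead is wrapping the default generator so it logs every Trace Id it hands out; if the same id ever shows up twice across our services' logs, we've proven the collision, and if it never does, the generator is off the hook. Roughly (assuming the 0.0.3.3 `IdGenerator` record and `defaultIdGenerator` linked above):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Base16 as Base16
import qualified Data.ByteString.Char8 as BS8
import System.IO (stderr)
import OpenTelemetry.Trace.Id.Generator (IdGenerator (..))
import OpenTelemetry.Trace.Id.Generator.Default (defaultIdGenerator)

-- Same ids as the default generator, but each trace id is also
-- written to stderr in hex so duplicates can be grepped for later.
loggingIdGenerator :: IdGenerator
loggingIdGenerator = defaultIdGenerator
  { generateTraceIdBytes = do
      bytes <- generateTraceIdBytes defaultIdGenerator
      BS8.hPutStrLn stderr ("generated trace id: " <> Base16.encode bytes)
      pure bytes
  }
```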

Thanks for the tips.

velveteer commented 3 months ago

This looks eerily similar to the problem we had, which appears to be fixed by https://github.com/iand675/hs-opentelemetry/pull/116

pbrisbin commented 3 months ago

Wow, OK, there's a way that could explain this!

  1. fancy-api gets a request for /activity-feed and generates a fresh Trace Id abc
  2. wai/warp reuse that thread for another request, to /something-else, which incorrectly picks up Trace Id abc from the stale thread-local context
  3. /something-else calls curricula-api and propagates abc to it via request headers
  4. curricula-api records its trace under abc

At first I didn't think this could be it, because there are no spans from /something-else in the Trace. But we use tail-sampling in fancy-api and only export traces that error or take longer than 100ms, so it's entirely possible we'd get no report of (2) or (3) at all. That would explain why (4) appears as part of the same Trace as (1) (which we did export, because it was an error), but without any other parents (since those spans were never exported).

:exploding_head:

Thanks @velveteer !
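
For anyone else who lands here: the leak in (2) is possible because the SDK keeps the active context in a per-thread map (`OpenTelemetry.Context.ThreadLocal`) and warp reuses its worker threads, so a handler that attaches a context and never restores the previous one bleeds that context into the next request served by the same thread. Something bracket-shaped along these lines is the general cure (a sketch of the idea, not the actual change in #116):

```haskell
import Control.Exception (bracket)
import Control.Monad (void)
import OpenTelemetry.Context (Context)
import OpenTelemetry.Context.ThreadLocal (attachContext, detachContext)

-- Attach a context for the duration of an action, then put back
-- whatever was attached before (or clear the slot), even on exception,
-- so a reused warp thread never sees a stale context.
withAttachedContext :: Context -> IO a -> IO a
withAttachedContext ctxt action =
  bracket (attachContext ctxt) restore (const action)
  where
    restore (Just previous) = void (attachContext previous)
    restore Nothing         = void detachContext
```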