Open pbrisbin opened 3 months ago
The ID generator mechanism is intended to be pluggable, so my initial instinct would be to try replacing the default one to see if it helps: https://hackage.haskell.org/package/hs-opentelemetry-api-0.0.3.3/docs/OpenTelemetry-Trace-Id-Generator.html
Possibly the PRNG is starting with the same seed, or I otherwise made a mistake in the implementation.
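The same-seed hypothesis is easy to demonstrate in isolation. Here is a minimal, self-contained sketch using a toy splitmix-style mixer (not the library's actual generator; all names are mine): two generators that start from the same seed emit byte-for-byte identical trace IDs, which is exactly the collision mode suspected here.

```haskell
import Data.Bits (shiftR, xor)
import Data.Word (Word64)
import Numeric (showHex)

-- Toy splitmix64-style step; stands in for any deterministic PRNG.
-- Returns (output, next state).
step :: Word64 -> (Word64, Word64)
step s =
  let s' = s + 0x9E3779B97F4A7C15
      z0 = (s' `xor` (s' `shiftR` 30)) * 0xBF58476D1CE4E5B9
      z1 = (z0 `xor` (z0 `shiftR` 27)) * 0x94D049BB133111EB
  in (z1 `xor` (z1 `shiftR` 31), s')

-- Build a 128-bit (32 hex char) "trace ID" from a seed.
traceIdFrom :: Word64 -> String
traceIdFrom seed =
  let (hiWord, s1) = step seed
      (loWord, _)  = step s1
  in pad (showHex hiWord "") ++ pad (showHex loWord "")
  where
    pad h = replicate (16 - length h) '0' ++ h

main :: IO ()
main = do
  putStrLn (traceIdFrom 42)
  -- Two generator instances seeded identically emit identical IDs:
  print (traceIdFrom 42 == traceIdFrom 42)  -- True
```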
Also, we had some weirdness where a third party was sending their propagated context to our backend. It's probably not related to this, but it's worth ensuring that you strip those headers at the boundaries of your system.
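Stripping propagated context at the boundary can look something like this hypothetical filter (names are mine; in a real service you'd hook it into your actual ingress, e.g. as middleware on your web server's header type). It drops the W3C trace-context headers, plus the older B3 header, before your tracing layer sees them:

```haskell
import Data.Char (toLower)

-- Hypothetical boundary filter: drop inbound distributed-tracing headers
-- (W3C `traceparent`/`tracestate`, plus B3) so an external caller's
-- context can never attach to your internal traces.
stripTraceContext :: [(String, String)] -> [(String, String)]
stripTraceContext = filter (\(name, _) -> map toLower name `notElem` propagated)
  where
    propagated = ["traceparent", "tracestate", "b3"]

main :: IO ()
main =
  -- prints [("Accept","*/*")]
  print (stripTraceContext [("Traceparent", "00-abc-def-01"), ("Accept", "*/*")])
```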
> my initial instinct would be to try replacing the default one to see if it helps
Hmm, yeah. I've been looking at its code, but it seems pretty hard to get wrong. I'm wary of replacing your implementation with mine and assuming I'll somehow do better. I'd really like to prove this is happening, and why, before going that route.
Thanks for the tips.
This looks eerily similar to the problem we had, which appears to be fixed by https://github.com/iand675/hs-opentelemetry/pull/116
Wow, OK, there's a way that could explain this!
1. `fancy-api` gets a request for `/activity-feed`; it sets a fresh Trace Id `abc`
2. `fancy-api` gets a request for `/something-else`; it incorrectly gets Trace Id `abc`
3. The `/something-else` handler calls `curricula-api` and propagates `abc` to it via request headers
4. `curricula-api` records its trace with `abc`
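For reference, the propagation in step 3 travels as the W3C `traceparent` request header, formatted `version-traceid-spanid-flags`. A minimal sketch of building it (the helper name is hypothetical, not the SDK's API):

```haskell
-- Hypothetical helper: the W3C `traceparent` header value that carries
-- a Trace Id between services (version 00, sampled flag 01).
mkTraceparent :: String -> String -> String
mkTraceparent traceId spanId = "00-" ++ traceId ++ "-" ++ spanId ++ "-01"

main :: IO ()
main = putStrLn (mkTraceparent "4bf92f3577b34da6a3ce929d0e0e4736" "00f067aa0ba902b7")
-- prints 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```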
At first, I didn't think this could be it, because there are no spans in the Trace from `/something-else`. But we use tail-sampling in `fancy-api` and only export traces that are errors or take longer than 100ms, so it's entirely possible we'd get no report of (2) or (3) at all. This would explain why (4) appears as part of the Trace with (1) (which we got because it was an error), but without any other parents, since those were never exported at all.
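The export policy described above can be modeled as a simple predicate (types and names are hypothetical, not the actual sampler configuration):

```haskell
-- Hypothetical model of the tail-sampling policy: export a trace only
-- if it contains an error or any span slower than 100 ms.
data Span = Span { durationMs :: Int, isError :: Bool }

shouldExport :: [Span] -> Bool
shouldExport spans = any isError spans || any ((> 100) . durationMs) spans

main :: IO ()
main = do
  print (shouldExport [Span 8 False])  -- False: fast, no error, trace dropped
  print (shouldExport [Span 8 True])   -- True: error, trace exported
```

Under this policy the fast, error-free traces for steps (2) and (3) are dropped entirely, which is why (4) can show up with no parent spans.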
:exploding_head:
Thanks @velveteer !
Hi there-
We've gone all in on OTel and finally implemented it across a handful of services using this SDK. We're finding some cases where seemingly unrelated Spans appear in the same Trace. This smells like a collision in the Trace Id generator.
Example:
Here, our `fancy-api` service handles a request and times out after 10s, but 15s later we see spans under the same Trace Id from another service (`curricula-api`).

We do propagate trace context via request headers when `fancy-api` calls `curricula-api`, but we don't expect this particular request to be doing that, and certainly not ~15s after it had already ended.

I realize this project is looking for maintainers (:wave:), but I'm wondering if anyone has suggestions for proving this is a collision, and/or leads to look into for addressing it.