Open cdanis opened 2 days ago
Jaeger is supposed to de-duplicate these spans on read.
The issue might be that we only enabled tags-as-fields.all=true on both the main and archive ES configurations partway into our deployment? For at least two of these traces, they're written in archive storage "both ways".
I think it's expected that the duplicate spans will be written to storage - we don't do read-before-write, since it's never needed for primary storage and archive storage is the same implementation. In the case of Cassandra we actually create a primary key based on span ID and content hash, so dups are handled by the storage. IIRC the ES implementation does not assign a PK (which I assume would influence sharding decisions in ES and potentially add another RPC from the coordinator node to the required shard), so each span document gets a unique ES-internal ID and duplicates will exist if the same span is written more than once. Once again, I think that's acceptable behavior for archive storage because the query layer is supposed to dedupe spans with the same ID (by merging them). But that doesn't seem to be happening, which is a bug.
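For illustration, here is a minimal sketch of that primary-key idea translated to a document ID: derive the ID from the span ID plus a hash of the span content, so re-writing the same span overwrites the existing document instead of adding a new one. The Span type and docID helper below are simplified stand-ins for this discussion, not Jaeger's actual model or ES writer.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Span is a simplified stand-in for the domain span model; the real model
// carries many more fields (operation name, timestamps, tags, logs, process).
type Span struct {
	TraceID string
	SpanID  string
	Payload []byte // serialized span content
}

// docID derives a deterministic storage ID from the span ID plus a hash of
// the span content. Writing the same span twice yields the same ID, so the
// second write becomes an overwrite rather than a new document. (The
// ES/OpenSearch writer instead lets the store assign a fresh internal _id
// per write, which is why duplicates accumulate.)
func docID(s Span) string {
	sum := sha256.Sum256(s.Payload)
	return s.SpanID + "-" + hex.EncodeToString(sum[:8])
}

func main() {
	s := Span{TraceID: "abc", SpanID: "0000000000000001", Payload: []byte("...")}
	fmt.Println(docID(s)) // the same input always produces the same document ID
}
```

As noted above, the trade-off in ES is that supplying the document ID yourself feeds into routing and can add a hop from the coordinating node to the owning shard.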
Sorry, I was being unclear -- I understand duplicate documents are expected, and I agree that seems like fine behavior. But I was trying to point out that enabling tags-as-fields.all=true after having archived traces without it enabled might be an unusual corner case for the dedup/merge logic.
I don't think it should matter. I expect deduping of spans to be done by the query code, not by a specific storage implementation. That flag affects the internal storage representation, but once the data is read back into the domain model there shouldn't be any difference for the deduping logic.
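To make that concrete, here is a small sketch (with a simplified KeyValue type standing in for the domain-model tag type) showing that the two storage shapes read back into identical domain tags, so dedupe logic running on the domain model cannot tell them apart:

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// KeyValue is a simplified stand-in for the domain-model tag type.
type KeyValue struct {
	Key   string
	Value string
}

// fromNested converts the default storage shape (a list of tag objects)
// into domain tags, sorted for a stable comparison.
func fromNested(tags []KeyValue) []KeyValue {
	out := append([]KeyValue(nil), tags...)
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

// fromFields converts the tags-as-fields shape (flattened document fields)
// into the same domain tags.
func fromFields(fields map[string]string) []KeyValue {
	out := make([]KeyValue, 0, len(fields))
	for k, v := range fields {
		out = append(out, KeyValue{Key: k, Value: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	nested := []KeyValue{{"http.method", "GET"}, {"span.kind", "server"}}
	flat := map[string]string{"span.kind": "server", "http.method": "GET"}

	// Once read back into the domain model, the two shapes are identical,
	// so a dedupe pass on domain spans sees no difference between them.
	fmt.Println(reflect.DeepEqual(fromNested(nested), fromFields(flat))) // true
}
```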
Is the query code you're talking about the SpanIDDeduper Adjuster?
// SpanIDDeduper returns an adjuster that changes span ids for server
// spans (i.e. spans with tag: span.kind == server) if there is another
// client span that shares the same span ID. This is needed to deal with
// Zipkin-style clients that reuse the same span ID for both client and server
// side of an RPC call. Jaeger UI expects all spans to have unique IDs.
[...]
func (d *spanIDDeduper) dedupeSpanIDs() {
	oldToNewSpanIDs := make(map[model.SpanID]model.SpanID)
	for _, span := range d.trace.Spans {
		// only replace span IDs for server-side spans that share the ID with something else
		if span.IsRPCServer() && d.isSharedWithClientSpan(span.SpanID) {
We mostly don't have client spans, the way things are deployed now ... and I don't think this would merge two server spans anyway? It would just re-ID them?
Apologies if I'm missing something, this is my second time ever looking at the codebase itself :)
Yes, this one is specific for old school Zipkin spans. I need a real deduper.
Oh jeez lol, I'm sorry.
It's only just now that I understood your very first "supposed to" as a normative statement, and not as saying that this functionality already exists but isn't working in my case.
I did a first draft: https://github.com/jaegertracing/jaeger/compare/main...cdanis:jaeger:dupe-dedupe
I'll test this out and clean it up and attempt sending a PR next week? Or if someone else can do it sooner that's fine too
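For anyone following along, a content-based deduper adjuster would look roughly like the sketch below: key each span by its ID plus a hash of its content, and keep one representative per key. This is only an illustration of the general idea with simplified, hypothetical types; it is not the contents of the draft branch linked above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Span is a simplified stand-in for model.Span; a real deduper would hash
// the full span content (operation, timestamps, tags, logs, process, ...).
type Span struct {
	SpanID  string
	Content string
}

// dedupeSpans drops spans whose ID and content hash have already been seen.
// That is exactly the shape of duplicate produced by re-archiving a trace:
// identical documents that differ only in their store-internal _id.
func dedupeSpans(spans []Span) []Span {
	type key struct {
		id   string
		hash [32]byte
	}
	seen := make(map[key]bool)
	out := spans[:0]
	for _, s := range spans {
		k := key{id: s.SpanID, hash: sha256.Sum256([]byte(s.Content))}
		if seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, s)
	}
	return out
}

func main() {
	spans := []Span{
		{SpanID: "01", Content: "GET /wiki"},
		{SpanID: "01", Content: "GET /wiki"}, // duplicate from a second archive write
		{SpanID: "02", Content: "parse"},
	}
	fmt.Println(len(dedupeSpans(spans))) // 2
}
```

Keying on content as well as ID avoids merging genuinely different spans that happen to share an ID, which is the Zipkin client/server case the existing SpanIDDeduper handles by re-assigning IDs instead.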
What happened?
As an SRE deploying Jaeger at a medium-sized engineering organization (the Wikimedia Foundation), I want to enable trace archival support, for the usual benefits discussed in #57.
However, spans are written to our Elasticsearch(/OpenSearch) archival storage regardless of whether or not they already exist. When archived traces are read from archival storage and displayed, this leads to spans that are duplicated exponentially in the number of times 'archive' was clicked: each press re-archives the trace as currently displayed, duplicates included, so the number of stored copies roughly doubles per click.
We're running Jaeger on Kubernetes using a slight modification of the official Helm chart (if curious see chart and values)
Steps to reproduce
Press 'archive' on a trace, wait until --es.max-span-age is younger than the trace is (so the trace can only be viewed from archive storage), then open the archived trace and press 'archive' again.
Expected behavior
Pressing the 'archive' button on a trace that already exists in archive storage would not mutate anything in archive storage.
Relevant log output
No response
Screenshot
When expanded, the SpanIDs for these traces show up duplicated: the same span ID is listed multiple times.
If you examine the OpenSearch index directly, you find that the spans are duplicate documents with everything identical except the index-internal _id field.
Additional context
No response
Jaeger backend version
v1.56.0
SDK
Envoy sending OpenTelemetry
Pipeline
Envoy -> otelcol -> jaeger-collector
Storage backend
OpenSearch 2.7.0
Operating system
Linux
Deployment model
Kubernetes
Deployment configs