jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

[tests]: Duplicate record in loadParseAndWriteLargeTrace #4466

Open sklarsa opened 1 year ago

sklarsa commented 1 year ago

What happened?

I'm working on a new storage backend and noticed a peculiarity in the loadParseAndWriteLargeTrace method used by testGetLargeSpan.

Is there a reason why the first fixture's span is added to the trace on line 293? https://github.com/jaegertracing/jaeger/blob/3021d314f71b9ce0707e8f5a443dd49b6a73d7da/plugin/storage/integration/integration.go#L293 Since that span's spanID is 3, after the loop is finished there are two spans with identical trace and span IDs but different timestamps. In practice, would this ever happen? If not, can we remove it?
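
To make the shape of the problem concrete, here is a minimal standalone sketch of what the helper appears to do, based on the description above (the loop bounds, the fixture span ID, and the timestamps are assumptions for illustration, not the actual integration.go code):

package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for a stored span: just an ID and a start time.
type span struct {
	spanID    uint64
	startTime time.Time
}

func main() {
	base := time.Now()

	// First span of the example_trace fixture; its span ID happens to be 3.
	fixture := span{spanID: 3, startTime: base}

	spans := []span{fixture}
	for i := 2; i < 10008; i++ { // generated IDs fall in [1, 10008)
		spans = append(spans, span{
			spanID:    uint64(i), // i == 3 collides with the fixture span
			startTime: base.Add(time.Duration(i) * time.Minute),
		})
	}

	count := 0
	for _, sp := range spans {
		if sp.spanID == 3 {
			count++
		}
	}
	fmt.Println("spans with ID 3:", count) // prints 2
}

Running this prints 2: the fixture span and one generated span share the trace and span IDs but have different start times.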

Steps to reproduce

Run the test, break at the end of the loop, and query your datastore:

select trace_id_high, trace_id_low, span_id, timestamp from spans
where span_id=3

(Screenshot: query results showing two rows with span_id = 3 and different timestamps)

Expected behavior

There should only be 1 returned span... I think?

Relevant log output

No response

Screenshot

No response

Additional context

No response

Jaeger backend version

main branch

SDK

No response

Pipeline

No response

Storage backend

QuestDB (under development)

Operating system

Linux

Deployment model

No response

Deployment configs

No response

Monnoroch commented 1 year ago

Same thing has happened to me today. I am testing my new backend and this test is failing.

I think the bug here is that the "example_trace" fixture's Spans[0].SpanID falls in [1, 10008), meaning two spans with the same ID are written.

Changing the code to:

trace := s.getTraceFixture(t, "example_trace")
span := trace.Spans[0]
span.SpanID = model.SpanID(20008) // ADDED: move the fixture span outside the generated range [1, 10008)

fixes the issue.

Upd 1: seems to be a duplicate of https://github.com/jaegertracing/jaeger/issues/3912? However, https://github.com/jaegertracing/jaeger/pull/3913 didn't fix the problem fully.

Upd 2: please note that the issue is NOT that the storage backend can't handle duplicate span ids. The issue is that when the duplicate span is overwritten, the actual outcome has one fewer span than the expected test outcome.

Upd 3: the multiple{1,2,3}_trace.json fixtures also contain identical trace IDs, making the FindTraces test fail.

Monnoroch commented 1 year ago

Another issue in integration tests is that GetDependencies is disabled for gRPC servers. That is quite unfortunate, since my storage server implements the DependenciesReaderPlugin.GetDependencies gRPC method, computing those dependencies from ingested spans.

yurishkuro commented 1 year ago

IMO the storage backends must be able to support multiple spans with the same ID (the official Jaeger backends do). We can debate whether it's a YAGNI kind of feature, but there are certainly valid use cases. One obvious example is Zipkin's shared client/server span model.
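
To illustrate that shared-span case with made-up values (nothing below comes from an actual fixture), a client and a server can report two distinct spans that carry the same trace ID and span ID and differ only in span.kind and timing; a backend keyed purely on (traceID, spanID) would silently keep just one of them:

package main

import (
	"fmt"
	"time"

	"github.com/jaegertracing/jaeger/model"
)

func main() {
	traceID := model.TraceID{Low: 1}
	now := time.Now()

	// Client-side view of the RPC.
	client := &model.Span{
		TraceID:       traceID,
		SpanID:        model.SpanID(3),
		OperationName: "GET /users",
		StartTime:     now,
		Tags:          []model.KeyValue{model.String("span.kind", "client")},
	}

	// Server-side view of the same RPC: same trace ID and same span ID.
	server := &model.Span{
		TraceID:       traceID,
		SpanID:        model.SpanID(3),
		OperationName: "GET /users",
		StartTime:     now.Add(2 * time.Millisecond),
		Tags:          []model.KeyValue{model.String("span.kind", "server")},
	}

	// Both spans are legitimate and both must survive storage.
	fmt.Println(client.SpanID == server.SpanID) // true
}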

Monnoroch commented 1 year ago

Can you please advise on how exactly I should support it? In your comment you suggest overwriting data when the same span id is ingested a second time. This is exactly what I am doing already. I explicitly mentioned in my comment above that this is NOT the issue (see "Upd 2").

Monnoroch commented 1 year ago

You can take a look at the change that fixes all issues with remote gRPC server tests: https://github.com/jaegertracing/jaeger/pull/4468. The PR has some extra stuff for my local testing, but it's easy enough to understand.

yurishkuro commented 1 year ago

Can you please advise on how exactly I should support it?

For instance, Cassandra storage implementation computes a hash of all span data and uses <traceID, spanID, hash> as the primary key, thereby allowing multiple spans with the same ID.
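
A rough sketch of that idea (hypothetical key layout and hash inputs, not the actual Cassandra schema or code): writing the identical span twice lands on the same key and simply overwrites itself, while a different span that merely reuses the span ID gets its own row.

package main

import (
	"fmt"
	"hash/fnv"
)

// Hypothetical storage key: trace ID + span ID + a hash of the span contents.
type spanKey struct {
	traceID string
	spanID  uint64
	hash    uint64
}

// contentHash stands in for hashing all span data; here it only covers the
// operation name and start timestamp for brevity.
func contentHash(operation string, startMicros int64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s|%d", operation, startMicros)
	return h.Sum64()
}

func main() {
	store := map[spanKey]string{}

	write := func(traceID string, spanID uint64, op string, startMicros int64) {
		key := spanKey{traceID: traceID, spanID: spanID, hash: contentHash(op, startMicros)}
		store[key] = op
	}

	// The same span written twice maps to the same key: idempotent overwrite.
	write("trace-1", 3, "GET /users", 1000)
	write("trace-1", 3, "GET /users", 1000)

	// A different span that reuses the span ID (shared client/server span,
	// or the duplicate produced by the large-trace test) gets its own row.
	write("trace-1", 3, "GET /users", 2000)

	fmt.Println("rows stored:", len(store)) // 2
}

This prints 2: the repeated identical write collapses into one row, while the span that shares only the ID is kept separately.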

Monnoroch commented 1 year ago

Just to be sure we're on the same page: what you mean is that the comment you made in the other bug about overwriting span data is not the correct way of handling duplicate span ids?

yurishkuro commented 1 year ago

Yeah, it's not completely correct - it's ok to overwrite the data if the data is the same, which is what the additional hash value is achieving. E.g. if the same span is ingested several times into Cassandra, it would still have only a single record because the hash will be the same.

Monnoroch commented 1 year ago

Gotcha, thank you for clarifying!

sklarsa commented 1 year ago

So going back to that test, is it correct as-is? Or do we need to reduce the expected number of traces by 1 since we're sending a duplicate?

yurishkuro commented 1 year ago

I didn't have time to look into details, but my thinking is:

Monnoroch commented 1 year ago

if some tests occasionally have repeated span IDs, it's a good thing since they test this specific Zipkin use case

Perhaps it would be best to have an explicit test, testing support for duplicate ids? That way it could also serve as documentation. It took me a few hours of reading code and bugs to understand how it is meant to work and I still got it wrong.

yurishkuro commented 1 year ago

agreed

yurishkuro commented 1 year ago

we could also document this expectation in the Protobuf file for the storage API

jkowall commented 3 months ago

This doesn't seem like a bug, maybe a missing test....

yurishkuro commented 3 months ago

@jkowall let me reopen this one, it seems important to fix. It's not just a test, it's the contract with the storage (especially 3rd party storage implementations).