jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

jaeger-v2 e2e tests are unstable #5418

Closed by yurishkuro 5 months ago

yurishkuro commented 6 months ago

@akagami-harsh since merging #5398 I noticed this test failing sporadically. Here's one failed run: https://github.com/jaegertracing/jaeger/actions/runs/8955823698/job/24596965281?pr=5416

=== RUN   TestCassandraStorage/GetLargeSpans
    integration.go:205: Testing Large Trace over 10K ...
    integration.go:375: Dropped binary type attributes from trace ID: 0000000000000011
2024/05/05 05:05:57 traces export: context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    integration.go:145: Waiting for storage backend to update documents, iteration 1 out of 100
    integration.go:145: Waiting for storage backend to update documents, iteration 2 out of 100
...
    integration.go:145: Waiting for storage backend to update documents, iteration 99 out of 100
    integration.go:145: Waiting for storage backend to update documents, iteration 100 out of 100
    integration.go:215: 
            Error Trace:    /home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:215
            Error:          Should be true
            Test:           TestCassandraStorage/GetLargeSpans
    trace_compare.go:74: 
            Error Trace:    /home/runner/work/jaeger/jaeger/plugin/storage/integration/trace_compare.go:74
                                        /home/runner/work/jaeger/jaeger/plugin/storage/integration/trace_compare.go:60
                                        /home/runner/work/jaeger/jaeger/plugin/storage/integration/integration.go:216
            Error:          Not equal: 
                            expected: 10008
                            actual  : 10012

There are a couple of notable things here:

  1. There was a DeadlineExceeded gRPC error earlier, probably during the writing of the spans.
  2. It's odd that more spans were returned than were written.

I don't know whether we have any retries in the writing pipeline, but even if we did, the writes are supposed to be idempotent, since we derive a primary key from the trace/span ID and the content hash. It may be useful to debug this further and log the difference between the written and loaded trace/span IDs (the trace ID is supposed to be the same), and perhaps the test needs to be more resilient to duplicates.
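To illustrate the last point, here is a minimal Go sketch of how a test could tolerate duplicate writes by deduplicating on a key derived from the trace/span ID plus a content hash. The `spanRecord` type and the `dedupeKey`/`dedupe` helpers are hypothetical stand-ins for illustration only, not the actual Jaeger model types or test helpers.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// spanRecord is a hypothetical, simplified stand-in for a stored span;
// the real test works with the Jaeger model types.
type spanRecord struct {
	TraceID string
	SpanID  string
	Payload []byte // serialized span content
}

// dedupeKey combines the trace/span ID with a content hash, mirroring the
// idea that writes keyed this way should be idempotent.
func dedupeKey(s spanRecord) string {
	sum := sha256.Sum256(s.Payload)
	return fmt.Sprintf("%s/%s/%x", s.TraceID, s.SpanID, sum)
}

// dedupe drops duplicate spans so a comparison of written vs. loaded
// counts is not thrown off by retried (idempotent) writes.
func dedupe(spans []spanRecord) []spanRecord {
	seen := make(map[string]struct{}, len(spans))
	out := make([]spanRecord, 0, len(spans))
	for _, s := range spans {
		k := dedupeKey(s)
		if _, ok := seen[k]; ok {
			continue
		}
		seen[k] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	spans := []spanRecord{
		{TraceID: "11", SpanID: "1", Payload: []byte("a")},
		{TraceID: "11", SpanID: "1", Payload: []byte("a")}, // duplicate from a retried write
		{TraceID: "11", SpanID: "2", Payload: []byte("b")},
	}
	fmt.Println("unique spans:", len(dedupe(spans))) // prints 2
}
```

With this kind of comparison, a retried write that lands twice in storage would not flip the expected-vs-actual span count from 10008 to 10012.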

yurishkuro commented 6 months ago

Another example, this time from ES:

=== RUN   TestESStorage/GetLargeSpans
    integration.go:205: Testing Large Trace over 10K ...
    integration.go:375: Dropped binary type attributes from trace ID: 0000000000000011
2024/05/05 16:47:30 traces export: context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    integration.go:145: Waiting for storage backend to update documents, iteration 1 out of 100

I wonder if this happens because, when a large batch is sent to the exporter, our v1 storage implementations save only one span at a time: the storage layer loops through all 10k spans while the exporter times out on the context deadline. It's not entirely clear why 'rpc' would be mentioned in this case, though.
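To make that timing interaction concrete, here is a minimal Go sketch, not the actual exporter or storage code (`writeSpan` and `writeBatch` are hypothetical), showing how a per-span write loop can outlive the batch's context deadline, so the export fails with DeadlineExceeded even though part of the batch was already persisted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// writeSpan is a hypothetical stand-in for a v1 storage writer that
// persists a single span per call.
func writeSpan(ctx context.Context, id int) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(time.Millisecond): // simulated per-span write latency
		return nil
	}
}

// writeBatch loops over the spans one at a time, the way a per-span
// storage adapter would; with ~10k spans the caller's context deadline
// can expire partway through, yielding DeadlineExceeded while some
// spans were already persisted.
func writeBatch(ctx context.Context, n int) (written int, err error) {
	for i := 0; i < n; i++ {
		if err := writeSpan(ctx, i); err != nil {
			return written, err
		}
		written++
	}
	return written, nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	written, err := writeBatch(ctx, 10000)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Printf("deadline exceeded after %d spans; the rest of the batch is lost or retried\n", written)
	} else {
		fmt.Printf("wrote all %d spans\n", written)
	}
}
```

If the exporter then retries the whole batch, the spans persisted before the deadline would be written again, which fits the pattern of the storage returning more spans than the test wrote.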

yurishkuro commented 5 months ago

Tests have been good recently, closing.