grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Multiple root spans and orphaned spans being combined into a single trace #3776

Open codedinsugar opened 2 weeks ago

codedinsugar commented 2 weeks ago

Describe the bug

When viewing traces in a Grafana Tempo v2.3.1 microservices deployment, a small percentage of our traces exhibit the following strange behavior:

For traces showing two requests, the trace view shows a gap between the markers under the Service & Operation table. The trace has a duration of 7 seconds, but each request lasts under 5 ms, so there is a 7-second gap between the first request and the second. See the white line in the screenshot.

[Screenshot: grafana-tempo-weird-traces-problem]

To Reproduce

Steps to reproduce the behavior:

  1. Manually instrument a Java application with OTel framework 1.37.0
  2. Send 50k-100k requests to the microservice
  3. Observe traces in Grafana Tempo

Expected behavior

Only a single request should be shown in each trace.

Environment:

Additional Context

mapno commented 2 weeks ago

Hi @codedinsugar. At first glance, this looks like an issue with the instrumentation, i.e., the same "context" is being used for both calls. Can you verify that you're creating a new trace independently for each call, without reusing something that might carry the context of a previous trace? Alternatively, would it be possible to share a reproducible setup that generates those traces? "Manually instrument a Java application with OTel framework 1.37.0" is too generic.
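To illustrate the kind of context reuse described above, here is a minimal sketch of one way it can happen. It is a simplified model built only on the Java standard library, not the OpenTelemetry API: the `ThreadLocal` stands in for the tracing library's thread-bound current context, and `ContextLeakDemo`, `handleRequest`, and `run` are hypothetical names. The assumption is that requests are served by a reused worker thread and that the code which activates a context does not always restore the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextLeakDemo {
    // Hypothetical stand-in for a tracing library's thread-bound "current context".
    private static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    // Handles one request and returns the trace ID its span is recorded under.
    public static String handleRequest(boolean closeScope) {
        String inherited = CURRENT_TRACE_ID.get();
        // A span started while a context is active joins that context's trace;
        // otherwise a fresh trace ID is generated.
        String traceId = (inherited != null)
                ? inherited
                : UUID.randomUUID().toString().replace("-", "");
        CURRENT_TRACE_ID.set(traceId);
        try {
            return traceId; // the real request work would happen here
        } finally {
            if (closeScope) {
                CURRENT_TRACE_ID.remove(); // restore the empty context
            }
        }
    }

    // Runs two sequential requests on one reused thread, as a pooled server worker would.
    public static List<String> run(boolean closeScope) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            List<String> traceIds = new ArrayList<>();
            traceIds.add(worker.submit(() -> handleRequest(closeScope)).get());
            traceIds.add(worker.submit(() -> handleRequest(closeScope)).get());
            return traceIds;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> leaky = run(false);
        List<String> clean = run(true);
        System.out.println("scope left open: same trace? " + leaky.get(0).equals(leaky.get(1)));
        System.out.println("scope closed:    same trace? " + clean.get(0).equals(clean.get(1)));
    }
}
```

In this model, leaving the context active makes the second, unrelated request inherit the first request's trace ID, so both land in one trace separated by however long the thread sat idle. Restoring the context in a `finally` block gives each request its own trace. Whether this matches the application in question is an open assumption.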

joe-elliott commented 2 weeks ago

Other things to consider:

codedinsugar commented 2 weeks ago

@mapno that is something we're considering. We've recently refactored our app along these lines, but I think the tracer instantiation might still be an issue for us. One thing we're trying is to add the trace ID as a span attribute using Span.current().getSpanContext().getTraceId(), but the value is always all zeros and we're not sure why. Any thoughts here?
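One possible explanation for the all-zero value, offered as a sketch rather than a diagnosis: in the OpenTelemetry Java API, Span.current() returns an invalid span, whose trace ID is 32 zeros, whenever no span has been made current on the calling thread. That can be the case if the span was started but never activated, or if the lookup runs on a different thread. The model below uses only the Java standard library. The `ThreadLocal` stands in for the current context, and `CurrentSpanDemo`, `currentTraceId`, and `observe` are hypothetical names, not OpenTelemetry calls.

```java
public class CurrentSpanDemo {
    // The invalid trace ID is 32 zero characters.
    public static final String INVALID_TRACE_ID = "0".repeat(32);

    // Hypothetical stand-in for the thread-bound current span.
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Models reading the current span's trace ID: all zeros when nothing is current.
    public static String currentTraceId() {
        String id = CURRENT.get();
        return (id != null) ? id : INVALID_TRACE_ID;
    }

    // Returns what is observed {before activation, inside the scope, on another thread}.
    public static String[] observe(String traceId) {
        // The span exists at this point but has not been made current yet.
        String beforeScope = currentTraceId();

        CURRENT.set(traceId); // stands in for activating the span on this thread
        try {
            String insideScope = currentTraceId();

            // A new thread does not see this thread's current span.
            String[] otherThread = new String[1];
            Thread t = new Thread(() -> otherThread[0] = currentTraceId());
            t.start();
            t.join();

            return new String[] {beforeScope, insideScope, otherThread[0]};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            CURRENT.remove(); // stands in for closing the scope
        }
    }

    public static void main(String[] args) {
        String[] seen = observe("4bf92f3577b34da6a3ce929d0e0e4736");
        System.out.println("before activation: " + seen[0]);
        System.out.println("inside scope:      " + seen[1]);
        System.out.println("other thread:      " + seen[2]);
    }
}
```

Only the read made inside the active scope on the same thread sees the real trace ID. The other two reads return zeros. If the attribute is set where the span object is already in hand, reading the trace ID from that span directly avoids depending on what is current.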

We'd like to provide a reproducible setup but this is a proprietary monolith and cannot be shared. We might be able to build a smaller sanitized version but that'll take time.

@joe-elliott thanks for the suggestion; we'll consider it. Our only concern is the "This is not recommended for production environments" statement.

joe-elliott commented 2 weeks ago

"This is not recommended for production environments"

I would definitely not leave it on permanently, but for a short time period it may be helpful.