dastrobu opened this issue 1 month ago
Thank you for the very detailed issue. Please don't switch to Jaeger ;).
> I propose adding a `LogDiscardedSpansConfig` in addition to `LogReceivedSpansConfig` to log discarded spans that are causing issues. I can open a PR for this if you think it is a good idea.
100%. I think this is a great idea and would love a PR.
> It was extremely challenging to determine why the `Context` was canceled. This is a weak spot in Go, in my opinion.
Yes. I have most recently felt this pain on the frontend/read path.
> Therefore, Grafana Tempo is behaving correctly, but it was extremely difficult to determine why the context was canceled.
Thank you for digging so deep. This is impressive research and a great contribution to the project.
> Add a new `reason` label to `tempo_discarded_spans_total` to indicate when the client's fault causes traces to be discarded. I am unsure how to implement this, though.
I don't have a good answer for this either. Perhaps we can log the cancel cause (added in Go 1.20): https://pkg.go.dev/context#WithCancelCause
Not much uses this yet, but we could gradually add causes in our own code, and presumably Go will add causes throughout the standard library as well. Perhaps this will grow in value over time?
@joe-elliott, thanks for your feedback.
> 100%. I think this is a great idea and would love a PR.
See #3964 – it took a few days to get it right (hopefully).
> I don't have a good answer for this either. Perhaps we can log the cancel cause.
True, but for the particular case I encountered, we would need to convince the google.golang.org/grpc team to add this. Nevertheless, I'd second adding a cause to Tempo-internal context cancellations where possible.
The grpc team is considering adding the cancel cause; see https://github.com/grpc/grpc-go/issues/7541#issuecomment-2301665760.
Found another issue that would need to be fixed to get the cancel cause reported: batch.go would need to be changed:

```diff
 case <-ctx.Done():
-	return ctx.Err()
+	return context.Cause(ctx)
 }
```
When I find time, I might open an issue/PR to https://github.com/grafana/dskit.
@joe-elliott I created https://github.com/grafana/dskit/issues/576 and https://github.com/grafana/dskit/pull/577 in dskit, but I am not sure whether there is a process for getting someone to look at them.
### Is your feature request related to a problem? Please describe.

We encountered the log message

```
msg="pusher failed to consume trace data" err="context canceled"
```

and had significant difficulty identifying the cause. The root cause was not related to the reasons documented in the troubleshooting guide. The metric `tempo_discarded_spans_total` indicated that spans were being discarded, but the logs did not clarify the reason. We ended up patching Grafana Tempo and even the Go SDK to gather more information about the discarded spans. I would like to propose improvements to observability in this area.
Here is what we did:
### Additional Logging in the Distributor
We added extra logging to distributor.go to identify which spans were causing issues:
I propose adding a `LogDiscardedSpansConfig` in addition to `LogReceivedSpansConfig` to log discarded spans that are causing issues. I can open a PR for this if you think it is a good idea.

### Context Cancellation Debugging
It was extremely challenging to determine why the `Context` was canceled. This is a weak spot in Go, in my opinion. We had to patch the Go SDK to get more information about context cancellation. We added stack traces to context.go:
This led to the following logs:
As you can see, the context was closed due to an RST frame sent by the client; see `handleRSTStream` and `closeStream`. Therefore, Grafana Tempo is behaving correctly, but it was extremely difficult to determine why the context was canceled.
### Confusing Metric Labels
The discarded spans are added to the `tempo_discarded_spans_total` metric with the reason `reasonInternalError` (distributor.go#L440), which was also confusing.

### Describe the solution you'd like
I propose two improvements:

1. Add a `LogDiscardedSpansConfig` to the distributor to log discarded spans causing issues. This should be easily doable.
2. Add a new `reason` label to `tempo_discarded_spans_total` to indicate when the client's fault causes traces to be discarded. I am unsure how to implement this, though.

### Describe alternatives you've considered
Switching to Jaeger 😉.
### Additional context
Grafana was deployed using the tempo-distributed Helm chart.