grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0
3.93k stars 509 forks source link

tempo-search: <root span not yet received> displayed in search results even though root span exists #1476

Open annanay25 opened 2 years ago

annanay25 commented 2 years ago

Describe the bug

It has been reported that in certain cases, the tempo search results will display "root span not yet received" for a trace even if the underlying data indicates the existence of a span with an empty parentSpanID.

image

In this case it was noticed that there were two batches in the trace json, one of which had a span with an empty parentSpanID.

To Reproduce Steps to reproduce the behavior:

  1. Start Tempo (SHA or version)
  2. Perform Operations (Read/Write/Others)

Expected behavior

The "root span not yet received" should have been replaced by the service name of the root span.

Environment:

Additional Context

Since just one batch in the trace json had a root span, it might be an issue with WAL dedupe?

joe-elliott commented 2 years ago

By default we are using RF=3 and the overall write succeeds as long as 2 of the 3 ingester writes succeed. What this means is it's possible that the trace can be found in 3 places (one of which doesn't include the root span). If the search criteria happens to match the copy of the trace that does not include the root span search will always return "root span not found".

Then, when the user does a trace by id search they will receive the entire trace b/c we find and recombine all parts of the trace.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

logamanig commented 1 year ago

It's not only showing root span not yet received, it makes the search inaccurate as the traces is not even appearing. I need to search multiple times to get the result. this is affected mostly on the traces with higher number of spans. I have some traces that exceeds 250 spans, and no result or root span not yet received are keep coming. it's difficult to make the whole team to understand the tempo design.

And traces with more spans is not appearing in the normal search at all. Just selecting the service is not enough to list down traces with high no. of spans. need to add additional filters like SpanName or Min. Duration, still need to search multiple times and sometimes multiple search also didn't help

zohnannor commented 9 months ago

Is there any way to at least work around this issue? Or are there plans to fix it in the future?

joe-elliott commented 9 months ago

We have fixed at least one bug that caused this to occur more often than it should have, but fundamentally it could always happen. If the root span is received between the search and the trace by id lookup then this will occur.

One option is to use traceql to exclude traces for which a root span has not been received like this:

{ <conditions i care about> && rootServiceName != "" }
fede843 commented 7 months ago

I am seeing a lot of them as well. It complicates the filtering, as I do filter by service name, and it does not work for these. In my case I have one instance of each component in a docker monolithic, so there is no duplication or majority rule. Query inspector is showing the first span with the parent 0000000000000000, but yet, it comes with the <root span not yet received>

joe-elliott commented 7 months ago

tempo does have metrics to help with this now. this query:

1 - sum(rate(tempo_warnings_total{reason="disconnected_trace_flushed_to_wal"}[5m])) / sum(rate(tempo_ingester_traces_created_total{}[5m]))

will give you a value between 0 and 1 that indicates what percentage of your traces are complete when they are flushed to the wal. flushing a trace to a wal makes it discoverable in search. if this value is close to 1 you should see very few <root span not yet received> messages.

fede843 commented 7 months ago

got values between 0.02 and 0.07.

if this value is close to 1 you should see very few <root span not yet received> messages.

You meant the opposite?

joe-elliott commented 7 months ago

no. 1 means 100% of your traces are fully connected when flushed to the wal. this is almost impossible to maintain.

.02 -> .07 means 2-7% of your traces are fully connected.

fede843 commented 7 months ago

could you define what fully connected is in this context? maybe an example?

joe-elliott commented 7 months ago

fully connected means that all parent span ids are found in the trace. so tempo will check every single span in the trace and confirm that all parent ids for all spans are also present. if a single parent is missing it considers it a disconnected trace

fede843 commented 7 months ago

OK, I guessed that was the meaning, but there is no way only 2-7% are the only connected traces. It does not make sense in our case at least. I thought that that % was the unconnected part. Not sure then what is going on there.

gao-artur commented 7 months ago

@fede843 are you sure you didn't miss the 1 - before sum?

fede843 commented 7 months ago

aggghh silly me. That was it. Sorry. Now it makes sense. Thanks for the patience.

Coming back to the former issue, what is the status? Is there anything I can share to spot the root cause?