grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

TraceQL query '{}' does not return all spans #3791

Closed. nerdvegas closed this issue 3 weeks ago.

nerdvegas commented 3 months ago

Describe the bug
Using Tempo v2.4.2 via the otel-lgtm container. Running the query '{}' returns significantly fewer spans than '{name=~".+"}'. The docs don't mention this, and I'm completely mystified. I first noticed it when the set of span names logged by the otel collector (after adding a 'debug' exporter) didn't match the results we were getting from the '{}' query.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
Both queries should return the same set of spans.

Environment: Tempo v2.4.2 via the otel-lgtm container.
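
A minimal comparison sketch (assuming Tempo's HTTP API on its default port 3200, which the otel-lgtm container may expose differently): run both queries against /api/search and diff the span IDs. If '{}' really matches every span, '{name=~".+"}' should come back as a subset of it.

    # Sketch: compare span IDs returned by '{}' and '{name=~".+"}' over the same data.
    # The endpoint URL is an assumption; adjust it to wherever Tempo is exposed.
    import requests

    TEMPO = "http://localhost:3200"  # assumed Tempo HTTP endpoint

    def span_ids(q: str) -> set[str]:
        """Run a TraceQL search and collect every span ID across the returned traces."""
        resp = requests.get(f"{TEMPO}/api/search", params={"q": q, "limit": 100000})
        resp.raise_for_status()
        ids = set()
        for trace in resp.json().get("traces", []):
            for span_set in trace.get("spanSets", []):
                for span in span_set.get("spans", []):
                    ids.add(span["spanID"])
        return ids

    all_spans = span_ids("{}")
    named_spans = span_ids('{name=~".+"}')

    # Spans matched by the named query but missing from '{}' indicate the discrepancy.
    print("only in {name=~\".+\"}:", sorted(named_spans - all_spans))
    print("only in {}:", len(all_spans - named_spans))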

nerdvegas commented 3 months ago

Further info.

The {} query appears to be picking up cases where there are unnamed spans, e.g.:

    {
      "traceID": "5c44da17ce4a17ea7ae25735637d49ad",
      "rootServiceName": REDACTED",
      "rootTraceName": "REDACTED"",
      "startTimeUnixNano": "1718761370617897594",
      "spanSet": {
        "spans": [
          {
            "spanID": "d11758357946fe59",
            "startTimeUnixNano": "1718761370617911249",
            "durationNanos": "29105"
          },
          {
            "spanID": "f5e55735c11b75cd",
            "startTimeUnixNano": "1718761370617898485",
            "durationNanos": "42229"
          },
          {
            "spanID": "7ae25735637d49ad",
            "startTimeUnixNano": "1718761370617897594",
            "durationNanos": "45445"
          }
        ],
        "matched": 3
      },

However, these extra unnamed spans are nowhere to be found in the otelcol debug exporter output.

{} should be a superset of {name=~".+"}, right? I have limit set to the maximum (100,000).

I also can't figure out why spans are sometimes unnamed, or what the difference is between spanSet and spanSets returned in the response. https://grafana.com/docs/tempo/latest/api_docs/#search has no mention of either of these.

nerdvegas commented 3 months ago

More:

After reading the following, I added start/end to make sure TraceQL is pulling from the backend in all cases:

end = (unix epoch seconds) Optional. Along with start, define a time range from which traces should be returned. Providing both start and end changes the way that Tempo searches. If the parameters aren’t provided, then Tempo searches the recent trace data stored in the ingesters. If the parameters are provided, it searches the backend as well.

However, the results are the same: {} returns a large number of unnamed spans (perhaps not unexpected), but {name=~".+"} is still not a subset (it contains spans not returned by {}), despite limit being set in both cases and the total number of traces and spans being well under 100,000.
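
For reference, roughly what adding start/end looks like against /api/search (a sketch; both values are unix epoch seconds, and the endpoint/port are assumptions):

    import time
    import requests

    end = int(time.time())
    start = end - 3600  # search the last hour so the backend is included, not just ingesters

    resp = requests.get(
        "http://localhost:3200/api/search",  # assumed endpoint
        params={"q": "{}", "start": start, "end": end, "limit": 100000},
    )
    resp.raise_for_status()
    print(len(resp.json().get("traces", [])), "traces")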

joe-elliott commented 3 months ago

A quick attempt internally is not reproducing this issue. Over a 5 minute period on a low volume test tenant these two queries return the exact same spans.

Can you share the spans that are returned by {name=~".+"} that are not returned by {}? I assume we are querying the exact same historical time range every time? Are the results consistent?

I also can't figure out why spans are sometimes unnamed

Tempo will not return the name unless you request it. {} | select(name) should return all spans with their names.
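
For example (a rough sketch, assuming the same /api/search endpoint as above; exactly where the name surfaces in the JSON response may vary by Tempo version):

    import requests

    resp = requests.get(
        "http://localhost:3200/api/search",  # assumed endpoint
        params={"q": "{} | select(name)", "limit": 100000},
    )
    resp.raise_for_status()
    for trace in resp.json().get("traces", []):
        for span_set in trace.get("spanSets", []):
            print(span_set.get("spans", []))  # spans should now carry their names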

spanSet and spanSets

Originally we only had spanSet, but we made an API change and currently populate both spanSet and spanSets because older versions of Grafana still use spanSet. Ignore spanSet and only parse spanSets.
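
A small sketch of that parsing rule, where trace is one entry from the traces array of an /api/search response (shaped like the JSON above):

    def spans_of(trace: dict) -> list[dict]:
        """Collect spans from 'spanSets' only; the legacy 'spanSet' field is ignored."""
        spans = []
        for span_set in trace.get("spanSets", []):
            spans.extend(span_set.get("spans", []))
        return spans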

nerdvegas commented 3 months ago

Ignore spanSet and only parse spanSets.

good info thanks

Are the results consistent?

Yes

Can you share the spans that are returned by {name=~".+"} that are not returned by {}?

There are a few hundred span names, but I've been looking at one specifically because it shows a large discrepancy. In the {} query I get 102 traces containing that span; in the {name=~".+"} query I get just over 500. In the latter query, rootTraceName is set to the span in question.

Would it help if I gave you a dump of the data? I have /data and /tmp/tempo bind-mounted when I launch the grafana/otel-lgtm container, so I can resurrect the same Grafana session later. The size is approximately 200 KB as a tar.gz. I would have to check with my employer first, though.

joe-elliott commented 3 months ago

An info dump would be helpful. If you are recreating this with a load generator and a simple set of steps, that would work too.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.