grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/

Allow a way to query for large traces that exceeded the ingester limit #3862


atavakoliyext commented 2 months ago

Is your feature request related to a problem? Please describe.

We currently have a deployment with `global.max_bytes_per_trace` configured. Occasionally, some traces exceed this limit, which we see in the distributor/ingester logs as:

```
level=warn ... desc = TRACE_TOO_LARGE: max size of trace (...) exceeded while adding 55431 bytes to trace <trace-id> for tenant ...
```
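For reference, the limit is set via the overrides block, roughly like this (the value is illustrative, using the newer per-tenant overrides format):

```yaml
overrides:
  defaults:
    global:
      max_bytes_per_trace: 5000000  # illustrative ~5 MB cap per trace
```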

However, when we want to troubleshoot what was in those traces leading up to the exceedance, the query fails in Grafana with the following error:

```
failed to get trace with id: ... Status: 500 Internal Server Error Body: trace exceeds max size (max bytes: <global.max_bytes_per_trace>)
```

In Grafana, the TraceQL expression used is the exact trace ID reported in the distributor/ingester warning logs.

In a large distributed system, not being able to see the contents of these traces makes it difficult to troubleshoot what led to their large size and to fix the cause.

It is also unexpected that a limit is enforced at the query layer at all, because #1225, which proposes query-layer enforcement, is still open.

Describe the solution you'd like

If a trace size limit is already configured and enforced at the ingest layer, I would like the option of disabling trace size limit enforcement at the query layer, or of having a separate configurable limit for queries (see the sketch below).
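As a sketch of what I mean (the query-side key below is hypothetical; only max_bytes_per_trace exists today):

```yaml
overrides:
  defaults:
    global:
      max_bytes_per_trace: 5000000         # existing limit, enforced at ingest
      max_bytes_per_trace_query: 20000000  # hypothetical: separate, higher limit for reads
```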

Describe alternatives you've considered

We've adjusted the `server.grpc_server_max_(send|recv)_msg_size` limits to be higher than the trace size limit, but this had no effect (and the error messages didn't indicate a gRPC limit being hit first).
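Concretely, we raised these server settings (values illustrative):

```yaml
server:
  grpc_server_max_recv_msg_size: 16777216  # 16 MiB, above our trace size limit
  grpc_server_max_send_msg_size: 16777216
```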

We have also increased the trace size limit in order to find large traces via metrics-generator metrics before they hit the previous limit. However, going from a metric to a specific large trace is toilsome, and it fails to catch cases where a bug quickly creates a large number or size of spans in a short period of time. It also required us to set the limit higher than we'd like as a steady state, partially defeating the purpose of the limit.

Additional context

Our Tempo deployment is configured via Helm chart, so any new configuration would also need to be settable through Helm.
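For example, with the tempo-distributed chart we would expect to set something like the following in values.yaml (the exact key structure is an assumption and varies by chart version):

```yaml
# values.yaml -- key names assumed; check your chart version's defaults
global_overrides:
  defaults:
    global:
      max_bytes_per_trace: 5000000
```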

joe-elliott commented 2 months ago

So there are questions you can ask Tempo using TraceQL to get an understanding of what is creating your enormous trace. These features are in main now.

```
{ trace:id = "foo" } | by(resource.service.name) | select(trace:rootService, trace:rootName)
{ trace:id = "foo" } | rate() by(resource.service.name, name)
```
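The first groups the trace's spans by service (with select() pulling in the root service and root span name for context); the second gives the rate of spans broken down by service and span name, which can surface instrumentation emitting spans in a tight loop.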

The above can help, but what I'd really like to do is create a /api/v2/traces/<id> endpoint that returns the trace nested one layer deep. This would allow us to return something like:

```
{
  trace: {...},
  warning: "trace exceeds max size. partial trace returned."
}
```

There is currently no good way to tell a client that we are returning only part of the trace. HTTP status code 206 (Partial Content) is close and may also be an option, but I'd prefer revving the endpoint and having a way to pass detailed messages back to the client.

atavakoliyext commented 2 months ago

Thanks for your response @joe-elliott. Having the endpoint you described sounds like a good long-term direction, but I was still surprised that there's any query-level enforcement of trace size limits, ~~given that #1225 is still open~~ (I see it was just closed). ~~Was this delivered through a different issue, or was it always the case that trace size limits were enforced in queries?~~ If this is expected behavior, does my request to have separate configuration for ingest-level vs. query-level trace limit enforcement sound reasonable, even with the endpoint you described?

> ```
> { trace:id = "foo" } | by(resource.service.name) | select(trace:rootService, trace:rootName)
> { trace:id = "foo" } | rate() by(resource.service.name, name)
> ```
>
> These features are in main now.

Thanks for sharing these! I'm excited to try them out. Are these currently scoped to (or exist in some form in) a release?

joe-elliott commented 2 months ago

The primary reason to enforce this on the query path is to prevent instability from parsing and combining hundreds of MBs of trace data. Thank you for pointing out that issue! As you noticed, I closed it :).

> If this is expected behavior, does my request to have separate configuration for ingest-level vs. query-level trace limit enforcement sound reasonable, even with the endpoint you described?

I think splitting this limit into two is a fine idea. We would definitely accept that PR.

> Are these currently scoped to (or exist in some form in) a release?

I think we are in luck: they should be doable in 2.5. I thought perhaps `trace:id` did not make the cutoff, but it did.

github-actions[bot] commented 6 days ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply the keepalive label to exempt this issue.