grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

[Problem] How can I improve tempo query performance #4239

Open · g3david405 opened this issue 4 weeks ago

g3david405 commented 4 weeks ago

I am using the latest version of tempo-distributed (v2.6.1), and my data volume is approximately 1,000 records per second, with a retention period of 21 days totaling around 900 GB. When performing TraceQL queries, I’m encountering significant performance bottlenecks, especially when querying span or resource attributes.

Based on this article, https://grafana.com/docs/tempo/latest/operations/backend_search/, here are the improvements I've implemented so far:

  1. Using the vParquet4 block format and configuring dedicated_columns for specific span or resource attributes.
  2. Enabling stream_over_http_enabled to allow Grafana to perform queries via streaming.
  3. Scaling out the querier by increasing replicas to 6.
  4. Adjusting the querier’s max_concurrent_queries and queryFrontend’s concurrent_jobs.
  5. Adding scope to attribute queries in TraceQL, for example: .http.request.method = "GET" → span.http.request.method = "GET" (see the fuller example after this list).
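
As an illustration of item 5, a fully scoped query might look like the following (the service name and method value are placeholders, not taken from this setup):

{ resource.service.name = "my-service" && span.http.request.method = "GET" }

Scoping each attribute with span. or resource. tells Tempo which set of attribute columns to read, so it does not have to check both.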

However, despite these adjustments, the performance is still below acceptable levels. Are there any additional optimizations I could make?

My Helm chart values are as follows:

tempo:
  structuredConfig:
    stream_over_http_enabled: true

metricsGenerator:
  enabled: true
  config:
    storage:
      remote_write:
        - url: http://prometheus-server.prometheus.svc.cluster.local/api/v1/write
          send_exemplars: true

ingester:
  resources:
    limits:
      memory: 8Gi

queryFrontend:
  replicas: 2
  config:
    max_outstanding_per_tenant: 2000
    search:
      concurrent_jobs: 100
      target_bytes_per_job: 52428800

querier:
  replicas: 6
  resources:
    limits:
      memory: 10Gi
  config:
    search:
      query_timeout: 60s
    max_concurrent_queries: 30

compactor:
  replicas: 3
  config:
    compaction:
      block_retention: 504h

distributor:
  replicas: 3

traces:
  otlp:
    http:
      enabled: true
    grpc:
      enabled: true

storage:
  trace:
    block:
      version: vParquet4
      dedicated_columns:
        - name: service.name
          type: string
          scope: resource
        - name: k8s.namespace.name
          type: string
          scope: resource
        - name: url.path
          type: string
          scope: span
        - name: http.route
          type: string
          scope: span
        - name: http.target
          type: string
          scope: span
        - name: http.request.method
          type: string
          scope: span
        - name: http.response.status_code
          type: string
          scope: span
        - name: db.name
          type: string
          scope: span
        - name: db.system
          type: string
          scope: span
        - name: peer.service
          type: string
          scope: span
    backend: s3
    s3:
      access_key: 'xxx'
      secret_key: 'xxx'
      bucket: 'tempo-bucket'
      endpoint: 'minio.tenant.svc.cluster.local'
      insecure: true

global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

server:
  http_server_read_timeout: 2m
  http_server_write_timeout: 2m
  grpc_server_max_recv_msg_size: 16777216
  grpc_server_max_send_msg_size: 16777216

joe-elliott commented 3 weeks ago

1000 spans/second is quite low and I'm surprised you're seeing performance issues. Query performance can be quite difficult to debug remotely, but we will do our best.

Can you provide an example of a traceql query that is having issues and the corresponding query frontend logs? We can start there and try to figure out what is causing issues.

g3david405 commented 3 weeks ago

Hi @joe-elliott, here is my TraceQL query: it searches for backend API traces with status 5xx, with a limit of 100 over a 12-hour time range.

{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }

Here are my query-frontend logs:

level=info ts=2024-11-02T16:05:11.870021575Z caller=search_handlers.go:186 msg="search request" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 limit=100 spans_per_spanset=3
level=info ts=2024-11-02T16:05:45.5573293Z caller=reporter.go:257 msg="reporting cluster stats" date=2024-11-02T16:05:45.557324874Z
level=info ts=2024-11-02T16:05:45.904376455Z caller=poller.go:256 msg="successfully pulled tenant index" tenant=single-tenant createdAt=2024-11-02T16:00:37.074394709Z metas=547 compactedMetas=28
level=info ts=2024-11-02T16:05:45.904500382Z caller=poller.go:142 msg="blocklist poll complete" seconds=0.371037565
level=info ts=2024-11-02T16:06:38.069118075Z caller=search_handlers.go:167 msg="search results" tenant=single-tenant query="{ resource.service.name =~ \"{MicroService Name}\" && (span.url.path=~\"\" || span.http.route=~\"\" || span.http.target=~\"\") && span.http.request.method=~\".+\" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~\".*\" || \".*\" = \".*\") }" range_seconds=43200 duration_seconds=86.199097056 request_throughput=4.140089054159755e+07 total_requests=409 total_blockBytes=29200227564 total_blocks=31 completed_requests=409 inspected_bytes=3568719382 inspected_traces=0 inspected_spans=0 status_code=-1 error=null

In this example, I am only filtering by the microservice name. If I also filter by API path, the query takes even longer.

And here is the streaming result shown in my Grafana frontend: [image]

I can give more information if you need, thanks!!!

g3david405 commented 2 weeks ago

Any update?

g3david405 commented 1 week ago

Still waiting for the reply, thx

joe-elliott commented 1 week ago

Howdy, apologies for letting this sit. Been overwhelmed a bit by repo activity lately.

{ resource.service.name =~ "$Job" && (span.url.path=~"$Url" || span.http.route=~"$Url" || span.http.target=~"$Url") && span.http.request.method=~"$Method" && span.http.response.status_code>499 && (resource.k8s.namespace.name=~"$Namespace" || "$Namespace" = ".*") }

This is a brutal query. The obvious issue is the number of regexes. The less obvious issue is that Tempo is really fast at evaluating a bunch of &&'ed conditions and way slower at mixed sets of conditions.
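
As a rough sketch of what that advice implies (not a drop-in replacement, and the concrete values here are placeholders), the same search gets much cheaper when every variable can be pinned to an exact value and the ||'ed groups and the always-true "$Namespace" = ".*" escape hatch are dropped, leaving only &&'ed exact-match conditions:

{ resource.service.name = "my-service" && span.http.request.method = "GET" && span.url.path = "/api/v1/orders" && span.http.response.status_code > 499 }

In dashboard terms, that roughly means resolving the template variables to concrete values (or leaving a condition out when a variable is unset) instead of substituting .* regexes into the query.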

The good news:

Expect this query to improve in 2.7

g3david405 commented 5 days ago

Thanks Joe, I'm really looking forward to the release of version 2.7!