grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

[TraceQL metric] missing values when querying the backend after updating to v2.6.0 #4064

Open ValentinLvr opened 1 week ago

ValentinLvr commented 1 week ago

Describe the bug

Context:

When using the TraceQL metrics feature in Grafana, only the datapoints from the metrics-generator are rendered. For example, when querying something like {resource.service.name="foo"} | rate(), I only see the last 30 min. I tried different query_backend_after values and, as expected, the frontend does not seem to retrieve the values stored in the backend.

I didn't see anything relevant in the logs & traces. I can also see the traces in the S3 bucket, so ingestion seems to be working fine. I tried calling the metrics query-range API directly and all the backend values are still missing.

[Image: histogram showing only the most recent datapoints; the older, backend-served values are missing]

To Reproduce

Steps to reproduce the behaviour:

  1. Install Tempo-distributed chart v1.1.8.0
  2. Use S3 as a backend
  3. Use vParquet3 or vParquet4 (I reproduced with both)
  4. Make a simple TraceQL metrics query in Grafana: {resource.service.name="foo"} | rate()
  5. Observe the missing values in the histogram

Expected behaviour

The histogram rendered by the TraceQL metrics query should be complete, with values retrieved from both the backend and the metrics-generator.

[Image: expected histogram with the full time range populated]

Environment:

Additional Context

# values.yaml
...
tempo-distributed:
  tempo:
    structuredConfig:
      use_otel_tracer: true
      query_frontend:
        metrics:
          concurrent_jobs: 500
          target_bytes_per_job: 2.097e+8
          query_backend_after: 30m
      metrics_generator:
        processor:
          local_blocks:
            filter_server_spans: false # we want all kinds of spans when using TraceQL metrics
        traces_storage:
          path: /var/tempo/generator/traces
  serviceAccount:
    name: tempo
  rbac:
    create: false
  ingester:
    replicas: 5
    autoscaling:
      enabled: false
    podLabels:
      forward-logs: "false"
    resources:
      requests:
        memory: "4500Mi"
        cpu: 1
      limits:
        memory: "4500Mi"
    persistence:
      enabled: false
  # Configuration for the metrics-generator
  metricsGenerator:
    enabled: true
    replicas: 1
    podLabels:
      forward-logs: "false"
    resources:
      requests:
        memory: "6550Mi"
        cpu: "1500m"
      limits:
        memory: "6550Mi"
    processor:
      span_metrics:
        intrinsic_dimensions:
          service: true
          span_name: false
          span_kind: true
          status_code: true
    config:
      storage:
        path: /var/tempo/wal
        remote_write_flush_deadline: 3m
        remote_write:
          - url: foo
            name: prometheus-0
          - url: bar
            name: prometheus-1
  global_overrides:
    defaults:
      ingestion:
        max_traces_per_user: 50000
      metrics_generator:
        processors:
          - service-graphs
          - local-blocks
  distributor:
    replicas: 3
    autoscaling:
      enabled: false
    podLabels:
      forward-logs: "false"
    resources:
      requests:
        memory: "3500Mi"
        cpu: "800m"
      limits:
        memory: "3500Mi"
  compactor:
    replicas: 1
    resources:
      requests:
        memory: "6000Mi"
        cpu: "700m"
      limits:
        memory: "6000Mi"
    config:
      compaction:
        block_retention: 336h #14 days
  querier:
    replicas: 5
    autoscaling:
      enabled: false
    resources:
      requests:
        memory: "5500Mi"
        cpu: 2
      limits:
        memory: "5500Mi"
    config:
      max_concurrent_queries: 20
      search:
        query_timeout: 2m
  queryFrontend:
    replicas: 2
    config:
      search:
        concurrent_jobs: 1000
        target_bytes_per_job: 104857600
    autoscaling:
      enabled: false
    ingress:
      enabled: false
    resources:
      requests:
        memory: "800Mi"
        cpu: "400m"
      limits:
        memory: "800Mi"
  traces:
    otlp:
      http:
        enabled: true
      grpc:
        enabled: true

  server:
    grpc_server_max_recv_msg_size: 4194304
    grpc_server_max_send_msg_size: 4194304
    http_server_read_timeout: 2m # -- Read timeout for HTTP server

  storage:
    trace:
      block:
        version: vParquet3
        dedicated_columns:
          - scope: span
            name: store.id
            type: string
          - scope: span
            name: query.request
            type: string
          - scope: span
            name: environment
            type: string
          - scope: span
            name: query
            type: string
          - scope: span
            name: matchers
            type: string
          - scope: span
            name: grpc.request.request-id
            type: string
          - scope: span
            name: block.id
            type: string
          - scope: span
            name: peer.address
            type: string
          - scope: span
            name: store.addr
            type: string
          - scope: span
            name: target
            type: string
          # Resource-level attributes, sorted by effective size in the former generic key:value columns
          - scope: resource
            name: host.name
            type: string
          - scope: resource
            name: telemetry.sdk.language
            type: string
          - scope: resource
            name: telemetry.sdk.name
            type: string
          - scope: resource
            name: service.version
            type: string
      backend: s3
      pool:
        max_workers: 400
        queue_depth: 20000
      s3:
        bucket: foo
        endpoint: bar
    admin:
      backend: s3

  memcached:
    enabled: true
    image:
      repository: library/memcached
    replicas: 1
    resources:
      requests:
        cpu: 0.4
        memory: "1500Mi"
      limits:
        memory: "1500Mi"
  memcachedExporter:
    enabled: false
  metaMonitoring:
    serviceMonitor:
      enabled: true
  prometheusRule:
    enabled: false
  gateway:
    enabled: true
    replicas: 2
    resources:
      requests:
        cpu: 0.3
        memory: 500Mi
      limits:
        memory: 500Mi
    ingress:
      enabled: false
    basicAuth:
      enabled: false
      username: null
      password: null
      existingSecret: null
joe-elliott commented 1 week ago

We changed the way TraceQL metrics work in Tempo 2.6 to base historical requests off of a set of RF1 blocks written to the backend by the metrics generators:

https://grafana.com/docs/tempo/latest/release-notes/v2-6/#operational-change-for-traceql-metrics

This will greatly improve TraceQL metrics speed, but there will be a temporary increase in TCO due to the additional blocks in the backend. We are attempting to address this holistically by rearchitecting Tempo around an RF1 architecture for both metrics and search.
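
Concretely, the split is controlled by query_backend_after, which is already set in the values.yaml above. A rough sketch of how that knob behaves in 2.6 (my summary, not the full config reference):

query_frontend:
  metrics:
    # data newer than this is served by the metrics-generators from their local blocks;
    # anything older is read from blocks in the object-store backend, which in 2.6 means
    # the RF1 blocks the generators flush there
    query_backend_after: 30m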

Expect updates with the next few releases.

ValentinLvr commented 1 week ago

Thanks for the explanation!

I just set the flush_to_storage parameter to true and I'm now able to see historical data from the backend.

...
metrics_generator:
  processor:
    local_blocks:
      flush_to_storage: true
...
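
For reference, this is roughly where that option lands in the tempo-distributed values.yaml from this issue (same structuredConfig path as above; a sketch, adjust to your chart version):

tempo-distributed:
  tempo:
    structuredConfig:
      metrics_generator:
        processor:
          local_blocks:
            filter_server_spans: false
            # flush completed local blocks to the object-store backend so historical
            # TraceQL metrics queries have something to read
            flush_to_storage: true

With this set, the generators write their local blocks to the same backend that the query frontend now reads for the historical part of a TraceQL metrics query.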

Maybe it's worth mentioning this in the breaking changes section here?

joe-elliott commented 1 week ago

Yup. Good call out.

@knylander-grafana do you mind sneaking this in the breaking changes section when you get a chance?

knylander-grafana commented 1 week ago

Will do! Thank you, @ValentinLvr for the thorough issue!

knylander-grafana commented 5 days ago

Added here: https://github.com/grafana/tempo/releases/tag/v2.6.0