grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Tempo drops spans when there are too many unconsumed spans on the Kafka topic #4122

Open d-mankowski-synerise opened 1 month ago

d-mankowski-synerise commented 1 month ago

We're pushing spans (with the OpenTelemetry Collector) to Kafka and consuming them with Tempo 2.5.0. We scaled an entire Tempo deployment down to 0 replicas for ~1h, and during that time the lag for the consumer group otel-collector (i.e. the Tempo distributor) started to grow:

[Image: Kafka consumer group lag for otel-collector growing during the scale-down]

When we scaled the deployment back up, I expected Tempo to slowly consume all spans/traces from the Kafka topic and ingest them into the blob storage. Instead, Tempo seems to have consumed all of them almost instantaneously and dropped most of them:

[Image: Kafka lag consumed almost instantly after scaling back up]

due to exceeding the live traces limit:

[Image: spans discarded due to the live traces limit]

Logs from tempo-distributor:

level=warn ts=2024-09-25T10:14:09.830688721Z caller=instance.go:49 msg="LIVE_TRACES_EXCEEDED: max live traces exceeded for tenant single-tenant: per-user traces limit (local: 30000 global: 0 actual local: 30000) exceeded"
level=warn ts=2024-09-25T10:14:09.83069342Z caller=instance.go:49 msg="LIVE_TRACES_EXCEEDED: max live traces exceeded for tenant single-tenant: per-user traces limit (local: 30000 global: 0 actual local: 30000) exceeded"

I don't see a way to throttle the number of fetched messages in kafkareceiver (which wouldn't make much sense anyway), nor in Tempo's Helm chart.

Is there a way to fix this behavior? Of course, one option is to bump the live traces limit, but IMO that is a workaround, not a proper solution.
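
For illustration, that workaround would just mean raising the per-tenant override that is already set in the values below; the number here is only an example, not a recommendation:

global_overrides:
  defaults:
    ingestion:
      max_traces_per_user: 60000 # example only; the default is 10000 and we already run 30000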

Tempo's config (I skipped resource requests/limits, tolerations, node selectors, etc., as they are not important):

---
rbac:
  create: true

ingester:
  replicas: 3
  persistence:
    enabled: true
    size: 32Gi
    storageClass: managed-premium

metricsGenerator:
  enabled: true
  replicas: 2
  config:
    storage:
      remote_write_add_org_id_header: false # We have multitenancy disabled in Tempo because of Kafka, so don't forward default org ID to Mimir.
      remote_write:
        - url: https://xxx.com:8080/api/v1/push
          remote_timeout: 30s
          send_exemplars: true
          headers:
            X-Scope-OrgID: xxx

distributor:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetCPUUtilizationPercentage: 60
  receivers:
    otlp:
      grpc:
        max_recv_msg_size_mib: 15

compactor:
  replicas: 2
  config:
    compaction:
      block_retention: 888h # 7 days hot, 30 days cool

querier:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetCPUUtilizationPercentage: 60
  config:
    trace_by_id:
      # -- Timeout for trace lookup requests
      query_timeout: 10s
    search:
      # -- Timeout for search requests
      query_timeout: 30s
    # -- This value controls the overall number of simultaneous subqueries that the querier will service at once. It does not distinguish between the types of queries.
    max_concurrent_queries: 20

queryFrontend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetCPUUtilizationPercentage: 60

multitenancyEnabled: false

traces:
  jaeger:
    grpc:
      enabled: false
    thriftBinary:
      enabled: false
    thriftCompact:
      enabled: false
    thriftHttp:
      enabled: false
  zipkin:
    enabled: false
  otlp:
    http:
      enabled: false
    grpc:
      enabled: false
  opencensus:
    enabled: false
  # -- Enable Tempo to ingest traces from Kafka. Reference: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/kafkareceiver
  kafka:
    protocol_version: 2.5.0 # Kafka protocol version
    brokers: 1.2.3.4:9092,4.3.2.1:9092 
    topic: otlp_spans
    encoding: otlp_proto
    group_id: otel-collector
    client_id: otel-collector
    initial_offset: latest

server:
  logLevel: info
  logFormat: logfmt
  grpc_server_max_recv_msg_size: 30000000
  grpc_server_max_send_msg_size: 30000000
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s

storage:
  trace:
    backend: azure
    azure:
      container_name: traces
      storage_account_name: xxx
      use_federated_token: true
    blocklist_poll_tenant_index_builders: 1
    blocklist_poll_jitter_ms: 500
    pool:
      max_workers: 400
      queue_depth: 20000

minio:
  enabled: false

memcached:
  enabled: true
  replicas: 2

memcachedExporter:
  enabled: true

metaMonitoring:
  serviceMonitor:
    enabled: true
    namespaceSelector:
      matchNames:
        - tempo
    interval: 30s
    scrapeTimeout: 20s

gateway:
  enabled: true
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetCPUUtilizationPercentage: 60
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - host: xxx.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: tempo-gateway-tls-certificate
        hosts:
          - xxx.com

global_overrides:
  defaults:
    ingestion:
      max_traces_per_user: 30000 # Default 10k.
    global:
      max_bytes_per_trace: 30000000 # 30MB. Default 5MB.
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

joe-elliott commented 1 month ago

This is an interesting issue that I don't think we have much experience with internally. We never run Tempo consuming directly from a Kafka topic the way you have it configured.

Does the otel Kafka receiver respond to any return value that would help? For instance, can we return a special error that tells it not to advance its cursor because the data was not successfully saved?

If there is some feature we can "exploit", we could make a change on the Tempo side to slow ingestion down and correctly consume a backed-up queue.

jkrzemin commented 1 month ago

Do you have any recommendations for what could act as a proxy between Kafka and Tempo? We are using this in production, and one of the huge benefits of using Kafka this way is the buffering it provides; with this bug, the setup doesn't live up to expectations.

joe-elliott commented 1 month ago

I would recommend using the OTel Collector:

kafka -> otel collector -> tempo

Hopefully the collector does a better job of draining the queue?
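
A rough sketch of what that intermediate collector could look like, reusing the Kafka settings from the values above (the Tempo endpoint, TLS setting, and queue/retry numbers are placeholders to adjust, not tested values):

receivers:
  kafka:
    protocol_version: 2.5.0
    brokers:
      - 1.2.3.4:9092
      - 4.3.2.1:9092
    topic: otlp_spans
    encoding: otlp_proto
    group_id: otel-collector
    client_id: otel-collector
    initial_offset: latest

exporters:
  otlp:
    endpoint: tempo-distributor.tempo.svc.cluster.local:4317 # placeholder Tempo OTLP gRPC endpoint
    tls:
      insecure: true # placeholder; enable TLS as appropriate
    retry_on_failure:
      enabled: true # retry failed pushes instead of discarding them
    sending_queue:
      enabled: true
      queue_size: 5000 # example size; the queue absorbs bursts while Tempo catches up

service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [otlp]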

You may also be interested in an upcoming rearchitecture that would add a queue to Tempo directly and may allow you to drop your external Kafka.

d-mankowski-synerise commented 1 month ago

> drop your external kafka.

We're using the same Kafka cluster as a buffer for logs as well (fluentbit -> kafka -> promtail -> loki), and one of the benefits of having Kafka is that we don't have to pay for load balancer traffic between the OTEL collector and Loki or Tempo (Loki and Tempo are deployed on a different K8s cluster than the apps that send telemetry signals, and the Kafka driver used by the OTEL collector has built-in autodiscovery). Although in the case of Tempo, I don't think it would be a significant cost.