grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

error writing object to s3 backend #4361

Open benmathews opened 5 days ago

benmathews commented 5 days ago

Describe the bug
On October 24, my tempo-ingester pods started throwing the errors below, and ingester and compactor latency increased significantly (from a couple hundred ms to multiple seconds).

level=error caller=flush.go:233 org_id=single-tenant msg="error performing op in flushQueue" op=1 block=77c398c8-cc47-4764-a995-fe0de5760e7d attempts=1 err="error copying block from local to remote backend: error writing object to s3 backend, object tempo/single-tenant/77c398c8-cc47-4764-a995-fe0de5760e7d/data.parquet: context deadline exceeded"

This does not correspond to any software, config, or network change that I can identify. We are still writing to S3, but slowly. I can't tell whether the deadline-exceeded blocks get retried or dropped.
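One knob that may be worth checking is the ingester flush timeout, which bounds how long a completed block is allowed to take to upload to the backend before its context is cancelled. Below is a minimal values.yaml sketch, not a verified fix: it assumes the tempo-distributed chart merges tempo.structuredConfig into the generated Tempo config (the same mechanism the overrides in the values below already use) and assumes this Tempo version exposes a flush_op_timeout setting under the ingester block; confirm the option name and default in the configuration docs for your version before applying.

tempo:
  structuredConfig:
    ingester:
      # assumed setting: upper bound on a single flush operation to object storage;
      # raising it gives slow S3 uploads more headroom before "context deadline exceeded"
      flush_op_timeout: 10m

Raising the timeout would only mask the symptom; it does not explain why S3 write latency increased in the first place.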

To Reproduce
Steps to reproduce the behavior: normal operation reproduces the behavior.

Environment:

➜ helm history tempo
REVISION    UPDATED                     STATUS      CHART                       APP VERSION DESCRIPTION     
140         Mon Oct 14 16:26:31 2024    superseded  tempo-distributed-1.18.4    2.6.0       Upgrade complete
141         Wed Nov 20 14:32:00 2024    superseded  tempo-distributed-1.22.1    2.6.0       Upgrade complete
142         Wed Nov 20 14:57:16 2024    deployed    tempo-distributed-1.22.1    2.6.0       Upgrade complete

Additional Context
values.yaml overrides

USER-SUPPLIED VALUES:
compactor:
  config:
    compaction:
      max_time_per_tenant: 15m
  replicas: 12
  resources:
    requests:
      cpu: 600m
      memory: 2Gi
distributor:
  replicas: 6
  resources:
    requests:
      cpu: 2
      memory: 1500Mi
ingester:
  persistence:
    enabled: true
    inMemory: false
    size: 30Gi
    storageClass: null
  replicas: 30
  resources:
    requests:
      cpu: 1
      memory: 5Gi
memcached:
  replicas: 3
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
memcachedExporter:
  enabled: true
metaMonitoring:
  serviceMonitor:
    enabled: true
metricsGenerator:
  enabled: false
prometheusRule:
  enabled: true
querier:
  config:
    max_concurrent_queries: 40
    search:
      query_timeout: 1m
    trace_by_id:
      query_timeout: 1m
  replicas: 40
  resources:
    requests:
      cpu: 50m
      memory: 2Gi
query_frontend:
  max_outstanding_per_tenant: 4000
queryFrontend:
  config:
    search:
      concurrent_jobs: 5000
  replicas: 2
  resources:
    requests:
      cpu: 10m
      memory: 150Mi
reportingEnabled: false
server:
  http_server_read_timeout: 4m
  http_server_write_timeout: 4m
storage:
  trace:
    backend: s3
    pool:
      queue_depth: 50000
    s3:
      access_key: *******************
      bucket: *****************
      endpoint: s3.us-west-2.amazonaws.com
      prefix: tempo
      secret_key: *********************
tempo:
  structuredConfig:
    overrides:
      defaults:
        ingestion:
          burst_size_bytes: 800000000
          max_traces_per_user: 3000000
          rate_limit_bytes: 600000000
traces:
  otlp:
    grpc:
      enabled: true
    http:
      enabled: true
joe-elliott commented 4 days ago

If compactors and ingesters were simultaneously having issues speaking with object storage, this suggests a networking or object storage issue.

I can't tell whether the deadline-exceeded blocks get retried or dropped.

They are retried.