grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Periodic errors during searching for traces #3966

Open Andrewmakmaer opened 1 month ago

Andrewmakmaer commented 1 month ago

Describe the bug

I set up Tempo on two nodes in scalable-single-binary mode. Some time later, when I request a trace, I get an error like this:

[root@tr-tempo-02v] tempo-cli --config-file=/etc/tempo/config.yml query api trace-id http://tr-tempo-02v:3200 66bb0b8bd2dfcabe4dc8be0eb9e808bf
tempo-cli: error: main.queryTraceIDCmd.Run(): GET request to http://tr-tempo-02v:3200/api/traces/66bb0b8bd2dfcabe4dc8be0eb9e808bf failed with response: 500 body: error finding trace by id, blockID: b42d164b-a051-48d2-a621-337cfb47a88e: error retrieving bloom bloom-1 (single-tenant, b42d164b-a051-48d2-a621-337cfb47a88e): does not exist

The problem occurs intermittently: most often on both nodes at once or on only one of them, and only rarely on neither. The errors reference the same block: b42d164b-a051-48d2-a621-337cfb47a88e
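To confirm that every failure points at the same block, the 500 response bodies can be parsed and grouped. The following is a minimal sketch, not part of Tempo or tempo-cli: the function name `parse_bloom_error` is hypothetical, and the regular expression is modelled only on the single error message quoted above, so other Tempo versions may word it differently.

```python
import re

# Pattern modelled on the one error body quoted in this issue (assumption:
# the wording is stable across requests on this Tempo version).
BLOOM_ERROR = re.compile(
    r"blockID: (?P<block>[0-9a-f-]{36}): "
    r"error retrieving bloom (?P<bloom>bloom-\d+) "
    r"\((?P<tenant>[^,]+), [0-9a-f-]{36}\)"
)


def parse_bloom_error(body):
    """Return (tenant, block_id, bloom_name), or None if the body does not match."""
    match = BLOOM_ERROR.search(body)
    if match is None:
        return None
    return match.group("tenant"), match.group("block"), match.group("bloom")


# Error body copied from the tempo-cli output above.
body = (
    "error finding trace by id, blockID: b42d164b-a051-48d2-a621-337cfb47a88e: "
    "error retrieving bloom bloom-1 (single-tenant, "
    "b42d164b-a051-48d2-a621-337cfb47a88e): does not exist"
)
print(parse_bloom_error(body))
```

Collecting the returned tuples over many failed requests would show whether one block or several are affected.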

I can list information about this block:

[root@tr-tempo-02v] tempo-cli --config-file=/etc/tempo/config.yml list block single-tenant b42d164b-a051-48d2-a621-337cfb47a88e
ID            :  b42d164b-a051-48d2-a621-337cfb47a88e
Version       :  vParquet3
Total Objects :  203319
Data Size     :  29 MB
Encoding      :  none
Level         :  0
Window        :  478598
Start         :  2024-08-06 17:15:46
End           :  2024-08-06 17:26:04
Duration      :  10m18s
Age           :  194h21m48s

But I can't get anything more. (This block does not exist in S3 storage, and I can't find it on the local machines.)

[root@tr-tempo-02v] tempo-cli --config-file=/etc/tempo/config.yml view schema single-tenant b42d164b-a051-48d2-a621-337cfb47a88e

***********     block meta    *****************

&{Version:vParquet3 BlockID:b42d164b-a051-48d2-a621-337cfb47a88e MinID:[0 0 28 50 131 157 0 73 171 96 250 31 119 213 31 178] MaxID:[255 255 146 212 218 113 221 161 17 132 15 243 199 251 30 89] TenantID:single-tenant StartTime:2024-08-06 17:15:46 +0300 MSK EndTime:2024-08-06 17:26:04 +0300 MSK TotalObjects:203319 Size:28644835 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:3 FooterSize:14006 DedicatedColumns:[] ReplicationFactor:0}
tempo-cli: error: main.viewSchemaCmd.Run(): reading magic header of parquet file: error in range read from s3 backend, bucket: tempostoraje, objName: single-tenant/b42d164b-a051-48d2-a621-337cfb47a88e/data.parquet: The specified key does not exist.

[root@tr-tempo-02v] tempo-cli --config-file=/etc/tempo/config.yml analyse block single-tenant b42d164b-a051-48d2-a621-337cfb47a88e
tempo-cli: error: main.analyseBlockCmd.Run(): reading magic header of parquet file: error in range read from s3 backend, bucket: tempostoraje, objName: single-tenant/b42d164b-a051-48d2-a621-337cfb47a88e/data.parquet: The specified key does not exist.
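The outputs above suggest a partially present block: the metadata can still be read, but `data.parquet` and at least `bloom-1` cannot. A small helper can make that explicit by comparing a bucket listing against the objects a block is expected to contain. This is a sketch under assumptions: the key layout `<tenant>/<block-id>/<object>` is taken from the error messages, `meta.json` as the metadata object name and zero-based `bloom-N` shard names are assumptions, and both function names are hypothetical. The listing itself would come from whatever S3 client is in use.

```python
def expected_block_objects(tenant, block_id, bloom_shards):
    """Build the object keys a block is assumed to consist of."""
    prefix = f"{tenant}/{block_id}/"
    # Assumption: shards are named bloom-0 .. bloom-(N-1), as "bloom-1" suggests.
    names = ["meta.json", "data.parquet"]
    names += [f"bloom-{i}" for i in range(bloom_shards)]
    return [prefix + name for name in names]


def missing_objects(existing_keys, tenant, block_id, bloom_shards):
    """Return the expected keys that are absent from the bucket listing."""
    existing = set(existing_keys)
    expected = expected_block_objects(tenant, block_id, bloom_shards)
    return [key for key in expected if key not in existing]


tenant = "single-tenant"
block_id = "b42d164b-a051-48d2-a621-337cfb47a88e"

# Hypothetical listing for illustration only: just the metadata object is left.
listing = [f"{tenant}/{block_id}/meta.json"]

# BloomShardCount:3 comes from the block meta printed above.
for key in missing_objects(listing, tenant, block_id, bloom_shards=3):
    print("missing:", key)
```

An empty result would mean the block is complete under these assumptions; a result that includes every key would mean the block is gone entirely and only a cached block list still references it.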

To Reproduce

Steps to reproduce the behavior:

  1. Start Tempo: version="(version=2.5.0, branch=HEAD, revision=46dad3411)"
  2. Perform Operations: Read

Expected behavior

Consistently successful trace search.

Environment:

Additional Context

This is my configuration file:

stream_over_http_enabled: true
usage_report:
  reporting_enabled: false
server:
  http_listen_port: 3200
  log_level: info

query_frontend:
  multi_tenant_queries_enabled: false
  max_retries: 3
  trace_by_id:
    duration_slo: 5s

distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:  
    jaeger:                          
      protocols:                      
        thrift_http:                   
        grpc:                      
        thrift_binary:
        thrift_compact:
    zipkin:
    otlp:
      protocols:
        http:
        grpc:
    opencensus:

ingester:
  max_block_duration: 5m
  flush_all_on_shutdown: true

compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    block_retention: 960h                # overall Tempo trace retention. set for demo purposes
    compacted_block_retention: 30m

memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members:
    - tr-tempo-02v:7946
    - tr-tempo-03v:7946

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://vminser:8480/insert/0/prometheus/api/v1/write
        send_exemplars: true

storage:
  trace:
    backend: s3                     
    s3:
      endpoint: tempo.ipt-lab.ru:9002
      bucket: tempostoraje
      forcepathstyle: true
      insecure: true
      access_key: key
      secret_key: key
    wal:
      path: /tmp/tempo/wal             # where to store the WAL locally
    local:
      path: /tmp/tempo/blocks

querier:
  frontend_worker:
    frontend_address: localhost:9095
  search:
    query_timeout: 1m

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics] # enables metrics generator

I have experimented a lot with the Tempo and MinIO configuration, so the block may have been removed outside of Tempo. I would like to at least understand how to fix this problem.
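One check that can be made from the figures already posted is whether the block is old enough for retention to apply. The sketch below compares the reported block age (194h21m48s) with `block_retention` (960h) from the configuration. It assumes retention is measured against roughly the same age that tempo-cli prints; `parse_duration` is a hypothetical helper that handles only `h`, `m`, and `s` units, not the full Go duration syntax.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text):
    """Parse a Go-style duration restricted to h/m/s units, e.g. '194h21m48s'."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input with anything besides consecutive <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


block_age = parse_duration("194h21m48s")   # "Age" from the list block output
block_retention = parse_duration("960h")   # compactor.compaction.block_retention

print("block younger than retention:", block_age < block_retention)
```

If the block is younger than the retention window, retention alone would not be expected to have removed it, which is consistent with the block having been deleted some other way. This does not identify what actually removed the objects.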

joe-elliott commented 1 month ago

But I can't get anything more. (This block does not exist in S3 storage, and I can't find it on the local machines.)

It seems like tempo-cli found it? Are you sure it's not in S3 storage? It's possible you have a corrupt block. Simply removing it may solve your issue.