grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

tempo memcached via helm deployment gives i/o timeout #3686

Closed: sundar-ka closed this issue 3 months ago

sundar-ka commented 5 months ago

Describe the bug
We are new to Tempo, and we recently deployed it via the Helm chart in a Kubernetes cluster. We use Grafana to view traces. After the deployment we noticed that we get timeouts when the Tempo querier component tries to get/set keys in memcached:

level=error ts=2024-05-17T10:34:38.433823747Z caller=memcached.go:153 msg="Failed to get keys from memcached" err="read tcp 10.20.156.103:37050->172.20.198.213:11211: i/o timeout"
level=error ts=2024-05-17T10:34:38.434170592Z caller=memcached.go:153 msg="Failed to get keys from memcached" err="read tcp 10.20.156.103:37048->172.20.198.213:11211: i/o timeout"

level=error ts=2024-05-17T10:36:38.334049743Z caller=memcached.go:236 msg="failed to put to memcached" name=parquet-footer|bloom|frontend-search err="server=172.20.198.213:11211: read tcp 10.20.156.103:39990->172.20.198.213:11211: i/o timeout"
level=error ts=2024-05-17T10:36:38.334101886Z caller=memcached.go:236 msg="failed to put to memcached" name=parquet-footer|bloom|frontend-search err="server=172.20.198.213:11211: read tcp 10.20.156.103:36876->172.20.198.213:11211: i/o timeout"

However, when we telnet to the memcached service cluster IP on port 11211, it connects fine:

telnet 172.20.198.213 11211

So we are still able to query traces, but they run slowly because results are not being cached in memcached. We need your help, please.
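
For reference, the same check can be run from inside the cluster with a throwaway pod. This is only a sketch; the tempo namespace and the tempo-distributed-memcached service name are assumptions taken from the configuration below, so adjust them if yours differ:

# Start a temporary shell inside the cluster (namespace and image are assumptions)
kubectl run memcached-check -n tempo --rm -it --restart=Never --image=busybox -- sh

# Inside the pod: resolve the memcached service and repeat the telnet test against its DNS name
nslookup tempo-distributed-memcached
telnet tempo-distributed-memcached 11211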

To Reproduce
Deploy the tempo-distributed chart via Helm in a Kubernetes cluster.
Chart version: v1.9.9
App version: v2.4.1
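
For reference, a minimal deployment sketch with that chart version; the release name, namespace, and values.yaml are assumptions, with values.yaml expected to carry the configuration shown under Additional Context:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install chart version 1.9.9 of tempo-distributed (release name and namespace are placeholders)
helm install tempo-distributed grafana/tempo-distributed \
  --version 1.9.9 --namespace tempo --create-namespace -f values.yaml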

Expected behavior
Timeouts shouldn't appear, and the querier should be able to get/set keys in memcached.

Environment:

Additional Context
Tempo configuration that produces the timeouts:

cache:
  caches:
  - memcached:
      consistent_hash: true
      host: 'tempo-distributed-memcached'
      service: memcached-client
      timeout: 500ms
    roles:
    - parquet-footer
    - bloom
    - frontend-search
compactor:
  compaction:
    block_retention: 48h
    compacted_block_retention: 1h
    compaction_cycle: 30s
    compaction_window: 1h
    max_block_bytes: 107374182400
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 10
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 20971520
    v2_prefetch_traces_count: 1000
  ring:
    kvstore:
      store: memberlist
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  ring:
    kvstore:
      store: memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr: []
  bind_port: 7946
  gossip_interval: 1s
  gossip_nodes: 2
  gossip_to_dead_nodes_time: 30s
  join_members:
  - tempo-distributed-gossip-ring.tempo.svc.cluster.local:7946
  leave_timeout: 5s
  left_ingesters_timeout: 5m
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  node_name: ""
  packet_dial_timeout: 5s
  packet_write_timeout: 5s
  pull_push_interval: 30s
  randomize_node_name: false
  rejoin_interval: 0s
  retransmit_factor: 2
  stream_timeout: 10s
metrics_generator:
  metrics_ingestion_time_range_slack: 30s
  processor:
    service_graphs:
      dimensions: []
      histogram_buckets:
      - 0.1
      - 0.2
      - 0.4
      - 0.8
      - 1.6
      - 3.2
      - 6.4
      - 12.8
      max_items: 10000
      wait: 10s
      workers: 10
    span_metrics:
      dimensions: []
      histogram_buckets:
      - 0.002
      - 0.004
      - 0.008
      - 0.016
      - 0.032
      - 0.064
      - 0.128
      - 0.256
      - 0.512
      - 1.02
      - 2.05
      - 4.1
  registry:
    collection_interval: 15s
    external_labels: {}
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/wal
    remote_write: []
    remote_write_flush_deadline: 1m
  traces_storage:
    path: /var/tempo/traces
multitenancy_enabled: true
overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml
querier:
  frontend_worker:
    frontend_address: tempo-distributed-query-frontend-discovery:9095
  max_concurrent_queries: 20
  search:
    external_backend: null
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 300s
  trace_by_id:
    query_timeout: 300s
query_frontend:
  max_outstanding_per_tenant: 2000
  max_retries: 2
  search:
    concurrent_jobs: 1000
    target_bytes_per_job: 104857600
  trace_by_id:
    hedge_requests_at: 2s
    hedge_requests_up_to: 2
    query_shards: 50
server:
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 3100
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  log_format: logfmt
  log_level: info
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    pool:
      max_workers: 400
      queue_depth: 20000
    s3:
      bucket: our-test-bucket
      endpoint: s3.us-east-1.amazonaws.com
      prefix: traces/
      region: us-east-1
    wal:
      path: /var/tempo/wal
usage_report:
  reporting_enabled: true
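
One mitigation we could try, assuming the 500ms client timeout in the cache block above is simply too tight for this network path, is to give the memcached client more headroom. This is only a sketch, not a confirmed fix:

cache:
  caches:
  - memcached:
      consistent_hash: true
      host: 'tempo-distributed-memcached'
      service: memcached-client
      # Raised from 500ms as an experiment; the value is a guess, not a verified fix
      timeout: 1s
    roles:
    - parquet-footer
    - bloom
    - frontend-search
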
edgarkz commented 5 months ago

Same here. I have a discussion open with some suggestions and workarounds; please take a look as well: https://github.com/grafana/tempo/discussions/3553

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.