grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki label or log queries time out after 1 minute #11228

Open wenlinhe opened 12 months ago

wenlinhe commented 12 months ago

Describe the bug
Label and log queries time out after 1 minute since a Loki upgrade in recent weeks. Previously the labels query returned very quickly, but now it hangs and times out at exactly 1 minute.

time logcli labels --since=24h
2023/11/14 18:17:41 https://<url>/loki/api/v1/labels?end=1700014661166745910&start=1699928261166745910
2023/11/14 18:18:41 error sending request Get "https://<url>/loki/api/v1/labels?end=1700014661166745910&start=1699928261166745910": EOF
2023/11/14 18:18:41 Error doing request: run out of attempts while querying the server

real    1m0.022s
user    0m0.144s
sys 0m0.056s
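
To rule out logcli itself, the same endpoint can be hit directly with curl (just a sketch, reusing the redacted <url> and the nanosecond start/end values from the request above, with a 5-minute client-side cap):

curl -sS --max-time 300 \
  "https://<url>/loki/api/v1/labels?start=1699928261166745910&end=1700014661166745910"

If this also drops at exactly 60 seconds, the cutoff most likely comes from whatever sits in front of Loki (gateway, route, or load balancer) rather than from logcli.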

Expected behavior

  1. The labels query should return quickly.
  2. How can this 1-minute timeout be changed? I already set timeout=20m in the config (see the end of this post and the snippet after this list).
  3. How can all of these "context canceled" errors be resolved?
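
For reference, these are the timeouts I already raised to 20m (copied from the full config at the end of this post). If a fixed 60-second cutoff ignores all of them, my guess is that something in front of Loki is closing the connection; the OpenShift route annotation at the end is only a hypothetical example of such an external timeout, not something I have confirmed:

querier:
  query_timeout: 20m
  engine:
    timeout: 20m
limits_config:
  query_timeout: 20m
server:
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
# hypothetical: if the 60s limit comes from an OpenShift route in front of Loki
# oc annotate route <loki-route-name> haproxy.router.openshift.io/timeout=20m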

Environment:


I started seeing this problem in recent weeks, after a Loki upgrade. Loki is installed in OpenShift (Kubernetes), v2.9.2.

I can only get labels for the last few hours; any longer range fails with a timeout at 1 minute (I am not sure which setting controls this 1-minute timeout). I have not added any new Promtail labels. When querying labels, there are a number of "context canceled" errors in the query-frontend, scheduler, and querier pods, see below:

Frontend:

level=info ts=2023-11-15T02:10:26.847814729Z caller=frontend_scheduler_worker.go:107 msg="adding connection to scheduler" addr=172.21.13.20:9095
level=debug ts=2023-11-15T02:10:26.848385759Z caller=ring_watcher.go:93 component=frontend-scheduler-worker msg="removing connection to address: 172.21.10.68:9095"
level=info ts=2023-11-15T02:10:26.848412309Z caller=frontend_scheduler_worker.go:134 msg="removing connection to scheduler" addr=172.21.10.68:9095
level=debug ts=2023-11-15T02:10:26.848508362Z caller=frontend_scheduler_worker.go:282 msg="stream context finished" err="context canceled"
level=debug ts=2023-11-15T02:10:26.848570548Z caller=frontend_scheduler_worker.go:282 msg="stream context finished" err="context canceled"
level=debug ts=2023-11-15T02:10:26.848612475Z caller=frontend_scheduler_worker.go:282 msg="stream context finished" err="context canceled"

Scheduler:

level=debug ts=2023-11-15T02:11:56.172109206Z caller=scheduler.go:405 msg="querier connected" querier=loki-loki-distributed-querier-5f9797d798-ckxbc
level=debug ts=2023-11-15T02:11:56.172123388Z caller=scheduler.go:405 msg="querier connected" querier=loki-loki-distributed-querier-5f9797d798-ckxbc
level=debug ts=2023-11-15T02:11:56.172323748Z caller=grpc_logging.go:76 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=27.090504991s err="context canceled" msg=gRPC
level=debug ts=2023-11-15T02:11:56.172371672Z caller=grpc_logging.go:76 method=/schedulerpb.SchedulerForQuerier/QuerierLoop duration=27.089859711s err="context canceled" msg=gRPC

Querier:

level=error ts=2023-11-15T02:14:10.520057737Z caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=172.21.98.34:9095
level=error ts=2023-11-15T02:14:10.520106449Z caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=172.21.98.34:9095
level=debug ts=2023-11-15T02:14:10.509859034Z caller=util.go:38 msg="querier worker context has been canceled, waiting until there's no inflight query"
level=debug ts=2023-11-15T02:14:10.509993621Z caller=util.go:38 msg="querier worker context has been canceled, waiting until there's no inflight query"
level=debug ts=2023-11-15T02:14:10.510809779Z caller=util.go:38 msg="querier worker context has been canceled, waiting until there's no inflight query"
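
To check whether these cancellations are just the server side reacting to a client disconnect, one experiment (a sketch, assuming the querier Service name matches the one referenced in the config below) is to port-forward past the route and query Loki directly:

kubectl port-forward svc/loki-loki-distributed-querier 3100:3100
logcli --addr=http://localhost:3100 labels --since=24h

If the direct query succeeds for the full 24h range, the 1-minute timeout is being imposed outside Loki.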

My config is below:

common:
  compactor_address: http://loki-loki-distributed-compactor:3100
auth_enabled: false
chunk_store_config:
  max_look_back_period: 744h
  chunk_cache_config:
    embedded_cache:
      enabled: false
    memcached:
      batch_size: 1024
      parallelism: 20960
    memcached_client:
      consistent_hash: true
      addresses: dnssrv+_memcached-client._tcp.loki-loki-distributed-memcached-chunks.<xxx>.svc.cluster.local
  write_dedupe_cache_config:
    memcached:
      batch_size: 1024
      parallelism: 20960
    memcached_client:
      consistent_hash: true
      addresses: dnssrv+_memcached-client._tcp.loki-loki-distributed-memcached-index-writes.<xxx>.svc.cluster.local
compactor:
  compaction_interval: 30m
  retention_delete_delay: 1h
  retention_delete_worker_count: 100
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/retention
distributor:
  ring:
    kvstore:
      store: memberlist
querier:
  max_concurrent: 2976
  query_timeout: 20m
  engine:
    timeout: 20m
  #multi_tenant_queries_enabled:
frontend:
  compress_responses: true
  max_outstanding_per_tenant: 200000000
  scheduler_address: loki-loki-distributed-query-scheduler:9095
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-loki-distributed-querier:3100
frontend_worker:
  scheduler_address: loki-loki-distributed-query-scheduler:9095
  parallelism: 29760
  match_max_concurrent: true
ingester:
  chunk_block_size: 262144
  max_chunk_age: 1440h
  chunk_encoding: snappy
  chunk_idle_period: 30m
  chunk_retain_period: 20m
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 2
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
ingester_client:
  remote_timeout: 10s    
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1000
  ingestion_rate_mb: 200
  ingestion_rate_strategy: local
  max_cache_freshness_per_query: 10m
  reject_old_samples: false
  per_stream_rate_limit: 200M
  per_stream_rate_limit_burst: 1000M  
  max_query_parallelism: 20960
  max_query_series: 20960
  reject_old_samples_max_age: 168h
  retention_period: 8784h
  split_queries_by_interval: 20m
  query_timeout: 20m
memberlist:
  join_members:
  - loki-loki-distributed-memberlist
query_range:
  parallelise_shardable_queries: false
  align_queries_with_step: true
  cache_results: true
  max_retries: 50
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 2048
        validity: 72h
      memcached_client:
        host: loki-loki-distributed-memcached-frontend
        service: memcached-client
        consistent_hash: true
        max_idle_conns: 16
        timeout: 15s
        update_interval: 1m
      embedded_cache:
        enabled: false
        ttl: 72h
query_scheduler:
  use_scheduler_ring: true
ruler:
  alertmanager_url: https://alertmanager.xx
  external_url: https://alertmanager.xx
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
schema_config:
  configs:
  - from: "2020-09-07"
    index:
      period: 24h
      prefix: loki_index_
    object_store: aws
    schema: v11
    store: boltdb-shipper
server:
  http_listen_port: 3100
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
  http_server_idle_timeout: 1200s
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 10000
index_gateway:
  mode: simple
storage_config:
  aws:
    access_key_id: 
    bucketnames: 
    s3: s3://us-east-2
    secret_access_key: 
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    shared_store: s3
    index_gateway_client:
      server_address: dns:///loki-loki-distributed-index-gateway:9095
  index_queries_cache_config:
    memcached:
      batch_size: 400
      parallelism: 2096
    memcached_client:
      addresses: dnssrv+_memcached-client._tcp.loki-loki-distributed-memcached-index-queries.<xxx>.svc.cluster.local
      consistent_hash: true
  filesystem:
    directory: /var/loki/chunks
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
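
One more sanity check (a sketch; /config is the endpoint a running Loki exposes on its HTTP listen port) is to confirm the pods actually picked up the 20m timeouts after the upgrade, for example via the same port-forward as above:

curl -s http://localhost:3100/config | grep -i query_timeout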
duj4 commented 4 months ago

@wenlinhe hi, did you get this fixed?

alexmarshalldw commented 2 months ago

I'm encountering this bug as well. @wenlinhe Have you managed to make any progress on resolving this?