grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Loki transient query timeout after inactivity #14240

Open emanuelioanm opened 2 months ago

emanuelioanm commented 2 months ago

Describe the bug
I am testing out a simple Loki configuration based on 2-S3-Cluster-Example.yaml. It works well locally (using MinIO for storage) but behaves a bit differently on AWS instances. For some reason, I get timeouts when running queries right after starting the Loki Docker container, or when running queries after a period of inactivity (by inactivity I mean a period where I am continuously ingesting data through Promtail but not running any queries). This goes on for something like 5 minutes, after which queries just start working. The query range doesn't matter; it behaves exactly the same way for 6h and for 30d. Log volume is fairly small, just a couple of tens of MB in total, so small range queries could be processing tens or hundreds of KB and still time out.

Pasting my configuration below:

auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
  instance_addr: 127.0.0.1
  replication_factor: 1

ingester:
  max_chunk_age: 72h
  wal:
    dir: /loki/wal

schema_config:
  configs:
    - from: "2020-10-24"
      index:
        period: 24h
        prefix: index_
      object_store: s3
      schema: v13
      store: tsdb

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    s3: s3://<<REGION>>/<<BUCKET_NAME>>
    s3forcepathstyle: true

query_scheduler:
  # the TSDB index dispatches many more, but each individually smaller, requests.
  # We increase the pending request queue sizes to compensate.
  max_outstanding_requests_per_tenant: 32768

querier:
  # Each `querier` component process runs a number of parallel workers to process queries simultaneously.
  # You may want to adjust this up or down depending on your resource usage
  # (more available cpu and memory can tolerate higher values and vice versa),
  # but we find the most success running at around `16` with tsdb
  max_concurrent: 16

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2160h
  retention_delete_worker_count: 150
  delete_request_store: s3

limits_config:
  reject_old_samples: false
  max_query_length: 2160h # Default: 721h
  retention_period: 2160h
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 128
  per_stream_rate_limit: 8MB
  per_stream_rate_limit_burst: 32MB

I am using Loki as a datasource for Grafana (running on the same instance). Labels are few and have low cardinality, so I don't think that should be an issue (I've got 4 indexed labels: environment, job, service_name, status).

I really hope I am doing something wrong in the config above. I haven't changed any timeouts in the Grafana datasource config yet, as this shouldn't happen at such low data volumes and I don't want to mask the issue that way.
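
If I understand the error correctly, the 30s cutoff in the message below lines up with Grafana's default data source HTTP timeout of 30 seconds. For reference, this is roughly what that knob would look like if the Loki data source were provisioned as code (the name, URL and 300-second value here are only an illustration of the setting I'm deliberately not touching, not my actual setup):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    jsonData:
      maxLines: 1000
      timeout: 300 # HTTP request timeout in seconds; leaving this unset keeps Grafana's 30s default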

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki (grafana/loki:3.2.0)
  2. Tried a simple query such as {job="nginx"} over the last 24 hours (which processed 139.3 KiB of data in my case)
  3. Got loki net/http: request canceled (Client.Timeout exceeded while awaiting headers) after 30s
  4. Retried a few times over the next 5 minutes until it started working, after which any query range works fine (e.g. 30d which processed 30.1 MiB of data in my case)
  5. Closed Grafana and waited around half a day
  6. Got the exact same behavior starting at step 2

Expected behavior: Queries (especially small ones processing just a couple thousand log lines) should work without timing out.

Environment:

j-vizcaino commented 1 month ago

I've been facing random query timeouts in our Loki installation as well and was able to solve the issue by increasing the number of memcached pods used for chunks and results, as well as bumping the concurrency.

Namely:

chunksCache:
  replicas: 2
  parallelism: 64
  writebackParallelism: 2
resultsCache:
  replicas: 2
  timeout: "2s"
  writebackParallelism: 2

Maybe there's an issue with the default values in the Helm chart being too low.

Someone with deeper Loki expertise will likely be able to assess whether this hides a bug in the read path of the Loki backend (i.e. not coping well with exhausted memcached connections).
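
In case it helps anyone running Loki from a plain config file rather than the Helm chart (the configuration in the original report has no external memcached at all), the roughly equivalent knobs live in Loki's cache blocks. A minimal sketch, assuming a memcached instance is actually deployed; the addresses and numbers below are placeholders, not a tested setup:

chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 256
      parallelism: 64            # intended to correspond to the chart's `parallelism`
    memcached_client:
      addresses: dns+chunks-cache.monitoring.svc:11211   # placeholder address
      timeout: 2s
      max_idle_conns: 64
    background:
      writeback_goroutines: 2    # intended to correspond to `writebackParallelism`

query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        addresses: dns+results-cache.monitoring.svc:11211 # placeholder address
        timeout: 2s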