Open emanuelioanm opened 2 months ago
I've been facing random query timeouts in our Loki installation as well and was able to solve the issue by increasing the number of memcached pods used for chunks and results, as well as bumping the concurrency.
Namely:
chunksCache:
  replicas: 2
  parallelism: 64
  writebackParallelism: 2
resultsCache:
  replicas: 2
  timeout: "2s"
  writebackParallelism: 2
Maybe there's an issue with the default values in the Helm chart being too low.
Someone with deeper Loki expertise will likely be able to assess whether this hides a bug in the read path of the Loki backend (i.e. not coping well with exhausted memcached connections).
Describe the bug
I am testing out a simple Loki configuration based on 2-S3-Cluster-Example.yaml. It works well locally (using MinIO for storage) but behaves differently on AWS instances. For some reason, I get timeouts when running queries right after starting the Loki Docker container, or after a period of inactivity (by inactivity I mean a period where I am continuously ingesting data through Promtail but not running any queries). This goes on for about 5 minutes, after which queries just start working. The query range doesn't matter: it behaves exactly the same way for 6h and for 30d. Log volume is fairly small, just a few tens of MB in total, so small range queries could be processing tens or hundreds of KB and still time out.
Pasting my configuration below:
I am using Loki as a datasource for Grafana (running on the same instance). Labels are few and low-cardinality, so I don't think that should be an issue (I've got 4 indexed labels: environment, job, service_name, status).
I really hope I am doing something wrong in the config above. I haven't changed any timeouts in the Grafana datasource config yet, as this shouldn't happen on such low data volumes and I don't want to mask the issue this way.
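To back up the low-cardinality claim with numbers, here is a minimal sketch that counts distinct values per label through Loki's HTTP label endpoints (`/loki/api/v1/labels` and `/loki/api/v1/label/<name>/values`). The base URL is an assumption (Loki reachable on port 3100 on the same host), and `fetch_json` and `label_cardinality` are hypothetical helper names, not part of any Loki client.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Loki's HTTP API is reachable here; adjust to your setup.
LOKI_URL = "http://localhost:3100"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def label_cardinality(base_url, fetch=fetch_json):
    """Return {label_name: number_of_distinct_values}.

    `fetch` is injectable so the function can be exercised without a
    running Loki instance.
    """
    labels = fetch(f"{base_url}/loki/api/v1/labels").get("data") or []
    counts = {}
    for name in labels:
        quoted = urllib.parse.quote(name, safe="")
        body = fetch(f"{base_url}/loki/api/v1/label/{quoted}/values")
        counts[name] = len(body.get("data") or [])
    return counts
```

Calling `label_cardinality(LOKI_URL)` against a live instance returns one count per label. Without `start`/`end` parameters the label endpoints only cover a default recent window, so the counts may understate the long-term picture.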
To Reproduce
Steps to reproduce the behavior:
1. Run the query {job="nginx"} over the last 24 hours (which processed 139.3 KiB of data in my case).
2. The query fails with loki net/http: request canceled (Client.Timeout exceeded while awaiting headers) after 30s.
Expected behavior: Queries (especially small ones processing just a couple thousand log lines) should work without timing out.
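To rule Grafana out, the same query can be sent straight to Loki's `/loki/api/v1/query_range` endpoint and timed. The sketch below is only an illustration: the base URL, the 30-second timeout (chosen to match the error above), and the helper names `build_query_range_url` and `timed_query` are assumptions.

```python
import time
import urllib.parse
import urllib.request

# Assumption: Loki's HTTP API is reachable here; adjust to your setup.
LOKI_URL = "http://localhost:3100"


def build_query_range_url(base_url, logql, hours, now_ns=None, limit=100):
    """Build a query_range URL covering the last `hours` hours.

    start/end are passed as nanosecond Unix epoch timestamps.
    """
    now_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = now_ns - hours * 3600 * 1_000_000_000
    params = urllib.parse.urlencode(
        {"query": logql, "start": start_ns, "end": now_ns, "limit": limit}
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


def timed_query(url, timeout=30.0, opener=urllib.request.urlopen):
    """Run one request and return (ok, elapsed_seconds, detail)."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # drain the body so the timing covers the full response
            ok, detail = resp.status == 200, f"HTTP {resp.status}"
    except OSError as exc:  # covers URLError, HTTPError and socket timeouts
        ok, detail = False, repr(exc)
    return ok, time.monotonic() - started, detail
```

Running `timed_query(build_query_range_url(LOKI_URL, '{job="nginx"}', 24))` right after a container start, and again a few minutes later, would show whether the slow period is visible without Grafana in the path.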
Environment: