itsnotv opened this issue 2 years ago (status: Open)
Hi, I resolved the problem on my side by increasing two default values:
querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
It's not perfect, but this error helped me understand the architecture better. Next step: use the query_frontend (not mandatory, but active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.
Hi, I come back after trying many, many settings.
I solved my problem with the config below.
My dashboard now renders completely in 5 s :) Without the splitting parameter I always got 429 errors for 1 to 3 graphs and a render time of 3 min.
It works for me because I had a lot of small requests, too many for my Docker Loki process. Reducing them was the solution; increasing workers, frontend, parallelism or timeouts was a bad idea.
For completeness, here's the needed config
query_range:
split_queries_by_interval: 24h
frontend:
max_outstanding_per_tenant: 1024
For completeness, here's the needed config
query_range:
  split_queries_by_interval: 24h
frontend:
  max_outstanding_per_tenant: 1024
This helped partially, I still see the error every now and then.
You can raise max_outstanding_per_tenant even higher. I've set mine to 4096 now.
But I'm afraid you can never avoid 'too many requests' completely. As far as I understand (still learning...), the more data you try to load, the more often you will hit this limit.
In my case, 'loading more data' is caused by the fact that in Grafana I want to view the whole 721 hours (30 days), or by having crammed too many queries into one graph.
I'm still working on finding the right trade-off between memory-usage and speed. Below, you'll see my current partial configuration, relevant to this specific issue.
server:
http_listen_port: 3100
grpc_listen_port: 9096
# Read timeout for HTTP server
http_server_read_timeout: 3m
# Write timeout for HTTP server
http_server_write_timeout: 3m
query_range:
split_queries_by_interval: 0
parallelise_shardable_queries: false
querier:
max_concurrent: 2048
frontend:
max_outstanding_per_tenant: 4096
compress_responses: true
query_range:
  split_queries_by_interval: 0
This part seems to help.
I never ran into this issue with 2.4.1. Something changed in 2.4.2; I hope they restore the default values to what they were before.
For completeness, here's the needed config
query_range:
  split_queries_by_interval: 24h
frontend:
  max_outstanding_per_tenant: 1024
This worked for my setup, thanks!
I can also confirm that on v2.4.2 you will face this issue if you keep the new default values.
Switching the values back to the old defaults from v2.4.1 solved my problem.
query_range:
split_queries_by_interval: 0
Bump, this is a serious issue. Please fix it, Loki team.
I'm not able to solve my problem using any of the above values/options on version 2.4.2. We rolled back our Loki to version 2.4.1 and this solved our issue. Let's wait for a fix from the Loki team.
2.5.0 also has this problem.
select {
case queue <- req:
q.queueLength.WithLabelValues(userID).Inc()
q.cond.Broadcast()
// Call this function while holding a lock. This guarantees that no querier can fetch the request before function returns.
if successFn != nil {
successFn()
}
return nil
//default:
// q.discardedRequests.WithLabelValues(userID).Inc()
// return ErrTooManyRequests
}
After removing this part of the code (the default branch that returns ErrTooManyRequests, shown commented out above), the problem was alleviated.
We got the same error with v2.5.0. None of the above options solved the issue so we rolled back to v2.4.1.
Is there an ETA for a fix?
I can confirm this issue exists after an upgrade to the newest version, and I can't even roll back to 2.4.1. I should note that 2.4.1 uses v1beta tags and will soon no longer be available on GCP.
We also had a lot of "429 too many outstanding requests" errors on Loki 2.5.0 and 2.4.2. Moved back to Loki 2.4.1 and the problem is gone.
@wuestkamp the issue is really that 2.4.1 has security issues and will soon be deprecated by new k8s cluster versions.
So why is Grafana Labs not fixing this issue? I don't understand. Why is it so hard?
@benisai I wish I knew. Make sure you are only using this in an isolated network; the CVEs could lead to break-ins, and Grafana is a data pod with potentially lots of customer logs. Don't endanger your company by running old versions.
Homelab only. But the issue still persists without a fix. Or is there a fix?
I'm too lazy to set up a configuration file, so I just downgraded to 2.4.1 (homelab). I wish there was a way to configure Loki with environment variables. Configuration files are a pain.
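If editing the config file is the main annoyance, you can at least override individual settings with CLI flags, as a couple of commenters do further down in this thread (the flag names below are copied from their Ansible and Helm examples; the image tag and values are illustrative, and I believe there is also a -config.expand-env flag if you really want environment variables inside the config file). A rough docker-compose sketch:
services:
  loki:
    image: grafana/loki:2.8.2
    # override the two limits via flags instead of editing the config file
    command: >
      -config.file=/etc/loki/local-config.yaml
      -querier.max-outstanding-requests-per-tenant=4096
      -querier.max-concurrent=32
    ports:
      - "3100:3100"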
Hi, I resolved the problem on my side by increasing two default values:
querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
It's not perfect, but this error helped me understand the architecture better. Next step: use the query_frontend (not mandatory, but active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.
That works for me with Ansible.
Hi, any updates? Thanks for the info, but the problem still persists on Loki version 2.5.0.
I increased both frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant to 4096. I do not get any "too many outstanding requests" errors anymore (Loki v2.4.2, tested in a test cluster as well as a production cluster).
query_scheduler:
max_outstanding_requests_per_tenant: 4096
frontend:
max_outstanding_per_tenant: 4096
query_range:
parallelise_shardable_queries: true
limits_config:
split_queries_by_interval: 15m
max_query_parallelism: 32
The default values for frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant are too low if you are using dashboards with multiple queries (multiple panels or multiple queries in one panel) over a longer time range, because the queries will be split and will result in a lot of smaller sub-queries. Having multiple users use the same dashboard at the same time (or even just one user quickly refreshing the dashboard multiple times in a row) will further increase the count and you'll reach the limit even quicker.
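A rough back-of-the-envelope illustration (my own numbers, using the 2.4.2 defaults discussed further down in this thread) of why the limit gets hit so quickly:
query_scheduler:
  # With split_queries_by_interval: 30m, a single 24h dashboard query is split
  # into 24h / 30m = 48 sub-queries. A dashboard with 6 queries therefore
  # enqueues roughly 6 * 48 = 288 sub-queries per refresh, well above the old
  # default max_outstanding_requests_per_tenant of 100 -- hence the 429s.
  max_outstanding_requests_per_tenant: 4096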
This write-up really helped me understand the query splitting and why there are so many queries:
https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/split-a-query-into-someones
and
https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/schedule-queries-to-queriers
@stefan-fast Thank you so much for your help. With these configurations, I can confirm that the issue is fixed on Loki versions 2.5.0 and 2.4.2.
This doesn't work with 2.6.2.
So if I understand correctly, the issue is caused by the default settings of:
limits_config.max_query_parallelism = 32
limits_config.split_queries_by_interval = 30m
query_scheduler.max_outstanding_requests_per_tenant = 100
30 min × 32 gives you a time range of 16 h; this is where the per-query parallelism cap is reached. Now if a single dashboard spanning 16 h runs 3 such queries at the same time, you already get 3 × 32 = 96 outstanding sub-queries, which is basically at the limit of 100, and you get the too many outstanding requests error? The same applies if several users run such queries/dashboards.
Would reducing the max_query_parallelism also help to avoid this issue?
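It might. As a sketch of that idea (not something confirmed in this thread, and the value is arbitrary), lowering the per-query fan-out reduces how many sub-queries a single query can have outstanding at once, at the cost of slower individual queries:
limits_config:
  # each query fans out into at most this many parallel sub-queries, so one
  # dashboard refresh puts fewer requests into the queue at a time
  max_query_parallelism: 8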
Yep, I still have this problem in v2.5.0.
Reporting in that I have this issue with 2.4.2. The dashboard works fine when I have one panel with 3 queries in it (they are relatively the same, so they may actually be run as one query), but when I add another panel with only one query I get this error, even though I'm using the one-hour timeframe so it shouldn't be split into too many sub-queries.
I could select multiple days before this version was introduced; now anything more than 15 minutes will break my system if the dashboard has 8-9 queries on it.
For completeness, here's the needed config
query_range:
  split_queries_by_interval: 24h
frontend:
  max_outstanding_per_tenant: 1024
This worked for my setup, thanks!
The default values for frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant are too low if you are using dashboards with multiple queries (multiple panels or multiple queries in one panel) over a longer time range because the queries will be split and will result in a lot of smaller sub-queries.
I agree, these default values are too low; I also think we may have set our parallelism defaults too high.
A couple of things changed in 2.4.x and later versions of Loki with the intent of making queries faster for all installations by enabling query parallelism by default.
There are currently two forms of query parallelism: splitting and sharding.
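To keep the two apart, here is a quick summary of the knobs quoted throughout this thread (depending on the version, split_queries_by_interval lives under query_range or limits_config):
limits_config:
  split_queries_by_interval: 30m        # splitting: cuts a query into 30m sub-queries; 0 disables it
query_range:
  parallelise_shardable_queries: true   # sharding: set to false to disable query sharding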
If you find yourself dealing with the "too many outstanding requests" error, I would suggest increasing the limit:
If you are running the single binary or the SSD (read/write) modes, you can change:
query_scheduler:
max_outstanding_requests_per_tenant: 10000
The value isn't super important, you can make it really big if you'd like. It primarily exists as a fairness measure in larger multi-tenant Loki clusters.
If you are running microservices, you need to check whether you have the query scheduler deployed. If you do, set the value there (just like above); otherwise, if you have the frontend deployed, set the value on the frontend like so:
frontend:
max_outstanding_per_tenant: 10000
As an alternative, if you were happy with how Loki's performance was previously, you can disable query parallelism. This wouldn't be my choice, but it might make sense for small installations and in resource-constrained environments:
query_range:
parallelise_shardable_queries: false
limits_config:
split_queries_by_interval: 0
Is this problem solved? I get "too many outstanding requests" on my Grafana, and I don't know why.
@reyyzzy
The cause of the issue is that parallelism has been enabled by default, but it limits the number of queries that you can queue at the same time to a low number by default. The best solution right now is to edit the max number of outstanding requests. In the future, we can hope for saner defaults.
For simple deployments (single-binary or SSD mode), add the following configuration:
query_scheduler:
max_outstanding_requests_per_tenant: 10000
If you deployed in microservices mode, use this config:
frontend:
max_outstanding_per_tenant: 10000
Issue present in loki 2.7.0 while using default values.
Issue present in loki 2.7.0 while using default values.
On 2.7.4 as well
Same in 2.8.0
None of the above configurations worked for me, version 2.8.0
For me, the following were the only lines I changed from the base config in the Docker container, and they seem to work (so far):
querier:
max_concurrent: 100
frontend:
max_outstanding_per_tenant: 1024
scheduler_worker_concurrency: 20
I'm not throwing a huge amount at the server at the moment, but at least multiple panels in a dashboard load in a separate Grafana instance that's pointing at it.
Hi, I resolved the problem on my side by increasing two default values:
querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
It's not perfect, but this error helped me understand the architecture better. Next step: use the query_frontend (not mandatory, but active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.
It works for me.
@reyyzzy
The cause of the issue is that parallelism has been enabled by default, but it limits the number of queries that you can queue at the same time to a low number by default. The best solution right now is to edit the max number of outstanding requests. In the future, we can hope for saner defaults.
For simple deployments (single-binary or SSD mode), add the following configuration:
query_scheduler:
  max_outstanding_requests_per_tenant: 10000
If you deployed in microservices mode, use this config:
frontend:
  max_outstanding_per_tenant: 10000
The first one works for me in Loki v2.8.2 for the binary deployment.
Hi, I resolved the problem on my side by increasing two default values:
querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
It's not perfect, but this error helped me understand the architecture better. Next step: use the query_frontend (not mandatory, but active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.
That works for me with Ansible:
- name: Create loki service
  tags: grafana
  docker_container:
    name: loki
    restart_policy: always
    image: "grafana/loki:2.5.0"
    log_driver: syslog
    log_options:
      tag: lokilog
    networks:
      - name: "loki"
    command: "-config.file=/etc/loki/local-config.yaml -querier.max-outstanding-requests-per-tenant=2048 -querier.max-concurrent=2048"
Adding the configuration via command-line flags works for me in Loki v2.6.1, installed using Helm:
extraArgs:
querier.max-outstanding-requests-per-tenant: "2048"
querier.max-concurrent: "2048"
The following seems to work for chart version 5.6.2 and app version 2.8.2 of the grafana/loki Helm chart.
loki:
limits_config:
split_queries_by_interval: 24h
max_query_parallelism: 100
query_scheduler:
max_outstanding_requests_per_tenant: 4096
frontend:
max_outstanding_per_tenant: 4096
# other stuff...
If you are not deploying Loki via Helm, I believe you have to set these values not under the "loki:" key but at the top level, directly in the config.
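For example, the Helm values above would translate to roughly the following at the top level of a plain Loki config file (a sketch derived from the values quoted above, not a complete config):
limits_config:
  split_queries_by_interval: 24h
  max_query_parallelism: 100
query_scheduler:
  max_outstanding_requests_per_tenant: 4096
frontend:
  max_outstanding_per_tenant: 4096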
If the defaults won't be changed, I guess this issue can be closed.
@litvinav I'm highly against closing this; sensible defaults are something every piece of software should have. If you install an ingress, it also works out of the box and you can configure it further on top. That makes adoption for beginners easier.
To break the defaults you only need to select about 5 datasets and you will get a 429 (this isn't a complex screen or anything).
So what is the final decision? IMHO, a config generator that takes the server(s) hardware as input would be great.
querier:
  max_concurrent: 2048
That's a big number relative to the default of 10. You might need to watch resource consumption.
# The maximum number of concurrent queries allowed.
# CLI flag: -querier.max-concurrent
[max_concurrent: <int> | default = 10]
None of the above solutions worked; only reverting to 2.4.1 finally fixed this dreaded issue.
@msveshnikov Don't run old versions, you put your clusters at risk!
I have a good solution from another issue that was closed. Apparently the problem is with the parallel queries going on. The following configuration worked for me for the latest build.
Solution was found here : https://github.com/grafana/loki/issues/4613#issuecomment-1045993131
config:
query_scheduler:
max_outstanding_requests_per_tenant: 2048
query_range:
parallelise_shardable_queries: false
split_queries_by_interval: 0
Describe the bug
After upgrading to v2.4.2 from v2.4.1, none of the panels using Loki show any data. I have a dashboard with 4 panels that load data from Loki. I am able to see the data ingested correctly with a Grafana Explore datasource query.
Environment
Using loki with docker-compose and shipping docker logs with loki driver.
loki.yml
error on the grafana panel