tasiotas opened this issue 2 years ago
Same issue with 2.9.2
what's going on with this?
+1
+1
+1, is there a solution to this issue?
Cool to see how the devs have ignored this for almost a year. It says a lot about this software's quality. Same issue here btw, Loki chart 5.41.7
I had this issue on two separate occasions. The first was a single monolithic container setup where promtail log streams were overloading Loki; I addressed it in my promtail configs.
The second time was a production install with the simple scalable chart; I made sure these recommendations for Istio injection were followed and that NetworkPolicies weren't blocking pod-to-pod communication (a rough sketch of such a policy follows below), and the error went away. https://grafana.com/docs/loki/latest/setup/install/istio/
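For reference, a minimal sketch of the kind of NetworkPolicy that keeps pod-to-pod gRPC open between Loki components; the name, namespace, labels, and the 9095 gRPC port are assumptions based on a default helm install, so adapt them to your own deployment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-loki-grpc            # hypothetical name
  namespace: loki                  # assumed namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki # assumed label from the helm chart
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: loki
      ports:
        - protocol: TCP
          port: 9095               # Loki's default gRPC listen port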
Cool to see how the devs have ignored this for almost a year. It says a lot about this software's quality. Same issue here btw, Loki chart 5.41.7
The devs have 1.2k issues on their hands, and most releases seem security-related. I guess they have too much on their plate right now, and that doesn't necessarily mean there's something wrong with the software's quality, even if issues like this one are very bothersome.
+1, also seeing this, but everything is working fine
+1, loki 2.9.3
I have the same issue, and setting
query_range:
  parallelise_shardable_queries: false
  split_queries_by_interval: 0s
does not help me.
+1, Loki 2.8.3, docker swarm, monolithic.
In the config file:
frontend:
  address: {{ FRONTEND_SVC_NAME }}
helped in my case
frontend:
  address: {{ FRONTEND_SVC_NAME }}
helped in my case
Is this in the config file, the JSON report, or somewhere else?
@bioszombie it's in the config file
+1
+1
Jeez, the amount of +1s. I get that the devs have 1000+ issues, but if everyone has been complaining for 1.5 years, that means something doesn't work. I have the same issue, but without any OOM.
+1
This issue creates numerous log events which, if not managed properly, will eventually fill up your disk unless you set retention policies, etc.
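As a hedged example of one way to cap that growth when Loki runs under Docker Compose or Swarm (the service name, image tag, and limits below are placeholders, not values from this thread), the json-file log driver can rotate the container's own logs:

services:
  loki:
    image: grafana/loki:2.9.8   # example tag only
    logging:
      driver: json-file
      options:
        max-size: "50m"   # rotate the container log after 50 MB
        max-file: "3"     # keep at most 3 rotated files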
In the config file:
frontend:
  address: {{ FRONTEND_SVC_NAME }}
helped in my case
This also helped me! Thanks @rakazak0. Was deployed in docker swarm.
https://github.com/grafana/loki/issues/4699#issuecomment-977843834
In the documentation, address is not defined, so I am wondering why this helps and what it does.
Also, I am not deploying with containers, and I have no idea what FRONTEND_SVC_NAME is. Google lists just 4 results, none of which helps.
I have solved the problem. After looking at the source code and debug logs, I found that this problem generally occurs in multi-NIC environments: Loki selects the first network interface by default, but that may not be the right one.
level=debug ts=2024-05-29T08:18:14.635795694Z caller=scheduler.go:332 msg="frontend connected" address=147.135.97.120:9096
The relevant code is in pkg/lokifrontend/frontend/config.go:41:
func InitFrontend(cfg CombinedFrontendConfig, ring ring.ReadRing, limits v1.Limits, grpcListenPort int, log log.Logger, reg prometheus.Registerer) (http.RoundTripper, *v1.Frontend, *v2.Frontend, error) {
Therefore, we need to explicitly set the address parameter under the frontend section so that the frontend can communicate normally:
frontend:
  address: 192.168.0.1
level=debug ts=2024-05-29T08:38:18.364993176Z caller=scheduler.go:332 msg="frontend connected" address=192.168.0.1:9096
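For anyone in the same multi-NIC situation, a minimal sketch of the relevant config block; the 192.168.0.1 address and the eth1 interface are examples, and the instance_interface_names alternative is my reading of the Loki configuration reference rather than something confirmed in this thread:

frontend:
  # Pin the address the frontend advertises to the scheduler...
  address: 192.168.0.1
  # ...or, alternatively, tell Loki which interface to read the address from:
  # instance_interface_names: [eth1]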
I've moved to Elastic from this software because of these errors and the lack of a retention parameter in the configuration, sorry.
+1 loki 2.9.8 same issue
same issue with SimpleScalable
Also occurs on Loki v3.0.0
caller=scheduler_processor.go:102 component=querier msg="error contacting scheduler" err="rpc error: code = Canceled desc = context canceled" addr=<redacted>:9095
Although we still get a usable result, the number of queryable documents in a given timeframe differs by 10% from the Elasticsearch instance. Not sure if this error is the reason.
Running the helm chart with most of the defaults in SSD mode:
appVersion: 3.0.0
version: 6.6.1
All queries are served properly with no errors in the UI; however, Loki keeps emitting these errors:
level=error ts=2024-06-12T08:32:56.588314492Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.587980622Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.531779119Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.531562579Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.52766308Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.527461992Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.449343794Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.15.0:9095
level=error ts=2024-06-12T08:32:56.448862704Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.405988504Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.15.0:9095
level=error ts=2024-06-12T08:32:08.028110542Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
During this, I also see a 100% error rate and 100 s latency on /schedulerpb.SchedulerForQuerier/QuerierLoop
+1111111111111111111111111111111
I was linked to this issue today, and scrolling through it, I just want to say how sorry I am, folks.
We get really tied up in a lot of things building Loki, and the irony of building a log database is that you don't mind when you generate a lot of logs.
Largely we've just been ignoring this because we know why it happens, and we've gotten really good at ignoring log lines that aren't important. In some ways I think this is one of the things about Loki I really like: just excluding stuff when I don't care about it.
But we failed both to fix this and to communicate anything about it.
A recent PR included a change to check the context before we log this message, which will make some of these messages (context canceled) go away.
The notifying scheduler problem still exists.
The TL;DR on the problem is: when you run a query looking for logs (not metrics), Loki only needs 1000 lines to answer the query, but it may dispatch thousands of subqueries to find those 1000 lines. If the query matches a lot of lines, we get the result quickly and try to cancel the remaining requests, but many are already dispatched and running in parallel, and canceling them is hard or doesn't happen fast enough, which leads to a slew of errors.
Also, not as an excuse but to level with everyone (because I can see the frustration here and that makes me feel bad): we are struggling to keep up with the community because Loki continues to grow, and we have to really focus on finding and fixing the things that come up as we run Loki at bigger and bigger scales as part of our SaaS product.
I love that we have Loki OSS and I love how successful so many folks have been using it, but we do struggle to keep up in all directions at the same time. Our SaaS product funds us and allows us to continue to build Loki, so it gets most of our attention, but we are still committed to providing Loki to everyone. I just wish we had more hours in the day to communicate with y'all.
Thank you to everyone who responded to this issue, including the +1s, but especially anyone who added additional context. I don't have a timeline on what we can do about the remaining errors, but if you can configure your agents to just drop them, I would recommend it, and hopefully we can have some improvements soon!
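To follow the "configure your agents to just drop them" advice, here is a minimal Promtail sketch using its drop pipeline stage; the job name, labels, path, and regex are placeholders to adapt to your own setup:

scrape_configs:
  - job_name: loki-self                      # hypothetical job scraping Loki's own logs
    static_configs:
      - targets: [localhost]
        labels:
          job: loki
          __path__: /var/log/loki/*.log      # placeholder path
    pipeline_stages:
      # Drop the noisy scheduler/iterator errors discussed in this issue.
      - drop:
          expression: "(error notifying scheduler about finished query|error processing requests from scheduler|closing iterator)"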
In the config file:
frontend:
  address: {{ FRONTEND_SVC_NAME }}
helped in my case
This helped me too. My helm chart values:
structuredConfig:
  frontend:
    address: "query_frontend_svc_name.namespace.svc."
+1
Running the helm chart with most of the defaults in SSD mode:
appVersion: 3.0.0, version: 6.6.1
All queries are served properly with no errors in the UI; however, Loki keeps emitting these errors:
level=error ts=2024-06-12T08:32:56.588314492Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.587980622Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.531779119Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.531562579Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.52766308Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.527461992Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
level=error ts=2024-06-12T08:32:56.449343794Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.15.0:9095
level=error ts=2024-06-12T08:32:56.448862704Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
level=error ts=2024-06-12T08:32:56.405988504Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.15.0:9095
level=error ts=2024-06-12T08:32:08.028110542Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095
During this, I also see a 100% error rate and 100 s latency on /schedulerpb.SchedulerForQuerier/QuerierLoop
Hi @LukoJy3D, have you solved this issue? I am using SSD (3.1.0) as well, and Grafana even failed to connect to Loki; the loki-read pods were complaining about this, and all I can do is restart the loki-read component.
In the config file:
frontend:
  address: {{ FRONTEND_SVC_NAME }}
helped in my case
This helped me too. My helm chart values:
structuredConfig:
  frontend:
    address: "query_frontend_svc_name.namespace.svc."
Hi @Vinaum8, in the helm chart for SSD mode, the frontend address is left unset (null):
frontend:
  scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
  tail_proxy_url: '{{ include "loki.querierAddress" . }}'
frontend_worker:
  scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
{{/* Determine query-scheduler address */}}
{{- define "loki.querySchedulerAddress" -}}
{{- $schedulerAddress := ""}}
{{- $isDistributed := eq (include "loki.deployment.isDistributed" .) "true" -}}
{{- if $isDistributed -}}
{{- $schedulerAddress = printf "%s.%s.svc.%s:%s" (include "loki.querySchedulerFullname" .) .Release.Namespace .Values.global.clusterDomain (.Values.loki.server.grpc_listen_port | toString) -}}
{{- end -}}
{{- printf "%s" $schedulerAddress }}
{{- end }}
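For reference, a sketch of how the earlier frontend.address workaround could be applied through the chart's structuredConfig; the service name and namespace are placeholders, and whether advertising a Service address (rather than each pod's own address) is actually correct for SSD mode isn't settled in this thread, so treat it as an experiment:

loki:
  structuredConfig:
    frontend:
      address: "loki-read.monitoring.svc.cluster.local"   # placeholder DNS name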
I've addressed this by filtering out the relevant logs using Fluent Bit and have configured Docker log retention.
Hi,
I am getting a lot of those errors:
I found a similar issue related to the frontend address and added it to my config, but it didn't help.
Here is my docker-compose.yml
loki-config.yml is based on complete-local-config.yaml from the docs.
Any ideas what is causing it? Thank you