grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

error notifying frontend/scheduler about finished query #7649

Open tasiotas opened 2 years ago

tasiotas commented 2 years ago

Hi,

I am getting a lot of these errors:

loki-1  | level=error ts=2022-11-09T16:43:52.196413829Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196413176Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196428849Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196462256Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196970284Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197027425Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197058086Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197084426Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197793087Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197823878Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197833438Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197837788Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197860028Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197879088Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096

I found a similar issue related to the frontend address and added it to my config, but it didn't help:

frontend:
  address: 0.0.0.0

Here is my docker-compose.yml:

  loki:
    image: grafana/loki:2.6.1
    user: root
    volumes:
      - ./Docker/compose/local/loki:/etc/loki
      - loki_data:/home/loki/data
    ports:
      - 3100:3100
      - 9096:9096
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yml

My loki-config.yml is based on complete-local-config.yaml from the docs:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn

frontend:
  address: 0.0.0.0

common:
  path_prefix: /home/loki/data
  storage:
    filesystem:
      chunks_directory: /home/loki/data/chunks
      rules_directory: /home/loki/data/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

query_scheduler:
  max_outstanding_requests_per_tenant: 1000

Any ideas what is causing this? Thank you.

christertime commented 10 months ago

Same issue with 2.9.2

0xForerunner commented 10 months ago

what's going on with this?

hannibal85 commented 10 months ago

+1

magcode commented 10 months ago

+1

waney316 commented 10 months ago

+1. Is there a solution to this issue?

gelato commented 10 months ago

Cool seeing how the devs have ignored this for almost a year. It says a lot about this software's quality. Same issue here btw, Loki chart 5.41.7.

rgarcia6520 commented 10 months ago

I had this issue on two separate occasions. The first was a single monolithic container setup where promtail streams were overloading Loki; I addressed that in my promtail configs (see the sketch below).

The second time was in a production install with the simple scalable chart. I made sure these recommendations for Istio injection were followed and that NetworkPolicies weren't blocking pod-to-pod communication, and the error went away. https://grafana.com/docs/loki/latest/setup/install/istio/
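
For reference, a minimal sketch of the kind of Promtail throttling I mean, assuming Promtail's limits_config block (check the exact option names against your Promtail version's docs); the numbers are placeholders, not recommendations:

limits_config:
  readline_rate_enabled: true   # enable read-line rate limiting on this Promtail instance
  readline_rate: 5000           # steady-state lines per second pushed to Loki
  readline_burst: 10000         # short bursts allowed above the steady rate
  readline_rate_drop: false     # false = apply backpressure instead of discarding lines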

raypet-cillco commented 10 months ago

Cool seeing how the devs have ignored this for almost a year. It says a lot about this software's quality. Same issue here btw, Loki chart 5.41.7.

The devs have 1.2k issues on their hands, and most releases seem security related. I guess they have too much on their plate right now; that doesn't necessarily mean there's something wrong with the software's quality, even if issues like this are very bothersome.

danielwhatmuff commented 10 months ago

+1, also seeing this, but everything is working fine.

kingindanord commented 10 months ago

+1, loki 2.9.3

DJLebedev commented 10 months ago

I have the same issue. Setting

query_range:
  parallelise_shardable_queries: false
  split_queries_by_interval: 0s

does not help me.

dorinand commented 9 months ago

+1, Loki 2.8.3, docker swarm, monolithic.

rakazak0 commented 9 months ago

@dorinand https://github.com/grafana/loki/issues/4699

rakazak0 commented 9 months ago

In config file:

frontend:
  address: {{ FRONTEND_SVC_NAME }}

helped in my case

bioszombie commented 9 months ago

frontend:
  address: {{ FRONTEND_SVC_NAME }}

helped in my case

Is this in the config file? The JSON report or other?

rakazak0 commented 9 months ago

@bioszombie it's in the config file.

sojjan1337 commented 7 months ago

+1

banschikovde commented 7 months ago

+1

citrocity commented 7 months ago

Jeez, the amount of +1s. I get that the devs have 1000+ issues, but if everyone has been complaining for 1.5 years, that means something doesn't work. It is the same issue here, but without any OOM.

markzporter commented 7 months ago

+1

jodykpw commented 7 months ago

This issue creates numerous log events which, if not managed properly, will eventually fill up your disk unless you set retention policies, etc. (see the sketch below).
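
For example, a minimal sketch of capping the Loki container's own log output with Docker's json-file logging options in docker-compose (the size and file counts are placeholders):

services:
  loki:
    image: grafana/loki:2.6.1
    logging:
      driver: json-file   # Docker's default logging driver
      options:
        max-size: "10m"   # rotate the container log file at roughly 10 MB
        max-file: "3"     # keep at most 3 rotated files per container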

markzporter commented 7 months ago

In config file:

frontend:
  address: {{ FRONTEND_SVC_NAME }}

helped in my case

This also helped me, thanks @rakazak0. My deployment was in Docker Swarm.

https://github.com/grafana/loki/issues/4699#issuecomment-977843834

CarliJoy commented 6 months ago

The address setting is not defined in the documentation, so I am wondering why it helps and what it does.

Also, I am not deploying with containers, and I have no idea what FRONTEND_SVC_NAME is. Google lists just 4 results, none of which helps.

DavidYuYu commented 6 months ago

I have solved the problem. After looking at the source code and debug logs, I found that this problem generally occurs in environments with multiple network cards: Loki selects the first network card by default, which may not be the right one.

level=debug ts=2024-05-29T08:18:14.635795694Z caller=scheduler.go:332 msg="frontend connected" address=147.135.97.120:9096

The relevant code is in pkg/lokifrontend/frontend/config.go:41:

func InitFrontend(cfg CombinedFrontendConfig, ring ring.ReadRing, limits v1.Limits, grpcListenPort int, log log.Logger, reg prometheus.Registerer) (http.RoundTripper, *v1.Frontend, *v2.Frontend, error) {

Therefore, we need to explicitly set the address parameter under the frontend block so that the frontend can maintain normal communication:

frontend:
  address: 192.168.0.1

level=debug ts=2024-05-29T08:38:18.364993176Z caller=scheduler.go:332 msg="frontend connected" address=192.168.0.1:9096

usermakename commented 6 months ago

I've moved from this software to Elastic because of these errors and the lack of a retention parameter in the configuration, sorry.

datkieu007 commented 6 months ago

+1 loki 2.9.8 same issue

yalattas commented 6 months ago

same issue with SimpleScalable

Espe0n commented 5 months ago

Also occurs on Loki v3.0.0

caller=scheduler_processor.go:102 component=querier msg="error contacting scheduler" err="rpc error: code = Canceled desc = context canceled" addr=<redacted>:9095

Although we still get a usable result, the number of queryable documents in a given timeframe differs by about 10% compared to our Elasticsearch instance. Not sure if this error is the reason.

LukoJy3D commented 5 months ago

Running helm chart with most of the defaults in SSD mode

appVersion: 3.0.0
version: 6.6.1

All queries are served properly with no errors in the UI; however, Loki keeps emitting these errors:

level=error ts=2024-06-12T08:32:56.588314492Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095

level=error ts=2024-06-12T08:32:56.587980622Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"

level=error ts=2024-06-12T08:32:56.531779119Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.29.198:9095

level=error ts=2024-06-12T08:32:56.531562579Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"

level=error ts=2024-06-12T08:32:56.52766308Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095

level=error ts=2024-06-12T08:32:56.527461992Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095

level=error ts=2024-06-12T08:32:56.449343794Z caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.28.15.0:9095

level=error ts=2024-06-12T08:32:56.448862704Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"

level=error ts=2024-06-12T08:32:56.405988504Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.15.0:9095

level=error ts=2024-06-12T08:32:08.028110542Z caller=scheduler_processor.go:110 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.28.29.198:9095

During this, I also see a 100% error rate and ~100s latency on /schedulerpb.SchedulerForQuerier/QuerierLoop.

lianghuiyuan commented 4 months ago

+1111111111111111111111111111111

slim-bean commented 4 months ago

I was linked to this issue today, and scrolling through it I just want to say how sorry I am, folks.

We get really tied up in a lot of things building Loki, and part of the irony of building a log database is that you don't mind when you generate a lot of logs.

Largely we've just been ignoring this because we know why it happens, and we've gotten really good at ignoring log lines that aren't important. In some ways that's one of the things about Loki I really like: just excluding stuff when I don't care about it.

But we failed to fix this, and we failed to communicate anything about it.

A recent PR included a change to check the context before we log this message, which will make some of these messages (the "context canceled" ones) go away.

The notifying scheduler problem still exists.

The TL;DR on the problem is: when you run a query looking for logs (not metrics), Loki only needs 1000 lines to answer the query, but it may dispatch thousands of subqueries to find those 1000 lines. If the query matches a lot of lines, we get the result quickly and try to cancel the remaining requests, but many are already dispatched and running in parallel, and canceling them is hard or doesn't happen fast enough, which leads to a slew of errors.

This is not an excuse, but to level with everyone, because I can see the frustration here and that makes me feel bad: we are struggling to keep up with the community because Loki continues to grow, and we have to focus on finding and fixing the things that come up as we run Loki at bigger and bigger scales as part of our SaaS product.

I love that we have Loki OSS and I love how successful so many folks have been using it, but we do struggle to keep up in all directions at the same time. Our SaaS product funds us and allows us to continue to build Loki, so it gets most of our attention, but we are still committed to providing Loki to everyone. I just wish we had more hours in the day to communicate with y'all.

Thank you to everyone who responded to this issue, including the +1s, but especially anyone who added additional context. I don't have a timeline on what we can do about the remaining errors, but if you can configure your agents to just drop them, I would recommend it (see the sketch below), and hopefully we can have some improvements soon!
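
As one concrete way to do that, here is a minimal sketch of dropping these lines at the agent, assuming Promtail is the agent scraping the Loki container's own logs; the job name, path, and regexes are illustrative placeholders:

scrape_configs:
  - job_name: loki-self
    static_configs:
      - targets: [localhost]
        labels:
          job: loki
          __path__: /var/log/loki/*.log   # wherever the Loki logs land on disk
    pipeline_stages:
      # drop the noisy "finished query" notifications and the context-canceled retries
      - drop:
          expression: "error notifying (frontend|scheduler) about finished query"
      - drop:
          expression: "retry.go.*context canceled"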

Vinaum8 commented 4 months ago

In config file:

frontend:
  address: {{ FRONTEND_SVC_NAME }}

helped in my case

This helped me too. My helm chart values:

    structuredConfig:
      frontend:
        address: "query_frontend_svc_name.namespace.svc."

emadolsky commented 3 months ago

+1

duj4 commented 3 months ago

Running helm chart with most of the defaults in SSD mode [...] During this, I also see a 100% error rate and ~100s latency on /schedulerpb.SchedulerForQuerier/QuerierLoop.

Hi @LukoJy3D, have you solved this issue? I am using SSD (3.1.0) as well, and Grafana even failed to connect to Loki; the loki-read pods kept logging this error, and all I could do was restart the loki-read component.

duj4 commented 3 months ago

In config file:

frontend:
  address: {{ FRONTEND_SVC_NAME }}

helped in my case

This helped me too. My helm chart values:

    structuredConfig:
      frontend:
        address: "query_frontend_svc_name.namespace.svc."

Hi @Vinaum8, in the Helm chart for SSD mode the frontend address is left unset; only the scheduler address is templated:

  frontend:
    scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
    tail_proxy_url: '{{ include "loki.querierAddress" . }}'
  frontend_worker:
    scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'

{{/* Determine query-scheduler address */}}
{{- define "loki.querySchedulerAddress" -}}
{{- $schedulerAddress := ""}}
{{- $isDistributed := eq (include "loki.deployment.isDistributed" .) "true" -}}
{{- if $isDistributed -}}
{{- $schedulerAddress = printf "%s.%s.svc.%s:%s" (include "loki.querySchedulerFullname" .) .Release.Namespace .Values.global.clusterDomain (.Values.loki.server.grpc_listen_port | toString) -}}
{{- end -}}
{{- printf "%s" $schedulerAddress }}
{{- end }}

jodykpw commented 2 months ago

I've addressed this by filtering out the relevant logs with Fluent Bit and by configuring Docker log retention; a sketch of the Fluent Bit filter is below.
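
For reference, a minimal sketch of the kind of Fluent Bit filtering I mean, using the grep filter in classic configuration mode; the loki.* tag and the log key are placeholders for whatever your pipeline uses:

# Drop the noisy notification errors from the Loki container's log stream
[FILTER]
    Name    grep
    Match   loki.*
    Exclude log error\s+notifying\s+(frontend|scheduler)\s+about\s+finished\s+query

# Drop the context-canceled retry errors as well
[FILTER]
    Name    grep
    Match   loki.*
    Exclude log retry\.go.*context\s+canceled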