grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

error notifying frontend/scheduler about finished query #7649

Open tasiotas opened 2 years ago

tasiotas commented 2 years ago

Hi,

I am getting a lot of those errors:

loki-1  | level=error ts=2022-11-09T16:43:52.196413829Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196413176Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196428849Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196462256Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
loki-1  | level=error ts=2022-11-09T16:43:52.196970284Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197027425Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197058086Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197084426Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197793087Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197823878Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197833438Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197837788Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197860028Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=0.0.0.0:9096
loki-1  | level=error ts=2022-11-09T16:43:52.197879088Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096


I found a similar issue related to the frontend address and added it to my config, but it didn't help:

frontend:
  address: 0.0.0.0

Here is my docker-compose.yml

  loki:
    image: grafana/loki:2.6.1
    user: root
    volumes:
      - ./Docker/compose/local/loki:/etc/loki
      - loki_data:/home/loki/data
    ports:
      - 3100:3100
      - 9096:9096
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yml

loki-config.yml, based on complete-local-config.yaml from the docs:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn

frontend:
  address: 0.0.0.0

common:
  path_prefix: /home/loki/data
  storage:
    filesystem:
      chunks_directory: /home/loki/data/chunks
      rules_directory: /home/loki/data/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

query_scheduler:
  max_outstanding_requests_per_tenant: 1000

Any ideas what is causing it? Thank you

aned commented 2 years ago

Experiencing the same issue with pretty much the same config. The debug-level logs:

Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984620841Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/distributor
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984615691Z caller=mock.go:150 msg=Get key=collectors/scheduler wait_index=14
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984607606Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/scheduler
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984601256Z caller=mock.go:150 msg=Get key=collectors/compactor wait_index=16
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984594181Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/compactor
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984583804Z caller=mock.go:150 msg=Get key=collectors/ring wait_index=17
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984529071Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/ring
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.714265066Z caller=logging.go:76 traceID=3e788abb64e0e1cc orgID=fake msg="POST /loki/api/v1/push (204) 538.586µs"
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.714180031Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=121.717µs msg="gRPC (success)"
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.713825407Z caller=push.go:150 org_id=fake msg="push request parsed" path=/loki/api/v1/push contentType=application/x-protobuf contentEncoding= bodySize="1.5 kB" streams=3 entries=33 streamLabelsSize="134 B" entriesSize="5.5 kB" totalSize="5.7 kB" mostRecentLagMs=507
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983745696Z caller=mock.go:150 msg=Get key=collectors/ring wait_index=17
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.98374076Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/ring
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983734963Z caller=mock.go:150 msg=Get key=collectors/distributor wait_index=15
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983730374Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/distributor
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983726097Z caller=mock.go:150 msg=Get key=collectors/scheduler wait_index=14
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983721069Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/scheduler
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983713093Z caller=mock.go:150 msg=Get key=collectors/compactor wait_index=16
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983672895Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/compactor
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613682421Z caller=logging.go:76 traceID=0b69d212465ca9fd orgID=fake msg="POST /loki/api/v1/push (204) 517.968µs"
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613572175Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=76.721µs msg="gRPC (success)"
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613234883Z caller=push.go:150 org_id=fake msg="push request parsed" path=/loki/api/v1/push contentType=application/x-protobuf contentEncoding= bodySize="1.5 kB" streams=3 entries=33 streamLabelsSize="134 B" entriesSize="5.5 kB" totalSize="5.7 kB" mostRecentLagMs=410

Tried adding the WAL config as per #2753, with no luck:

ingester:
  wal:
    enabled: true
    dir: /loki/wal

lswith commented 1 year ago

I'm getting the same issue.

timbuchinger commented 1 year ago

We are running into the same issue with Loki hosted on Kubernetes and using Azure Blob storage.

The following is repeated in the logs:

level=error ts=2022-11-30T17:59:50.205452691Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205437491Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205424091Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205411191Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.20539419Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.20535119Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.735217069Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.735171068Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734160865Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734131164Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734013755Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733509056Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733488356Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733398855Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733297354Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.731098021Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.731078221Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.73101162Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.730870719Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.730831518Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"

This is a dev environment, so I am open to making any suggested changes to resolve this.

klu4ic commented 1 year ago

I have the same issue. Helm chart: loki-stack 2.8.7, Loki v2.6.1.

mh3th commented 1 year ago

I have the same issue.

jodykpw commented 1 year ago

I have the same issue, even with the default -config.file=/etc/loki/local-config.yaml. Does anyone know how to write a Promtail pipeline_stages config to drop those messages?

i.e. msg="error notifying scheduler about finished query"

Thanks.
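A minimal sketch of a Promtail drop stage that would filter these lines out (the scrape_configs job, labels, and path shown here are assumptions; merge just the pipeline_stages part into your own scrape config):

scrape_configs:
  - job_name: loki
    static_configs:
      - targets: [localhost]
        labels:
          job: loki
          __path__: /var/log/loki/*.log   # hypothetical path to the Loki log file
    pipeline_stages:
      # drop any line matching either "notifying" error from scheduler_processor.go
      - drop:
          expression: "error notifying (frontend|scheduler) about finished query"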

tasiotas commented 1 year ago

I wonder if that PR is supposed to fix this issue https://github.com/grafana/loki/pull/7978

jodykpw commented 1 year ago

> I wonder if that PR is supposed to fix this issue #7978

https://raw.githubusercontent.com/grafana/loki/v2.7.0/production/docker-compose.yaml

Even this basic example has this issue, without any custom config.

jodykpw commented 1 year ago

Oh, I am using version 2.7.1.

alecks3474 commented 1 year ago

Same issue with Loki version 2.7.3

msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled"
msg="error notifying scheduler about finished query" err=EOF

alecks3474 commented 1 year ago

If I set query_range.parallelise_shardable_queries to false (cf. grafana.slack.com), I no longer get the notifying frontend/scheduler errors.

But this warning appears:

`msg="max concurrency is not evenly divisible across targets, adding an extra connection"`

mani76 commented 1 year ago

Same here as well.

fmarrero commented 1 year ago

limits_config:
  ingestion_rate_mb: 500
  retention_period: 30d
  per_stream_rate_limit: 512M
  per_stream_rate_limit_burst: 1024M
  max_query_series: 99999999
  query_timeout: 5m

querier:
  query_timeout: 5m
  engine:
    timeout: 5m

I tried

 /opt/loki/loki -target=all,table-manager -config.file=/opt/loki/loki-local-config.yaml -querier.engine.timeout=5m

in the service

in grafana.ini

[dataproxy]
timeout = 600

Still timing out at 3 minutes.

FIX FOR ME:

Finally I realized that in the Loki data source config (which you create in Grafana), a timeout can be set that seems to override the Grafana data proxy setting. Mine was set to 180, and bumping it up allowed me to extend the query time for the data source. Hope this helps someone.
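If the data source is provisioned from a file rather than through the UI, the same timeout can be set there; a sketch assuming Grafana's data source provisioning format (the URL and values are placeholders, not from this thread):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      timeout: 600   # per-data-source HTTP timeout in seconds; overrides the [dataproxy] timeout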

nrm21 commented 1 year ago

+1 Same issue:

Doing the above (Grafana data source timeout) didn't seem to help for me. But I noticed on mine it kept restarting the pod on the same node... once I cordoned off that node and deleted the pod (to force a restart elsewhere), the issue went away. It doesn't make a huge amount of sense, but I thought I'd share in case it adds a clue to what is going on.

Maybe the new Loki node location forced some sort of networking reset on the Promtail DaemonSet pods that were trying to connect to it? I don't know, grasping at straws here.

slim-bean commented 1 year ago

EOF errors are often indicative of something running out of memory and OOM crashing.

I suspect for most of the examples here your frontend pods are OOM crashing on queries.

This can happen for a few reasons; typically it's a logs query (metric queries return samples and not log lines, so it's harder to OOM a frontend with them, but it's not impossible).

Run more frontends, or increase their memory limits; this is typically how you work around it.

There was a change made recently, however, to help with one subset of cases where we saw this happening a lot. It really only affects anyone querying with just label matchers, e.g. {job="foo"} with nothing else (no filters): we were parallelizing these queries too aggressively, and if the label selector matches enough data (typically TBs a day for the streams) you could really thrash the frontends.

That change isn't in a release yet, but hopefully we'll have a release in a few weeks.
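As a rough illustration of the "more frontends / more memory" workaround on Kubernetes, a hypothetical Helm values fragment (the keys follow a loki-distributed-style chart and the numbers are assumptions, not recommendations):

queryFrontend:
  replicas: 3           # more frontends spreads query fan-out across pods
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 2Gi       # raise this if the frontend is being OOM-killed on large log queries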

Daryes commented 1 year ago

I think the problem is more related to Loki's default values than to the frontend configuration, especially when there is none.

With a config similar to the OP's, on a test environment with one server running Loki v2.7.4 + Grafana v9.3.1 and Promtail v2.7.4 scanning its own /var/log dir, I'm getting hundreds of error lines like these:

Mar  4 12:45:05 prom loki[66304]: level=error ts=2023-03-04T11:45:04.909077383Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:3196
Mar  4 12:45:05 prom loki[66304]: level=error ts=2023-03-04T11:45:04.909093885Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=192.168.1.12:3196

It occurs non-stop with a single log panel in Grafana connected to Loki, using only the filter job="varlogs".

As 192.168.1.12:3196 is the gRPC listen port, my question is: why does Loki need to connect to its own gRPC server, and why is it failing miserably?

For the record, here are my server and frontend sections from curl http://localhost:3100/config. Please note I don't have a frontend section in the Loki config file, hence address="" and port=0.

server:
  http_listen_network: tcp
  http_listen_address: 0.0.0.0
  http_listen_port: 3100
  http_listen_conn_limit: 0
  grpc_listen_network: tcp
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: 3196
  grpc_listen_conn_limit: 0
  tls_cipher_suites: ""
  tls_min_version: ""
  http_tls_config:
    cert_file: ""
    key_file: ""
    client_auth_type: ""
    client_ca_file: ""
  grpc_tls_config:
    cert_file: ""
    key_file: ""
    client_auth_type: ""
    client_ca_file: ""
  register_instrumentation: true
  graceful_shutdown_timeout: 30s
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  http_server_idle_timeout: 2m0s
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
  grpc_server_max_concurrent_streams: 100
  grpc_server_max_connection_idle: 2562047h47m16.854775807s
  grpc_server_max_connection_age: 2562047h47m16.854775807s
  grpc_server_max_connection_age_grace: 2562047h47m16.854775807s
  grpc_server_keepalive_time: 2h0m0s
  grpc_server_keepalive_timeout: 20s
  grpc_server_min_time_between_pings: 10s
  grpc_server_ping_without_stream_allowed: true
  log_format: logfmt
  log_level: warn
  log_source_ips_enabled: false
  log_source_ips_header: ""
  log_source_ips_regex: ""
  log_request_at_info_level_enabled: false
  http_path_prefix: ""
(...)
frontend:
  log_queries_longer_than: 0s
  max_body_size: 10485760
  query_stats_enabled: false
  max_outstanding_per_tenant: 2048
  querier_forget_delay: 0s
  scheduler_address: ""
  scheduler_dns_lookup_period: 10s
  scheduler_worker_concurrency: 5
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
    grpc_compression: ""
    rate_limit: 0
    rate_limit_burst: 0
    backoff_on_ratelimits: false
    backoff_config:
      min_period: 100ms
      max_period: 10s
      max_retries: 10
    tls_enabled: false
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
    tls_server_name: ""
    tls_insecure_skip_verify: false
    tls_cipher_suites: ""
    tls_min_version: ""
  instance_interface_names:
  - ens32
  - lo
  address: ""
  port: 0
  compress_responses: false
  downstream_url: ""
  tail_proxy_url: ""
  tail_tls_config:
    tls_cert_path: ""
    tls_key_path: ""
    tls_ca_path: ""
    tls_server_name: ""
    tls_insecure_skip_verify: false
    tls_cipher_suites: ""
    tls_min_version: ""
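One way to test that theory is to set the frontend address and port explicitly instead of leaving the defaults ("" and 0); a sketch for a single-process setup (untested here, values are assumptions matching the config dump above):

frontend:
  address: 127.0.0.1   # advertise the loopback address so the querier workers connect locally
  port: 3196           # match grpc_listen_port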

ivanahuckova commented 1 year ago

@usmangt I removed the Observability Logs squad as this is related to Loki, not the Loki data source in Grafana.

edisonX-sudo commented 1 year ago

+1 same issue

MaxDiOrio commented 1 year ago

Same issue here - no pods are getting OOM killed and everything still seems to be working properly.

mellieA commented 1 year ago

@slim-bean can you link the possibly related issue for users to follow?

tasiotas commented 1 year ago

Just to point out: as far as I am aware, I'm running Loki in monolithic mode, and I'm not sure if there is any frontend service running. Also, I don't have any frontend configuration in my loki-config.yml, so it's picking up the defaults.

It should be easily reproducible with my docker-compose file.

kpinarci commented 1 year ago

I have the same issue. I am using Loki version 2.7.4 as a Docker container (monolithic mode). Setting the server timeout did not help. There is plenty of RAM available that is not being used.

My Loki config

auth_enabled: false

server:
  http_listen_port: {{ loki_port }}
  http_server_read_timeout: 120s
  http_server_write_timeout: 120s
  log_level: {{ loki_log_level }}

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 30m       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 5m    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h      #The index period must be 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 1m         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: s3
  aws:
   s3: {{ loki_s3_url }}
   s3forcepathstyle: true

compactor:
  working_directory: /data/loki/boltdb-shipper-compactor
  shared_store: s3
  compaction_interval: {{ loki_compaction_interval }}
  retention_enabled: {{ loki_retention_enabled }}            
  retention_delete_delay: {{ loki_retention_delete_delay }}        
  retention_delete_worker_count: {{ loki_retention_delete_worker_count }}

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_burst_size_mb: 128 
  ingestion_rate_mb: 64 
  #max_streams_per_user: 0 
  retention_period: {{ loki_retention_global_period }}
  retention_stream:
  - selector: '{{ loki_selector_dev }}'
    priority: {{ loki_selector_dev_priority }}
    period: {{ loki_selector_dev_period }}
  - selector: '{{ loki_selector_prod }}'
    priority: {{ loki_selector_prod_priority }}
    period: {{ loki_selector_prod_period }}

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

ruler:
  storage:
    type: local
    local:
      directory: /data/loki/rules
  rule_path: /data/loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true
The errors logged:

2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.368980096Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369003636Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369057955Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369066688Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369076854Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369085519Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369094081Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369102732Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.36911159Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369120013Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369128761Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369137047Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.369147192Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.36915603Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.371227278Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.18.0.3:9095
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.371240766Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9095
2023-03-27T09:24:10+02:00       level=error ts=2023-03-27T07:24:10.371247471Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.18.0.3:9095

RobbanHoglund commented 1 year ago

Same for me. I am using Loki version 2.6.1

level=error ts=2023-03-20T17:25:49.317284526Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317371656Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317414398Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.31745973Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317476668Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317464179Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317479047Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317501386Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317506001Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317504383Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317488955Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.31751972Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317523482Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317467714Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317537508Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317541672Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317547893Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317557217Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317554939Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317566796Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317571214Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317558996Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317580561Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"

imranismail commented 1 year ago

It's happening in our environment as well; we've observed no OOMs or insufficient CPU/memory.

gillbates commented 1 year ago

Same issue, using the official sample compose file... has anyone been able to fix it?

principekiss commented 1 year ago

Same issue here using the Loki Helm chart 4.10.0.

FredTreg commented 1 year ago

Same issue on monolithic Loki 2.8.0 in Docker with a basic config using TSDB and very few logs (no OOM). The impact in Grafana is a graph not loading on a regular basis. I display two panels on a dashboard.

Only the first panel fails. If I remove the second panel, the first one never fails and the query is much faster, so I suspect a concurrency issue with the concurrent retrieval of the logs.

I only get the "scheduler" error message:

monitoring-loki-1  | level=error ts=2023-04-22T13:22:08.764144302Z caller=scheduler_processor.go:158 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=172.19.0.5:9095
kamikaze commented 1 year ago

Same with the loki-distributed-0.69.14 Helm chart.

Grafana works for a while and then just freezes when querying Loki.

srikavya-kola commented 1 year ago

Facing the same issue, any update on this, please?

alexey107 commented 1 year ago

Same issue. Noticed that it appears only when the query is wide with no filters. Loki 2.7.5 monolithic, Grafana 7.5.17, no OOM kills.

darzanebor commented 1 year ago

Any updates on this issue? Loki 2.7.5.

Stringls commented 1 year ago

Same issue for me. I am installing Loki using the official Helm chart, Loki version 2.8.2.

CodebashingDevOps commented 1 year ago

+1, getting the same errors

JeffCT0216 commented 1 year ago

+1 on loki 2.8.1

Getting this error and then "context canceled".

jlcrow commented 1 year ago

Getting this same error but with Mimir

raypettersen commented 1 year ago

We are also observing the same problem. It seems to come from the Loki read pods.

mwolter805 commented 1 year ago

Here's what worked to resolve the issue for my small single server monolithic setup. YMMV

Received up to 20 each of the following in the Loki container logs when selecting a filter in Grafana.

level=error ts=2023-07-07T14:01:11.629049296Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 query="{container=\"esphome\"} |= \"\"" err="context canceled"

level=error ts=2023-07-07T14:01:11.629982955Z caller=scheduler_processor.go:158 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096

level=error ts=2023-07-07T14:01:11.629992596Z caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=127.0.0.1:9096

level=error ts=2023-07-07T14:01:11.63036167Z caller=scheduler_processor.go:208 org_id=fake frontend=127.0.0.1:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled"

level=error ts=2023-07-07T14:01:11.630377204Z caller=scheduler_processor.go:252 org_id=fake frontend=127.0.0.1:9096 msg="error health checking" err="rpc error: code = Canceled desc = context canceled"

Added "parallelise_shardable_queries: false" to the Loki config

query_range:
  parallelise_shardable_queries: false

In the Grafana Loki data source, changed the timeout to 360 and this completely eliminated the errors.

[Screenshot: Grafana Loki data source timeout setting]

imranismail commented 1 year ago

> Added "parallelise_shardable_queries: false" to the Loki config [...] In the Grafana Loki data source, changed the timeout to 360 and this completely eliminated the errors.

We have over 20 clusters sending their logs to our Loki instance; disabling that option is a no-go for us.

cjimti commented 1 year ago

We get the same errors, and they seem to correlate with a full lockup of Loki-based dashboard panels. Sadly, we have a cron job that restarts the loki-distributed querier every few hours so that we have somewhat functional logs-based metrics. Since it has been almost a year now with no solution, we have started looking at other options.

a-patos commented 1 year ago

Same for me with Loki 2.8.2.

KyrumX commented 1 year ago

Setting parallelise_shardable_queries: false seems to fix it, as stated above. Obviously this isn't a perfect solution, though.

bmgante commented 1 year ago

Hi all,

Is there any progress on this issue? What is the real performance impact of setting parallelise_shardable_queries to false if I use S3 as object storage? Is there any expectation of this being addressed without having to disable the feature? Lastly, I get a huge amount of these errors when running Loki queries, but the queries do return data, so I just want to understand whether all these error logs can simply be ignored if we want to keep parallelise_shardable_queries set to true.

Thanks

pingping95 commented 1 year ago

Same issue...

KevinDW-Fluxys commented 1 year ago

Same for us; we can't disable parallelise_shardable_queries since it would impact performance too much.

Clasyc commented 1 year ago

Is there any update on this? In our case, the flood of unnecessary logs makes it almost impossible to find anything useful.

patsevanton commented 1 year ago

Has anyone come up with a workaround for this issue?

bioszombie commented 11 months ago

I have the same issue. Any updates?

VenkateswaranJ commented 11 months ago

Same with 2.9.0

yanyiup commented 10 months ago

Is there a specific solution available?

silentirk commented 10 months ago

Same issue with 2.9.1 and 2.9.3, monolithic setup.