tasiotas opened this issue 2 years ago
Experiencing the same issue with pretty much the same config. Debug-level logs:
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984620841Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/distributor
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984615691Z caller=mock.go:150 msg=Get key=collectors/scheduler wait_index=14
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984607606Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/scheduler
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984601256Z caller=mock.go:150 msg=Get key=collectors/compactor wait_index=16
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984594181Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/compactor
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984583804Z caller=mock.go:150 msg=Get key=collectors/ring wait_index=17
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.984529071Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/ring
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.714265066Z caller=logging.go:76 traceID=3e788abb64e0e1cc orgID=fake msg="POST /loki/api/v1/push (204) 538.586µs"
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.714180031Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=121.717µs msg="gRPC (success)"
Nov 14 09:47:18 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:18.713825407Z caller=push.go:150 org_id=fake msg="push request parsed" path=/loki/api/v1/push contentType=application/x-protobuf contentEncoding= bodySize="1.5 kB" streams=3 entries=33 streamLabelsSize="134 B" entriesSize="5.5 kB" totalSize="5.7 kB" mostRecentLagMs=507
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983745696Z caller=mock.go:150 msg=Get key=collectors/ring wait_index=17
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.98374076Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/ring
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983734963Z caller=mock.go:150 msg=Get key=collectors/distributor wait_index=15
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983730374Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/distributor
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983726097Z caller=mock.go:150 msg=Get key=collectors/scheduler wait_index=14
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983721069Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/scheduler
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983713093Z caller=mock.go:150 msg=Get key=collectors/compactor wait_index=16
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.983672895Z caller=mock.go:186 msg="Get - deadline exceeded" key=collectors/compactor
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613682421Z caller=logging.go:76 traceID=0b69d212465ca9fd orgID=fake msg="POST /loki/api/v1/push (204) 517.968µs"
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613572175Z caller=grpc_logging.go:46 method=/logproto.Pusher/Push duration=76.721µs msg="gRPC (success)"
Nov 14 09:47:17 ned loki-linux-amd64[406979]: level=debug ts=2022-11-13T23:17:17.613234883Z caller=push.go:150 org_id=fake msg="push request parsed" path=/loki/api/v1/push contentType=application/x-protobuf contentEncoding= bodySize="1.5 kB" streams=3 entries=33 streamLabelsSize="134 B" entriesSize="5.5 kB" totalSize="5.7 kB" mostRecentLagMs=410
Tried adding the WAL config as per #2753, with no luck:
ingester:
wal:
enabled: true
dir: /loki/wal
I'm getting the same issue.
We are running into the same issue with Loki hosted on Kubernetes and using Azure Blob storage.
The following is repeated in the logs:
level=error ts=2022-11-30T17:59:50.205452691Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205437491Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205424091Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.205411191Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.20539419Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:50.20535119Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.735217069Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.735171068Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734160865Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734131164Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.734013755Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733509056Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733488356Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733398855Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=[redacted]:9095
level=error ts=2022-11-30T17:59:44.733297354Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=[redacted]:9095
level=error ts=2022-11-30T17:59:44.731098021Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.731078221Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.73101162Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.730870719Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-11-30T17:59:44.730831518Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
This is a dev environment, so I am open to making any suggested changes to resolve this.
I have the same issue. Helm chart: loki-stack 2.8.7, Loki v2.6.1.
I have the same issue.
I have the same issue, even with the default -config.file=/etc/loki/local-config.yaml. Does anyone know how to write a Promtail pipeline_stages config to drop those messages?
i.e. msg="error notifying scheduler about finished query"
Thanks
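For anyone wanting to silence these on the Promtail side, a minimal sketch using Promtail's drop stage might look like the following; the job label and __path__ are placeholders, and the regex only drops the noisy "notifying scheduler/frontend" lines, so adapt it to your own scrape config:
scrape_configs:
  - job_name: loki-own-logs            # hypothetical job name
    static_configs:
      - targets:
          - localhost
        labels:
          job: loki                    # placeholder label
          __path__: /var/log/loki.log  # placeholder path to Loki's own log file
    pipeline_stages:
      # drop any log line whose content matches this RE2 expression
      - drop:
          expression: 'error notifying (scheduler|frontend) about finished query'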
I wonder if that PR is supposed to fix this issue #7978
https://raw.githubusercontent.com/grafana/loki/v2.7.0/production/docker-compose.yaml
Even this basic example has this issue, without any custom config.
Oh, I'm using version 2.7.1.
Same issue with Loki version 2.7.3
msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled"
msg="error notifying scheduler about finished query" err=EOF
If I set query_range.parallelise_shardable_queries to false (cf. grafana.slack.com), I no longer get the notifying frontend/scheduler errors.
But this warning appears:
`msg="max concurrency is not evenly divisible across targets, adding an extra connection"`
Same here as well.
limits_config:
ingestion_rate_mb: 500
retention_period: 30d
per_stream_rate_limit: 512M
per_stream_rate_limit_burst: 1024M
max_query_series: 99999999
query_timeout: 5m
querier:
query_timeout: 5m
engine:
timeout: 5m
I also tried adding -querier.engine.timeout=5m to the service command line:
/opt/loki/loki -target=all,table-manager -config.file=/opt/loki/loki-local-config.yaml -querier.engine.timeout=5m
and, in grafana.ini:
[dataproxy]
timeout = 600
Queries still time out after 3 minutes.
FIX FOR ME:
Finally I realized that the Loki data source you create in Grafana has its own timeout setting, which seems to override the Grafana data proxy setting. Mine was set to 180 there, and bumping it up allowed my queries against that data source to run longer. Hope this helps someone.
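If you provision the data source from a file rather than the UI, a hedged sketch of the same change might look like this; the name and URL are placeholders, and jsonData.timeout is assumed to correspond to the HTTP "Timeout" field (in seconds) in the data source settings:
apiVersion: 1
datasources:
  - name: Loki                 # placeholder name
    type: loki
    access: proxy
    url: http://loki:3100      # placeholder URL
    jsonData:
      timeout: 600             # raise the per-data-source HTTP timeout (seconds)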
+1 Same issue:
Doing the above (Grafana data source timeout) didn't seem to help for me. But I noticed that on mine it kept restarting the pod on the same node... once I cordoned off that node and deleted the pod (to force a restart elsewhere), the issue went away. It doesn't make a huge amount of sense, but I thought I'd share in case it adds a clue to what is going on.
Maybe the new Loki node location forced some sort of networking reset on the Promtail daemonset that was trying to connect to it? I dunno, grasping at straws here.
EOF errors are often indicative of something running out of memory and OOM crashing.
I suspect for most of the examples here your frontend pods are OOM crashing on queries.
This can happen for a few reasons; typically it's a logs query (metric queries return samples rather than log lines, so it's harder to OOM a frontend with them, but it's not impossible).
Running more frontends, or increasing their memory limits, is typically how you work around it.
There was a change made recently, however, to help with one subset of cases where we saw this happening a lot. It really only affects anyone querying with just label matchers, e.g. {job="foo"} with nothing else (no filters): we were parallelizing these queries too aggressively, and if the label selector matches enough data (typically TBs a day for the streams) you could really thrash the frontends.
That change isn't in a release yet, but hopefully we'll have a release in a few weeks.
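As a rough illustration of "run more frontends, or increase their memory limits", a hypothetical Kubernetes fragment for a query-frontend Deployment might look like this; the names, image tag, and values are placeholders to adapt to your own chart or manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-query-frontend          # hypothetical name
spec:
  replicas: 2                        # run more frontends
  selector:
    matchLabels:
      app: loki-query-frontend
  template:
    metadata:
      labels:
        app: loki-query-frontend
    spec:
      containers:
        - name: query-frontend
          image: grafana/loki:2.7.4
          args:
            - -target=query-frontend
            - -config.file=/etc/loki/config.yaml   # config mounted separately
          resources:
            requests:
              memory: 1Gi
            limits:
              memory: 2Gi            # raise this if the frontend is OOM-killed on large log queries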
I think the problem is more related to Loki's default values than to the frontend configuration, especially when there is none.
With a config similar to the OP's, on a test environment with one server running Loki v2.7.4 + Grafana v9.3.1 and Promtail v2.7.4 scanning its own /var/log dir, I'm getting hundreds of error lines like these:
Mar 4 12:45:05 prom loki[66304]: level=error ts=2023-03-04T11:45:04.909077383Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:3196
Mar 4 12:45:05 prom loki[66304]: level=error ts=2023-03-04T11:45:04.909093885Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=192.168.1.12:3196
It occurs non-stop when a single log panel in Grafana is connected to Loki with only the filter job="varlogs".
Since 192.168.1.12:3196 is the gRPC listen port, my question is: why does Loki need to connect to its own gRPC server, and why is it failing miserably?
For the record, here are my server and frontend sections from curl http://localhost:3100/config.
Please note I don't have a frontend section in my Loki config file, hence address="" and port=0.
server:
http_listen_network: tcp
http_listen_address: 0.0.0.0
http_listen_port: 3100
http_listen_conn_limit: 0
grpc_listen_network: tcp
grpc_listen_address: 0.0.0.0
grpc_listen_port: 3196
grpc_listen_conn_limit: 0
tls_cipher_suites: ""
tls_min_version: ""
http_tls_config:
cert_file: ""
key_file: ""
client_auth_type: ""
client_ca_file: ""
grpc_tls_config:
cert_file: ""
key_file: ""
client_auth_type: ""
client_ca_file: ""
register_instrumentation: true
graceful_shutdown_timeout: 30s
http_server_read_timeout: 30s
http_server_write_timeout: 30s
http_server_idle_timeout: 2m0s
grpc_server_max_recv_msg_size: 4194304
grpc_server_max_send_msg_size: 4194304
grpc_server_max_concurrent_streams: 100
grpc_server_max_connection_idle: 2562047h47m16.854775807s
grpc_server_max_connection_age: 2562047h47m16.854775807s
grpc_server_max_connection_age_grace: 2562047h47m16.854775807s
grpc_server_keepalive_time: 2h0m0s
grpc_server_keepalive_timeout: 20s
grpc_server_min_time_between_pings: 10s
grpc_server_ping_without_stream_allowed: true
log_format: logfmt
log_level: warn
log_source_ips_enabled: false
log_source_ips_header: ""
log_source_ips_regex: ""
log_request_at_info_level_enabled: false
http_path_prefix: ""
(...)
frontend:
log_queries_longer_than: 0s
max_body_size: 10485760
query_stats_enabled: false
max_outstanding_per_tenant: 2048
querier_forget_delay: 0s
scheduler_address: ""
scheduler_dns_lookup_period: 10s
scheduler_worker_concurrency: 5
grpc_client_config:
max_recv_msg_size: 104857600
max_send_msg_size: 104857600
grpc_compression: ""
rate_limit: 0
rate_limit_burst: 0
backoff_on_ratelimits: false
backoff_config:
min_period: 100ms
max_period: 10s
max_retries: 10
tls_enabled: false
tls_cert_path: ""
tls_key_path: ""
tls_ca_path: ""
tls_server_name: ""
tls_insecure_skip_verify: false
tls_cipher_suites: ""
tls_min_version: ""
instance_interface_names:
- ens32
- lo
address: ""
port: 0
compress_responses: false
downstream_url: ""
tail_proxy_url: ""
tail_tls_config:
tls_cert_path: ""
tls_key_path: ""
tls_ca_path: ""
tls_server_name: ""
tls_insecure_skip_verify: false
tls_cipher_suites: ""
tls_min_version: ""
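On the "why does Loki dial its own gRPC server" question above: even in a single binary, the embedded query-frontend, scheduler, and querier workers talk to each other over gRPC on the listen port, which is why 127.0.0.1/192.168.1.12:3196 shows up in these errors. An untested sketch, based only on the address/port fields visible in the dump above, would be to pin the address the frontend advertises so those callbacks stay on loopback:
frontend:
  # advertise loopback instead of auto-detecting via instance_interface_names (was "")
  address: 127.0.0.1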
@usmangt I removed Observability Logs squad as this is related to Loki, not Loki data source in Grafana.
+1 same issue
Same issue here - no pods are getting OOM killed and everything still seems to be working properly.
@slim-bean can you link the possibly related issue for users to follow?
Just to point out: as far as I am aware, I'm running Loki in monolithic mode, and I'm not sure whether any frontend service is running.
Also, I don't have any frontend configuration in my loki-config.yml, so it's picking up defaults.
It should be easily reproducible with my docker-compose file.
I have the same issue. I am using Loki 2.7.4 as a Docker container (monolithic mode). Setting the server timeout did not help. There is plenty of free RAM that isn't even being used.
My Loki config
auth_enabled: false
server:
http_listen_port: {{ loki_port }}
http_server_read_timeout: 120s
http_server_write_timeout: 120s
log_level: {{ loki_log_level }}
ingester:
lifecycler:
address: 127.0.0.1
ring:
kvstore:
store: inmemory
replication_factor: 1
final_sleep: 0s
chunk_idle_period: 30m # Any chunk not receiving new logs in this time will be flushed
max_chunk_age: 1h # All chunks will be flushed when they hit this age, default is 1h
chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
chunk_retain_period: 5m # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
max_transfer_retries: 0 # Chunk transfers disabled
schema_config:
configs:
- from: 2020-10-24
store: boltdb-shipper
object_store: s3
schema: v11
index:
prefix: index_
period: 24h #The index period must be 24h
storage_config:
boltdb_shipper:
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
cache_ttl: 1m # Can be increased for faster performance over longer query periods, uses more disk space
shared_store: s3
aws:
s3: {{ loki_s3_url }}
s3forcepathstyle: true
compactor:
working_directory: /data/loki/boltdb-shipper-compactor
shared_store: s3
compaction_interval: {{ loki_compaction_interval }}
retention_enabled: {{ loki_retention_enabled }}
retention_delete_delay: {{ loki_retention_delete_delay }}
retention_delete_worker_count: {{ loki_retention_delete_worker_count }}
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 168h
ingestion_burst_size_mb: 128
ingestion_rate_mb: 64
#max_streams_per_user: 0
retention_period: {{ loki_retention_global_period }}
retention_stream:
- selector: '{{ loki_selector_dev }}'
priority: {{ loki_selector_dev_priority }}
period: {{ loki_selector_dev_period }}
- selector: '{{ loki_selector_prod }}'
priority: {{ loki_selector_prod_priority }}
period: {{ loki_selector_prod_period }}
chunk_store_config:
max_look_back_period: 0s
table_manager:
retention_deletes_enabled: false
retention_period: 0s
ruler:
storage:
type: local
local:
directory: /data/loki/rules
rule_path: /data/loki/rules-temp
alertmanager_url: http://localhost:9093
ring:
kvstore:
store: inmemory
enable_api: true
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.368980096Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369003636Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369057955Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369066688Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369076854Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369085519Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369094081Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369102732Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.36911159Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369120013Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369128761Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369137047Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.369147192Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.36915603Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.371227278Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.18.0.3:9095
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.371240766Z caller=scheduler_processor.go:137 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9095
2023-03-27T09:24:10+02:00 level=error ts=2023-03-27T07:24:10.371247471Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled" frontend=172.18.0.3:9095
Same for me. I am using Loki version 2.6.1
level=error ts=2023-03-20T17:25:49.317284526Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317371656Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317414398Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.31745973Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317476668Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317464179Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317479047Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317501386Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317506001Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317504383Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317488955Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.31751972Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317523482Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317467714Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317537508Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317541672Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317547893Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317557217Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317554939Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317566796Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317571214Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317558996Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2023-03-20T17:25:49.317580561Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
It's happening on our environment as well, we've observed no OOMs or insufficient CPU/Memory.
Same issue, using the official sample compose ... has anyone been able to fix it?
Same issue here using the Loki Helm chart 4.10.0.
Same issue on monolithic Loki 2.8.0 in Docker with a basic config using tsdb and very few logs (no OOM). The impact in Grafana is a graph that fails to load on a regular basis. I display 2 panels on a dashboard: a count_over_time of the logs, and the logs themselves. Only the first panel fails. If I remove the second panel, the first one never fails and the query is much faster, so I suspect a concurrency issue with the concurrent retrieval of the logs.
I only get the "scheduler" error message:
monitoring-loki-1 | level=error ts=2023-04-22T13:22:08.764144302Z caller=scheduler_processor.go:158 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=172.19.0.5:9095
Same with the loki-distributed 0.69.14 Helm chart.
Grafana works for a while and then just freezes when querying Loki.
Facing the same issue, any update on this please?
Same issue. Noticed that it appears only when the query is wide with no filters. Loki 2.7.5 monolithic, Grafana 7.5.17, no OOM kills.
Any updates on this issue? Loki 2.7.5.
Same issue for me. I am installing Loki using the official Helm chart, Loki version 2.8.2.
+1, getting the same errors
+1 on loki 2.8.1
Getting this error and then context canceled
Getting this same error but with Mimir
We are also observing the same problem. Seems to be the loki read pods.
Here's what worked to resolve the issue for my small single server monolithic setup. YMMV
Received up to 20 each of the following in the Loki container logs when selecting a filter in Grafana.
level=error ts=2023-07-07T14:01:11.629049296Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 query="{container=\"esphome\"} |= \"\"" err="context canceled"
level=error ts=2023-07-07T14:01:11.629982955Z caller=scheduler_processor.go:158 org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=127.0.0.1:9096
level=error ts=2023-07-07T14:01:11.629992596Z caller=scheduler_processor.go:106 msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=127.0.0.1:9096
level=error ts=2023-07-07T14:01:11.63036167Z caller=scheduler_processor.go:208 org_id=fake frontend=127.0.0.1:9096 msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled"
level=error ts=2023-07-07T14:01:11.630377204Z caller=scheduler_processor.go:252 org_id=fake frontend=127.0.0.1:9096 msg="error health checking" err="rpc error: code = Canceled desc = context canceled"
Added "parallelise_shardable_queries: false" to the Loki config
query_range:
parallelise_shardable_queries: false
In the Grafana Loki data source, changed the timeout to 360 and this completely eliminated the errors.
We have over 20 clusters sending their logs to our Loki instance, disabling that option is a no go for us.
We get the same errors, and they seem to correlate to a full lockup of loki based dashboard panels. Sadly we have a cron that reboots the loki distributed querier every few hours so we have some functional logs-based metrics. Since it has been almost a year now with no solution, we have started looking for other options.
Same for me with Loki 2.8.2.
Setting parallelise_shardable_queries: false seems to fix it, as stated above. However, this obviously isn't a perfect solution.
Hi all,
Is there any progress on this issue? What is the real performance impact of setting parallelise_shardable_queries to false if I have S3 as object storage? Is there any expectation of this being addressed without having to disable the feature? Lastly, I get a huge number of these errors when running Loki queries, but the queries do return data, so I would just like to understand whether all these error logs can simply be ignored if we want to keep parallelise_shardable_queries set to true.
Thanks
Same issue...
Same for us, we can't disable parallelise_shardable_queries since it would impact performance too much.
Is there any update on this? In our case, the flood of unnecessary logs makes it almost impossible to find anything useful.
has anyone come up with a workaround for this issue?
I have the same issue. Any updates?
Same with 2.9.0
Is there a specific solution available?
Same issue with 2.9.1 and 2.9.3, monolithic setup.
Hi,
I am getting a lot of these errors.
I found a similar issue related to the frontend address and added it to my config, but it didn't help.
Here is my docker-compose.yml. My loki-config.yml is based on complete-local-config.yaml from the docs.
Any ideas what is causing this? Thank you.