RobbanHoglund opened 1 month ago
Also seeing this in a cluster running 3.2.0, however it's only happening on one of our read pods (running SSD via Helm). The same pod is reporting high query latency (up to 50s) and a fair amount of "context canceled" in the logs. All the other read pods are behaving just fine and returning queries in sub-seconds.
I ran into something similar with 3.1.1. Restarting the broken pod seems to have fixed it.
Indeed, I ran into this issue in 3.2.0, and killing the pod and spinning up a new one fixed it.
I'm having the same issue with Loki 3.2.0 deployed using Docker and SSD. Once in a while the read container returns msg="failed mapping AST" err="context canceled" errors, and the only way to resolve it is to restart the read container. I don't spot any issues in the metrics...
+1 here. When it happens, it seems more severe on one pod in particular than on the others. Running Loki 3.2.1.
Restarting the pod works for me as well.
We are running monolithic mode and we have restarted the Loki processes several times but nothing seems to help.
Same issue in my case, with Loki deployed in simple scalable mode. When I run a query, I keep seeing this error. Restarting or redeploying does not help.
OK, so I restarted the read pods; I still get the error, but no more EOF errors/timeouts from queries. A good way to detect whether the impact is noticeable from the client's perspective seems to be to check whether nginx returns any 499 HTTP status codes. There is probably a Loki metric somewhere that indicates it, but I could not find one. Also, no idea yet how to work around the problem. I will try to look at the code to see where this error comes from and what we can do about it.
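Not an official metric, but one way to surface this from the client side is to alert on 499s in the gateway's own access logs, assuming those logs are also shipped to Loki. A minimal ruler sketch; the container="loki-gateway" selector, the match string, and the thresholds are assumptions, adjust them to whatever labels your setup actually produces:

```yaml
# Hypothetical Loki ruler alert: fire when the nginx gateway starts
# returning 499s (client closed connection, often a stuck read pod).
# Label selector and thresholds are placeholders for your environment.
groups:
  - name: loki-gateway
    rules:
      - alert: LokiGateway499s
        expr: |
          sum(count_over_time({container="loki-gateway"} |= " 499 " [5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "loki-gateway is returning 499s; a read pod may be stuck"
```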
I did some digging, and my initial EOF/timeout problems are not linked to that warning. I tried removing a bunch of things from my config hoping to get rid of the warning, but could not find any way to do so. Removing my old store config did not make it disappear either. Looking at the code, it seems to happen when parsing and sharding the query.
I will sadly stop the investigation here since my query problems do not seem to be linked to this warning. Still, it would be nice to have this warning disappear in the next release.
+1, we also see this issue occasionally, and deleting the affected pod fixes it.
We run Loki 3.1.1 with deploymentMode: SimpleScalable and also saw the same error. Restarting the loki-read deployment fixed the issue in seconds.
I'm experiencing the same issue. I have a few clusters running Loki 3.2.1 in Simple Scalable Mode (SSD). All clusters display the warning "failed mapping AST," though it doesn’t seem to impact Loki’s functionality, except in two specific clusters. In these cases, similar to previous reports, one of the Pods stops processing queries and only resumes normal operation after a restart.
This issue began occurring only after upgrading from version 2.9.X to 3.2.X, following the guide for upgrading to version 3.X.
As @someStrangerFromTheAbyss mentioned, there is a noticeable increase in 499 HTTP status codes, which can be seen on the "Loki Operational" Grafana Dashboard.
In my observations, the affected Pod still receives requests from Nginx (loki-gateway) but doesn’t return a response, appearing to get stuck while processing. My deployment relies on AWS S3.
For the HTTP 499 code, I can confirm the two are not linked. I was able to resolve the 499 issue separately from that error message. I wrote a detailed answer on the 499 issue here: https://github.com/grafana/loki/issues/7084
TL;DR: The 499 issue is linked to the boltdb-shipper index and the connection to the backend when using boltdb-shipper. If you can move off boltdb-shipper, the 499s should disappear.
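For anyone who wants to follow that advice and move off boltdb-shipper: Loki lets you add a new schema period so old data stays readable under its existing index while new data uses TSDB. A hedged sketch of what the schema_config change could look like; the dates, schema versions, and object_store are placeholders, not values from this thread:

```yaml
# Sketch only: add a new period that starts after the config rollout,
# so existing boltdb-shipper data keeps its old index.
schema_config:
  configs:
    - from: 2022-01-01            # example: existing boltdb-shipper period
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
    - from: 2024-12-01            # placeholder: a date after the rollout
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```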
Happened to us as well. Any idea how to catch this with a livenessProbe or something similar? It fails quite silently and you only notice once you actually query it (Schrödinger's Loki).
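Nothing built in that I know of, but since the pod apparently keeps passing /ready while refusing to answer queries, one workaround is a livenessProbe on the read pods that issues a trivial instant query. A hedged sketch; the port, timings, and the assumption of auth_enabled: false are placeholders, and it's not guaranteed that a trivial query exercises the exact code path that gets stuck:

```yaml
# Sketch of a livenessProbe for the read pods that runs a cheap instant
# query instead of hitting /ready. If the pod stops answering queries,
# the kubelet restarts it. Assumes auth_enabled: false; with multi-tenancy
# you would also need an X-Scope-OrgID httpHeader.
livenessProbe:
  httpGet:
    path: /loki/api/v1/query?query=vector(1)
    port: 3100
  initialDelaySeconds: 60
  periodSeconds: 60
  timeoutSeconds: 15
  failureThreshold: 3
```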
Describe the bug
After upgrading to Loki 3.2.0 it repeatedly logs this for some queries:
ts=2024-10-04T01:09:28.649297614Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake traceID=35d59f6204919421 user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="{application=\"myapp\",level=~\"INFO|WARN|ERROR\"} "
Expected behavior
If this is an actual problem in Loki 3.2.0, it would be good to get more information about the root cause.