grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

failed mapping AST context canceled #14384

Open RobbanHoglund opened 1 month ago

RobbanHoglund commented 1 month ago

Describe the bug
After upgrading to Loki 3.2.0, Loki repeatedly logs the following for some queries:

ts=2024-10-04T01:09:28.649297614Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake traceID=35d59f6204919421 user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="{application=\"myapp\",level=~\"INFO|WARN|ERROR\"} "

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 3.2.0
  2. From Grafana Explore, run {application="myapp",level=~"INFO|WARN|ERROR|TRACE"}

Expected behavior
If this is an actual problem in Loki 3.2.0, it would be good to get more information about the root cause.

Environment:

Screenshots, Promtail config, or terminal output
(screenshots attached)

jammiemil commented 1 month ago

Also seeing this in a cluster running 3.2.0, however it's only happening on one of our read pods (running SSD via Helm). The same pod is reporting high query latency (up to 50s) and a fair amount of context-canceled errors in the logs. All the other read pods are behaving just fine and returning queries in sub-second times.

eplightning commented 1 month ago

I ran into something similar with 3.1.1. Restarting the broken pod seems to have fixed it.

yalattas commented 1 month ago

Indeed, I ran into this issue in 3.2.0, and killing the pod and spinning up a new one fixed it.

vincentnonim commented 1 month ago

I'm having the same issue with Loki 3.2.0 deployed using Docker and SSD. Once in a while the read container returns msg="failed mapping AST" err="context canceled" errors, and the only way to resolve it is to restart the read container. I don't spot any issues in the metrics...

someStrangerFromTheAbyss commented 3 weeks ago

+1 here. When it happens, it seems noticeably more severe on one pod than on the others. Running Loki 3.2.1.

JeffreyVdb commented 3 weeks ago

Restarting the pod works for me as well.

RobbanHoglund commented 3 weeks ago

We are running monolithic mode and have restarted the Loki processes several times, but nothing seems to help.

Khagesh16 commented 3 weeks ago

Same issue in my case; Loki is deployed in simple scalable mode. When I run the query, I keep seeing this error. Restarting or redeploying does not help.

someStrangerFromTheAbyss commented 2 weeks ago

OK, so I restarted the read pods. I still get the error, but no more EOF errors/timeouts from queries. A good way to detect whether the impact is noticeable from a client's perspective seems to be to check whether nginx is returning any 499 HTTP status codes. There is probably a Loki metric somewhere that indicates this, but I could not find it. I also have no idea yet how to work around the problem. I will try to look at the code to see where this error comes from and what we can do about it.
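
If you want to surface those 499s automatically, a minimal Loki ruler alert along these lines might work. This is only a sketch: it assumes the ruler is enabled and that the gateway's nginx access logs are collected under a container="loki-gateway" label, and the threshold is arbitrary; adjust the label and the line filter to your own log format.

```yaml
groups:
  - name: loki-gateway-499
    rules:
      - alert: LokiGatewayClientCancellations
        # Count access-log lines containing " 499 " over 5 minutes.
        # The bare " 499 " filter can over-match (e.g. byte counts), so
        # tighten the pattern to your nginx access-log format if needed.
        expr: 'sum(count_over_time({container="loki-gateway"} |= " 499 " [5m])) > 10'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: loki-gateway is returning 499s (clients giving up on slow reads)
```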

someStrangerFromTheAbyss commented 2 weeks ago

I did some digging, and my initial EOF/timeout problems are not linked to that warning. I tried removing a bunch of things from my config hoping to get rid of the warning, but could not find any way to remove it; removing my old store config did not make it disappear either. Looking at the code, it seems to happen when parsing and sharding the query.

I will sadly stop the investigation here, since my query problems do not seem to be linked to this warning. Still, it would be nice to have this warning disappear in the next release.

emadolsky commented 2 weeks ago

+1 we also see this issue occasionally and deleting the affected pod fixes the issue.

BlexToGo commented 2 weeks ago

We run Loki 3.1.1 with deploymentMode: SimpleScalable and also saw the same error. Restarting the loki-read deployment fixed the issue in seconds.

marcotuna commented 2 weeks ago

I'm experiencing the same issue. I have a few clusters running Loki 3.2.1 in Simple Scalable Mode (SSD). All clusters display the warning "failed mapping AST," though it doesn’t seem to impact Loki’s functionality, except in two specific clusters. In these cases, similar to previous reports, one of the Pods stops processing queries and only resumes normal operation after a restart.

This issue began occurring only after upgrading from version 2.9.X to 3.2.X, following the guide for upgrading to version 3.X.

As @someStrangerFromTheAbyss mentioned, there is a noticeable increase in 499 HTTP status codes, which can be seen on the "Loki Operational" Grafana Dashboard.

In my observations, the affected Pod still receives requests from Nginx (loki-gateway) but doesn’t return a response, appearing to get stuck while processing. My deployment relies on AWS S3.

someStrangerFromTheAbyss commented 1 week ago

For the HTTP 499 status codes, I can confirm these are not linked. I was able to resolve the 499 issue separately from this error message. I wrote up a detailed answer on the 499 issue here: https://github.com/grafana/loki/issues/7084

TL;DR: The 499 issue is linked to the boltdb-shipper index and the connection to the backend when using boltdb-shipper. If you can remove these, the 499s should disappear.
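
For reference, moving off boltdb-shipper typically means adding a new schema_config period that switches to TSDB from a future date, rather than deleting the old entry (the boltdb-shipper period has to stay until its data ages out of retention). A hedged sketch follows; the dates, object store, and schema versions are placeholders, not values from this thread:

```yaml
schema_config:
  configs:
    # Existing boltdb-shipper period: keep it so older data stays queryable
    # until it falls out of retention.
    - from: 2022-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
    # New period starting on a future date: index with TSDB instead.
    - from: 2024-12-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
```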

jan-kantert commented 2 days ago

Happened to us as well. Any idea how to catch this with a livenessProbe or something similar? It fails quite silently, and you will only notice once you use it (Schrödinger's Loki).
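
One idea, as an untested sketch rather than something verified against this bug: point the livenessProbe at the query API with a trivial query instead of /ready, since /ready apparently keeps reporting healthy while the pod is stuck. The port, query, and thresholds below are assumptions about a default-ish deployment.

```yaml
livenessProbe:
  httpGet:
    # vector(1) is a trivial LogQL instant query that needs no stored data,
    # so the probe exercises the query path rather than just the readiness
    # handler. Whether it trips on this exact failure mode is untested.
    path: /loki/api/v1/query?query=vector(1)
    port: 3100            # default http_listen_port; adjust to your config
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 6     # be generous so transient slow queries don't trigger restarts
```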