grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

WAL corruption leads to endless restarts #12583

Open hervenicol opened 3 months ago

hervenicol commented 3 months ago

Describe the bug

loki-write pod is dying with this log:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235/00000004 at 65536: last record is torn\nerror recovering from TSDB WAL"

and restarts indefinitely (crashlooping).

But at each restart it replays the WAL and updates object storage. With a big WAL this can cost a lot, because all the data is sent to object storage again and again.
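
For anyone triaging this, here is a rough, hedged sketch of a standalone scanner that reports how far into the WAL the corruption sits, so you can see how much data a truncate-and-continue recovery would actually lose. It assumes the TSDB WAL segments can be read with Prometheus's tsdb/wlog package (which Loki builds on); import paths and signatures differ between versions, and this tool is not part of Loki itself.

  package main

  import (
  	"fmt"
  	"log"
  	"os"

  	"github.com/prometheus/prometheus/tsdb/wlog"
  )

  // Walks all WAL segments under the directory given on the command line and
  // reports how many records are readable before the first corruption.
  func main() {
  	dir := os.Args[1] // e.g. /var/loki/tsdb-shipper-active/wal/s3_2024-01-02/1712203235

  	sr, err := wlog.NewSegmentsReader(dir)
  	if err != nil {
  		log.Fatalf("open segments: %v", err)
  	}
  	defer sr.Close()

  	r := wlog.NewReader(sr)
  	records := 0
  	for r.Next() {
  		records++
  	}
  	if err := r.Err(); err != nil {
  		fmt.Printf("read %d records, then hit corruption in segment %d at offset %d: %v\n",
  			records, r.Segment(), r.Offset(), err)
  		os.Exit(1)
  	}
  	fmt.Printf("WAL looks clean: %d records\n", records)
  }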

To Reproduce

Steps to reproduce the behavior:

  1. Running Loki 2.9.6
  2. It happens once in a while on clusters that are a bit undersized and where pods tend to die OOM.
  3. This does not happen consistently.

Expected behavior

I can understand that the WAL can get corrupted when the app crashes unexpectedly. But when the WAL is corrupted, maybe Loki should just discard it? That way it would start properly after a single crash instead of retrying endlessly.

Environment:

DylanGuedes commented 3 months ago

But maybe when the WAL is corrupted it should discard it?

tbh I prefer the current behavior, as it at least gives you the chance of doing a backup of your corrupted WAL etc.

slim-bean commented 3 months ago

The corruption and restart loops are rough; I don't think I like this behavior. It can make it hard to recover from an outage caused by being overwhelmed with data, when multiple ingesters may be in this situation.

And in particular, the corruption here comes from the last segment, which failed because it hit the disk limit, so truncating the WAL at whatever we could read is really the best you can do anyway.

I did look at the code paths here a little while ago and it wasn't a quick fix to change the behavior, but I do think I'd prefer that we handle this automatically.
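
For illustration, a minimal sketch of what that automatic handling could look like, assuming the TSDB WAL is read through Prometheus's tsdb/wlog package (which Loki builds on) and that the replay path can hand the corruption to that package's Repair helper. replayOrRepair and the replay callback are made-up names, and signatures vary between Prometheus/Loki versions:

  package walrepair

  import (
  	gokitlog "github.com/go-kit/log"
  	"github.com/prometheus/prometheus/tsdb/wlog"
  )

  // replayOrRepair replays every readable record from dir via the replay
  // callback; if it then hits corruption, it truncates the WAL at the last
  // good record instead of returning the error (which today makes Loki exit
  // and crashloop).
  func replayOrRepair(dir string, replay func(rec []byte) error) error {
  	w, err := wlog.Open(gokitlog.NewNopLogger(), dir)
  	if err != nil {
  		return err
  	}
  	defer w.Close()

  	sr, err := wlog.NewSegmentsReader(dir)
  	if err != nil {
  		return err
  	}
  	r := wlog.NewReader(sr)
  	for r.Next() {
  		if err := replay(r.Record()); err != nil {
  			sr.Close()
  			return err
  		}
  	}
  	readErr := r.Err()
  	sr.Close()
  	if readErr == nil {
  		return nil // clean replay, nothing to repair
  	}

  	// Wrap the reader error the way Prometheus's own WAL replay does, then
  	// let Repair drop everything after the last readable record.
  	cerr := &wlog.CorruptionErr{
  		Dir:     dir,
  		Segment: r.Segment(),
  		Offset:  r.Offset(),
  		Err:     readErr,
  	}
  	return w.Repair(cerr)
  }

At least in the Prometheus implementation, Repair rewrites the corrupt segment up to the last good record and deletes the segments after it, which matches the "truncate the WAL at whatever we could read" idea above.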

Interestingly, we don't ever see this, and after some discussions with folks who do, I think the difference is that we have 100GB PVCs and typically send at most about 15MB/s to an ingester, so it becomes pretty hard for us to run out of WAL disk before running out of memory.
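
For a rough sense of that headroom, taking those figures at face value: 100 GB / 15 MB/s ≈ 6,700 seconds, i.e. close to two hours of sustained writes with no truncation at all before such a PVC fills, which is far more slack than a small WAL volume on an overloaded ingester gets.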

jcdauchy-moodys commented 2 months ago

Is there anything that can be done when this corruption happens? I hit this error: error running loki" err="corruption in segment /var/loki/tsdb-index/wal. The quick fix was to recreate the PVC empty (and lose some logs, of course).

Thanks

Farisagianda commented 1 month ago

Getting the same on prod environment, any updates on this?

DanielCastronovo commented 1 month ago

Same here

t4ov commented 3 weeks ago

yep, same for us too.

LukoJy3D commented 2 weeks ago

Just had this as well; interestingly enough, it was on a month-old segment that was about to be deleted by retention:

msg="error running loki" err="corruption in segment /var/loki/tsdb-shipper-active/wal/gcs_2024-06-01/1719931836/00000000 at 8255: unexpected checksum a7f03ba2

Or maybe it was corrupted on that PVC earlier on, as it scales up and down. I will check whether that repeats. Either way, suggestions on how to handle this other than deleting the PVC would be appreciated :pray:

oskarm93 commented 1 week ago

Similar problem in Loki 3.1 on AKS disks (attached: code-stdin-4UP.txt). So far I have been deleting the PVC and pod to recover, but this is very cumbersome.