prometheus-operator / runbooks

https://runbooks.prometheus-operator.dev

PrometheusTSDBCompactionsFailing instructions for corrupted WAL files #53

Open · elchenberg opened this issue 1 year ago

elchenberg commented 1 year ago

When I had PrometheusTSDBCompactionsFailing alerts, the cause was corrupted WAL files, with error messages in the logs looking like this: `WAL truncation in Compact: create checkpoint: read segments: corruption in segment /prometheus/wal/00018151 at 72: unexpected full record`.
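For reference, a minimal sketch of how to look for that error, assuming a kube-prometheus-style setup with a pod named prometheus-k8s-0 in the monitoring namespace (both names are hypothetical; substitute your own):

```sh
# Hypothetical pod and namespace names; adjust to your deployment.
kubectl logs -n monitoring prometheus-k8s-0 -c prometheus \
  | grep -i "corruption in segment"
```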

With the following procedure I was able to fix the issue:

  1. Exec into the pod (or find the mount path of the PersistentVolumeClaim on the host) and delete the corrupted file (in the example above: `rm /prometheus/wal/00018151`). Steps 1-5 are consolidated in the first sketch after this list.
  2. Delete all WAL files in `/prometheus/wal` that are older than the file deleted in the previous step (for example `rm /prometheus/wal/00018150`).
  3. Create empty files in place of all the files deleted in the previous steps (for example `touch /prometheus/wal/00018150 /prometheus/wal/00018151`).
  4. Make sure the file ownership and permissions match those of the other WAL files (e.g. `chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151` and `chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151`).
  5. Restart the pod.
  6. Depending on how long ago the last successful compaction was, the next compaction might use a lot of memory and take a while. Watch whether the pod gets out-of-memory-killed and, if so, (temporarily) increase the memory requests and limits of the prometheus container (see the second sketch after this list). If the container terminates with exit code zero, the logs end with "See you next time!", and the pod events (`kubectl describe`) show a failed startup probe, (temporarily) disable the startupProbe and the livenessProbe.
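A consolidated sketch of steps 1-5, assuming the same hypothetical pod and namespace names as above and the segment numbers from the example. Note that if the container runs as a non-root user, `chown` may fail inside the pod; in that case fix ownership from the host mount path instead (often `touch` already creates the files with the right owner):

```sh
# All names and segment numbers are from the example above; substitute your own.
NS=monitoring
POD=prometheus-k8s-0

# Step 1: remove the corrupted segment.
kubectl exec -n "$NS" "$POD" -c prometheus -- rm /prometheus/wal/00018151

# Step 2: remove every older segment (repeat for each one).
kubectl exec -n "$NS" "$POD" -c prometheus -- rm /prometheus/wal/00018150

# Step 3: recreate the deleted segments as empty files.
kubectl exec -n "$NS" "$POD" -c prometheus -- touch /prometheus/wal/00018150 /prometheus/wal/00018151

# Step 4: match ownership and permissions to the surviving segments
# (may fail if the container is non-root; do it on the host mount instead).
kubectl exec -n "$NS" "$POD" -c prometheus -- chown 1000:2000 /prometheus/wal/00018150 /prometheus/wal/00018151
kubectl exec -n "$NS" "$POD" -c prometheus -- chmod g+w /prometheus/wal/00018150 /prometheus/wal/00018151

# Step 5: restart the pod; the StatefulSet controller recreates it.
kubectl delete pod -n "$NS" "$POD"
```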
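For the temporary memory increase in step 6, one option with prometheus-operator is to patch the Prometheus custom resource rather than the StatefulSet (which the operator would reconcile back). The object name "k8s", the namespace, and the 8Gi values are hypothetical examples, not a sizing recommendation; there does not appear to be a CRD field for disabling the probes, so that part is not sketched here.

```sh
# Hypothetical Prometheus object name "k8s" and namespace "monitoring";
# 8Gi is an arbitrary example value.
kubectl patch prometheus k8s -n monitoring --type merge -p \
  '{"spec":{"resources":{"requests":{"memory":"8Gi"},"limits":{"memory":"8Gi"}}}}'
```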

I do not know if this is good practice, though.

Should I open a pull request to extend the PrometheusTSDBCompactionsFailing runbook?