prometheus-community / helm-charts

Prometheus community Helm charts
Apache License 2.0

[kube-prometheus-stack] prometheus pod not starting due to endless WAL recovery loop #3134

Open thomas-vt opened 1 year ago

thomas-vt commented 1 year ago

Describe the bug

Issue is similar to https://github.com/prometheus-operator/prometheus-operator/issues/3391

It's stuck in an endless WAL recovery loop and failing the startup probe.

We have tried adding the following to values.yaml, but it does not take effect:

livenessProbe:
  failureThreshold: 1000
readinessProbe:
  failureThreshold: 1000
startupProbe:
  failureThreshold: 1000

What's your helm version?

v3.11.2

What's your kubectl version?

v1.25.2

Which chart?

kube-prometheus-stack

What's the chart version?

0.57.0

What happened?

The Prometheus pod was terminated and is unable to start up.

What did you expect to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

No response

Enter the command that you executed that is failing/misfunctioning.

kubectl -n <namespace> rollout restart statefulset prometheus-kube-prometheus-stack-prometheus

Anything else we need to know?

No response

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

xgt001 commented 1 year ago

I think you are running into the following issue: https://github.com/prometheus/prometheus/issues/6934. As a workaround, allocate slightly more memory to Prometheus during startup by raising the limits, something along the lines of:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: kube-prometheus-stack-prometheus
spec:
  ...
  resources:
    limits:
      cpu: 3072m
      memory: 18000Mi  # dramatically higher, to allow some breathing room; or remove the limit entirely
    requests:
      cpu: 2048m
      memory: 4096Mi
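
Since this thread is about the chart, the same resources can usually be set via values.yaml rather than by editing the Prometheus CR directly; a minimal sketch, assuming the standard kube-prometheus-stack values layout (the numbers are illustrative):

prometheus:
  prometheusSpec:
    resources:
      limits:
        cpu: 3072m
        memory: 18000Mi  # or drop the memory limit entirely to give WAL replay headroom
      requests:
        cpu: 2048m
        memory: 4096Mi
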
nilsbillo commented 1 year ago

This is still an issue. I had the same problem: the Prometheus container gets SIGTERMed because it takes too long to start while replaying WAL files. The Helm chart does not provide any setting to modify the startupProbe, which seems to default to 15 minutes.
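
For context on where the 15 minutes comes from: the startup probe the operator generates for the prometheus container looks roughly like this (a sketch from the operator defaults as I understand them, not the chart's literal output):

startupProbe:
  httpGet:
    path: /-/ready
    port: web
  periodSeconds: 15
  timeoutSeconds: 3
  failureThreshold: 60  # 60 failures x 15 s period = 900 s, i.e. the ~15 min budget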

Deleting the WAL files "fixed it", but it's not a solution.
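
For anyone forced into that workaround, it is usually done by exec'ing into the pod and removing the WAL directory (a sketch; the namespace, pod name, and /prometheus mount path are the chart's usual defaults and may differ in your setup):

kubectl -n monitoring exec prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- rm -rf /prometheus/wal

If the container crash-loops too quickly to exec into, mounting the same PVC from a temporary pod is the usual fallback.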

vBitza commented 1 year ago

Same issue here. We increased the memory limit to 32 GB and still couldn't recover Prometheus; we had to delete the previous data.

Shahard2 commented 1 year ago

We still face this issue. What can we do to fix it?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

a0s commented 10 months ago

The issue is still here.

iateya commented 10 months ago

I have the same issue. Chart name: prometheus, chart version: 25.1.0.

cody-amtote commented 7 months ago

Same issue. The pod's limits are in the mid-20 Gi range, and I can't even shell into the pod to delete the WAL.

SensoryDeprivation commented 6 months ago

I'm experiencing the same issue: in the case of a redeployment (i.e. a config change), the pod will often go into CrashLoopBackOff with the error:

level=error err="opening storage failed: get segment range: segments are not sequential"

and, reliably, the WAL folder will contain files starting from 0:

[screenshot: WAL directory listing]

Our setup is Prometheus deployed on AKS with azureblob-fuse-premium persistent storage, and quite high resources:

resources:
  limits:
    cpu: 2
    memory: 18Gi
  requests:
    cpu: 100m
    memory: 4Gi

jkroepke commented 6 months ago

Prometheus deployed on AKS with azureblob-fuse-premium

That storage type is not supported by Prometheus.

poornima-krishnasamy commented 6 months ago

You can increase the startupProbe budget by setting maximumStartupDurationSeconds if you are using the Helm chart:

https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml#L3993C5-L3993C34
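
In values.yaml terms, a minimal sketch (assuming the standard kube-prometheus-stack layout; the one-hour figure is illustrative):

prometheus:
  prometheusSpec:
    # Passed through to the Prometheus CR; the operator sizes the generated
    # startup probe from this, so WAL replay gets up to an hour before the
    # kubelet gives up. The operator requires a value of at least 60.
    maximumStartupDurationSeconds: 3600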