grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

MimirRolloutStuck Alerts false positive #8877

Open bjorns163 opened 4 months ago

bjorns163 commented 4 months ago

Describe the bug

It seems that the MimirRolloutStuck monitoring alert sometimes fires incorrectly. Following the investigation steps in the linked runbook turns up no issues.

Here are the outputs:

oc get pods | grep -E 'store-gateway|ingester'
mimir-ingester-zone-a-0                    1/1     Running   0             2d2h
mimir-ingester-zone-a-1                    1/1     Running   0             2d3h
mimir-ingester-zone-a-2                    1/1     Running   0             2d3h
mimir-ingester-zone-a-3                    1/1     Running   0             2d3h
mimir-ingester-zone-b-0                    1/1     Running   0             2d3h
mimir-ingester-zone-b-1                    1/1     Running   0             2d2h
mimir-ingester-zone-b-2                    1/1     Running   0             2d3h
mimir-ingester-zone-b-3                    1/1     Running   0             2d3h
mimir-ingester-zone-c-0                    1/1     Running   0             2d2h
mimir-ingester-zone-c-1                    1/1     Running   0             2d3h
mimir-ingester-zone-c-2                    1/1     Running   0             2d2h
mimir-ingester-zone-c-3                    1/1     Running   0             2d3h
mimir-store-gateway-zone-a-0               1/1     Running   0             2d3h
mimir-store-gateway-zone-b-0               1/1     Running   0             2d3h
mimir-store-gateway-zone-c-0               1/1     Running   0             2d2h

Here the statefulset of the store-gateway:

oc describe statefulset mimir-store-gateway | grep -E 'mimir-store-gateway-zone|Events|Pods Status'
Name:               mimir-store-gateway-zone-a
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-store-gateway-zone-a
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>
Name:               mimir-store-gateway-zone-b
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-store-gateway-zone-b
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>
Name:               mimir-store-gateway-zone-c
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-store-gateway-zone-c
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>

Here the statefulset of the ingester:

oc describe statefulset mimir-ingester | grep -E 'mimir-ingester-zone|Events|Pods Status'
Name:               mimir-ingester-zone-a
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-ingester-zone-a
Pods Status:        4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>
Name:               mimir-ingester-zone-b
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-ingester-zone-b
Pods Status:        4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>
Name:               mimir-ingester-zone-c
Annotations:        argocd.argoproj.io/tracking-id: mimir:apps/StatefulSet:namespace-mimir/mimir-ingester-zone-c
Pods Status:        4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Events:                         <none>

Yet the alerts are firing: [screenshot of the firing MimirRolloutStuck alerts]

Digging into the metrics and checking all values, kube_statefulset_status_update_revision appears to be the issue. The two affected StatefulSets have a value of None rather than 1, which seems to be what triggers the alert. [screenshot of the metric values]

Looking into the issue a bit, I found a topic explaining that this is down to the update strategy being OnDelete. The value is not always None, but from time to time, after patching the cluster or updating values for the StatefulSet, it becomes None.
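The failure mode can be sketched in Python, assuming the alert roughly implements a PromQL `unless` between kube_statefulset_status_current_revision and kube_statefulset_status_update_revision (the actual expression in the Mimir mixin may include additional clauses; the revision values below are illustrative):

```python
# Sketch: how a transiently-absent update-revision sample can make a
# "rollout stuck" check fire even though every pod is healthy.
# Metric names mirror kube-state-metrics; the real alert expression
# may differ in detail.

def stuck_statefulsets(current_revision, update_revision):
    """Mimic PromQL `current unless update`: return StatefulSets that
    have a current-revision sample but no matching update-revision sample."""
    return sorted(set(current_revision) - set(update_revision))

# Healthy state: both metrics are exported for every StatefulSet.
current = {"mimir-ingester-zone-a": "rev-1", "mimir-store-gateway-zone-a": "rev-1"}
update = {"mimir-ingester-zone-a": "rev-1", "mimir-store-gateway-zone-a": "rev-1"}
print(stuck_statefulsets(current, update))  # []

# After patching, update_revision briefly disappears ("None" in the
# dashboard) for one StatefulSet -> the check matches: a false positive.
update_missing = {"mimir-store-gateway-zone-a": "rev-1"}
print(stuck_statefulsets(current, update_missing))  # ['mimir-ingester-zone-a']
```

This illustrates why the alert can fire with no rollout in progress: the trigger is a missing series, not an actual revision mismatch.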

Expected behavior

No alert in this case.

Environment

bjorns163 commented 4 months ago

Additionally, kube_statefulset_replicas sometimes shows up as none:

mimir-ingester-zone-a = none
mimir-ingester-zone-b = none
mimir-ingester-zone-c = none
mimir-store-gateway-zone-a = none
mimir-store-gateway-zone-b = none
mimir-store-gateway-zone-c = none
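One way to confirm this is to query Prometheus for kube_statefulset_replicas and diff the returned series against the StatefulSets you expect. A minimal sketch of that check, operating on a parsed `/api/v1/query` response (the response shape follows the Prometheus HTTP API; the sample payload below is illustrative):

```python
# Sketch: find StatefulSets for which a kube-state-metrics series is
# absent, given a parsed Prometheus /api/v1/query JSON response for
# kube_statefulset_replicas. The names come from this cluster.

def missing_series(response, expected):
    """Return expected StatefulSet names with no sample in the response."""
    present = {r["metric"]["statefulset"] for r in response["data"]["result"]}
    return sorted(set(expected) - present)

expected = [
    "mimir-ingester-zone-a", "mimir-ingester-zone-b", "mimir-ingester-zone-c",
    "mimir-store-gateway-zone-a", "mimir-store-gateway-zone-b",
    "mimir-store-gateway-zone-c",
]

# Example response where only one of the six series is being exported.
response = {"data": {"result": [
    {"metric": {"statefulset": "mimir-ingester-zone-a"}, "value": [0, "4"]},
]}}

print(missing_series(response, expected))  # the five absent StatefulSets
```

Any name this prints corresponds to a StatefulSet whose metric would render as "none" in a dashboard, matching the behavior above.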