grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0

Operator assumes all pods are ready when there are no pods at all #126

Closed by colega 7 months ago

colega commented 7 months ago

We're currently checking hasStatefulSetNotReadyPods before considering the zone reconciled:

https://github.com/grafana/rollout-operator/blob/0b99175d41e44ae5b3b6f871e37048c50cb42500/pkg/controller/controller.go#L382-L410

However, we've seen a case where all pods of a statefulset had been deleted, but no new pods had been created yet. This caused the listNotReadyPodsByStatefulSet method to return no pods at all, as there were no pods in any state.

This caused an outage, as the next zone was terminated immediately, before any pods were available in the first one.

I think the code should first check that there are at least as many pods as the replicas desired by the statefulset, and only then check whether any of them are unready.
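A minimal sketch of the proposed check, to make the ordering concrete. This is not the operator's actual code: the `Pod` and `StatefulSet` types and the `isReconciled` function are hypothetical, simplified stand-ins for the real Kubernetes objects, and the sketch only assumes that the desired replica count and the current pod list are both available.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name  string
	Ready bool
}

// StatefulSet is a hypothetical, simplified stand-in for a StatefulSet
// together with the pods that currently exist for it.
type StatefulSet struct {
	Name            string
	DesiredReplicas int
	Pods            []Pod
}

// isReconciled reports whether the zone can be treated as reconciled.
// It first requires that at least DesiredReplicas pods exist, and only
// then looks for unready pods. An empty pod list is therefore never
// reported as "all ready" while replicas are still expected.
func isReconciled(sts StatefulSet) (bool, string) {
	if len(sts.Pods) < sts.DesiredReplicas {
		return false, fmt.Sprintf("only %d of %d pods exist", len(sts.Pods), sts.DesiredReplicas)
	}
	for _, p := range sts.Pods {
		if !p.Ready {
			return false, fmt.Sprintf("pod %s is not ready", p.Name)
		}
	}
	return true, ""
}

func main() {
	// All pods deleted, none recreated yet: the case described above.
	empty := StatefulSet{Name: "zone-a", DesiredReplicas: 3}
	ok, reason := isReconciled(empty)
	fmt.Println(ok, reason)

	// All desired pods exist and are ready.
	healthy := StatefulSet{
		Name:            "zone-a",
		DesiredReplicas: 2,
		Pods:            []Pod{{Name: "zone-a-0", Ready: true}, {Name: "zone-a-1", Ready: true}},
	}
	ok, reason = isReconciled(healthy)
	fmt.Println(ok, reason)
}
```

With a check in this order, a statefulset whose pods have all been deleted is reported as not reconciled, so the next zone would not be touched until the first one has its pods back and ready.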