spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.

Connections to sentinel fail when sentinel pods in `Completed` state are present #639

Closed: andrewchinnadorai closed this issue 10 months ago

andrewchinnadorai commented 11 months ago

Expected behaviour

Provided there are enough Sentinel pods running as defined in the RedisFailover, users should be able to connect to Sentinel and retrieve the current master.
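
For context, retrieving the master from Sentinel amounts to a single `SENTINEL get-master-addr-by-name` call. A minimal sketch using the go-redis client (the service hostname and master name are taken from this report; everything else is an illustrative assumption and not part of the operator):

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Connect to the Sentinel service created by the operator.
	sentinel := redis.NewSentinelClient(&redis.Options{
		Addr: "rfs-redis.redis.svc.cluster.local:26379",
	})
	defer sentinel.Close()

	// Ask Sentinel for the current master of the monitored group "mymaster".
	addr, err := sentinel.GetMasterAddrByName(ctx, "mymaster").Result()
	if err != nil {
		panic(err) // this is the call that fails in the scenario below
	}
	fmt.Printf("master is at %s:%s\n", addr[0], addr[1])
}
```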

Actual behaviour

Connections to Sentinel fail when there are enough running Sentinel pods (e.g. 3) but additional Sentinel pods in a `Completed` state are also present in the cluster:

Unable to connect to [redis-sentinel://******************************@rfs-redis.redis.svc.cluster.local?sentinelMasterId=mymaster]

Steps to reproduce the behaviour

This occurs when scheduling the Sentinel pods on preemptible/spot instances. When a node is terminated due to preemption and a Sentinel pod was running on it, a replacement pod is spun up on a new node; the old pod shuts down but remains in the cluster in a `Completed` state. From what I can gather, `IsSentinelRunning` calls `GetDeploymentPods`, which returns all the pods, including the completed ones, and then uses `AreAllRunning` to check that every returned pod is running. The completed pods fail that check, so Sentinel (and therefore the cluster) is not reported as healthy. This is wrong: the appropriate number of Sentinel pods are running and healthy, and the additional, no-longer-active pod should be disregarded.
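
One way the check could tolerate such leftovers is to skip pods whose phase is `Succeeded` or `Failed` before requiring the rest to be `Running`. A rough sketch of that idea (hypothetical helper, not the operator's actual code; the package and function names are made up for illustration):

```go
package health

import corev1 "k8s.io/api/core/v1"

// enoughRunningPods reports whether at least `expected` pods are Running,
// ignoring pods that have already terminated (Succeeded or Failed), e.g.
// leftovers from preempted nodes that kubectl shows as "Completed".
func enoughRunningPods(pods []corev1.Pod, expected int) bool {
	running := 0
	for _, p := range pods {
		switch p.Status.Phase {
		case corev1.PodSucceeded, corev1.PodFailed:
			// Terminated pods are not serving traffic; skip them instead
			// of failing the health check because of them.
			continue
		case corev1.PodRunning:
			running++
		default:
			// Pending/Unknown pods simply don't count towards the total;
			// a stricter variant could return false for them.
		}
	}
	return running >= expected
}
```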

Environment

How are the pieces configured?

andrewchinnadorai commented 11 months ago

Proposed a fix in https://github.com/spotahome/redis-operator/pull/640