Expected behaviour
Provided there are enough Sentinel pods, as defined in the RedisFailover, users should be able to connect to and retrieve a master from Sentinel.
Actual behaviour
Connections to Sentinel fail when there are enough running Sentinel pods (i.e. 3) but additional Sentinel pods in a Completed state are also present in the cluster:
Unable to connect to [redis-sentinel://******************************@rfs-redis.redis.svc.cluster.local?sentinelMasterId=mymaster]
Steps to reproduce the behaviour
This occurs when scheduling the Sentinel pods on preemptible/spot instances. When a node is terminated due to preemption and a Sentinel pod was running on it, a replacement pod is spun up on a new node; the old pod shuts down but remains in the cluster in a Completed state. From what I can gather, when IsSentinelRunning calls GetDeploymentPods it returns all the pods, including those in a Completed state, and AreAllRunning then checks that every returned pod is in a Running state. The Completed pods fail that check, so Sentinel/the cluster is not reported as healthy. This is wrong: we have the appropriate number of Sentinel pods running and healthy, and the additional, no-longer-active pod should be disregarded.
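The fix I have in mind could look something like the sketch below: ignore pods whose phase is terminal (Completed shows up as the Succeeded phase) and only require that the desired number of replicas are Running. This is a minimal, self-contained illustration, not the operator's actual code; the Pod struct and enoughRunning name are made up here, whereas in the operator the pods would be corev1.Pod values returned by GetDeploymentPods.

```go
package main

import "fmt"

// Pod mirrors only the fields relevant to this check; in the operator
// these come from k8s.io/api/core/v1. Illustrative only.
type Pod struct {
	Name  string
	Phase string // "Running", "Succeeded" (shown as Completed), "Failed", ...
}

// enoughRunning reports whether at least replicas pods are Running,
// disregarding pods that have already terminated. This is the behaviour
// proposed instead of AreAllRunning's "every pod must be Running" check.
func enoughRunning(pods []Pod, replicas int) bool {
	running := 0
	for _, p := range pods {
		if p.Phase == "Running" {
			running++
		}
	}
	return running >= replicas
}

func main() {
	pods := []Pod{
		{"rfs-redis-a", "Running"},
		{"rfs-redis-b", "Running"},
		{"rfs-redis-c", "Running"},
		{"rfs-redis-old", "Succeeded"}, // leftover from a preempted node
	}
	fmt.Println(enoughRunning(pods, 3)) // prints true
}
```

With this approach the leftover Completed pod no longer causes the cluster to be reported as unhealthy, while a genuine shortfall of Running pods still does.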
Environment
How are the pieces configured?