As described in #639, when the Sentinel pods are scheduled on preemptible/spot instances and a node running a sentinel pod is terminated due to preemption, a replacement pod is spun up on a new node, but the old pod shuts down and remains in the cluster in a Completed state.
The checks performed by IsSentinelRunning fetch a list of all pods for the deployment indiscriminately and then verify that every pod returned is in a running state. In our case this can mean we have pods in a Completed state that fail the check, even though the desired number of sentinel pods are running and healthy. Because of this the cluster & Sentinel are deemed to not be healthy.
There are a couple of ways this could be fixed, but in this PR I've done what I think is the simplest fix: alter the AreAllRunning function to take an expectedRunningPods integer as a parameter, and change the loop to increment a runningPods counter when a pod is in a running state, or skip to the next item (leaving the counter unchanged) when it is not. We can then return the boolean result of checking whether runningPods is equal to expectedRunningPods. By doing this we no longer care about any other cluster/sentinel pods which may be in a non-running state in the cluster, as long as the expected number of pods are running.
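The change described above can be sketched roughly as follows. This is a minimal illustration with simplified stand-in types, not the operator's actual code: the real function operates on Kubernetes `corev1.Pod` objects and its exact signature in the repo may differ.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod, carrying only the
// fields needed for this sketch.
type Pod struct {
	Name  string
	Phase string // e.g. "Running", "Succeeded" (shown as Completed), "Pending"
}

// AreAllRunning counts only the pods in the Running phase, skipping
// any others (such as Completed pods left behind after a node
// preemption), and reports whether that count matches the expected
// number of running pods.
func AreAllRunning(pods []Pod, expectedRunningPods int) bool {
	runningPods := 0
	for _, pod := range pods {
		if pod.Phase != "Running" {
			// Skip non-running pods instead of failing the whole check.
			continue
		}
		runningPods++
	}
	return runningPods == expectedRunningPods
}

func main() {
	pods := []Pod{
		{Name: "sentinel-0", Phase: "Running"},
		{Name: "sentinel-1", Phase: "Running"},
		{Name: "sentinel-2", Phase: "Running"},
		{Name: "sentinel-old", Phase: "Succeeded"}, // leftover Completed pod
	}
	fmt.Println(AreAllRunning(pods, 3)) // the Completed pod is ignored
}
```

With the previous behaviour, the leftover Completed pod would have failed the check; here it is simply skipped, so the check passes as long as the expected number of pods are running.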
Fixes https://github.com/spotahome/redis-operator/issues/639