Closed hoyhbx closed 1 year ago
@hoyhbx I tried to simulate the case, but I always see only one pod taken down, and it stays in a Pending state because of the affinity; the operator does not really proceed to do anything, which is what is expected. Am I missing something here?
@samof76 , I tried to reproduce it with the latest version of the operator, and it seems that the latest version updates the pods one by one. This behavior prevents the entire cluster from going down.
However, I still observe that after changing the affinity rule back to the correct one, the operator still cannot recover. It is stuck with one pod pending.
This issue is stale because it has been open for 45 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi @samof76 , I can still encounter this issue. The only difference between the newest version and the old one is that the Redis cluster does not become entirely unavailable; instead, just one replica is unavailable. The main issue is that even when I try to manually recover the cluster by changing the CR, the operator prevents the recovery.
Expected behaviour
redis-operator should avoid taking the entire Redis cluster down when an unsatisfiable Affinity rule is specified.
After the users realize that the cluster is down and revert the Affinity rule to a satisfiable one, redis-operator should successfully recover the cluster.
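For reference, an affinity rule along these lines is unsatisfiable when no node carries the required label (the label key here is hypothetical, for illustration only), and leaves every pod stuck in Pending:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nonexistent-label   # assumed: no node in the cluster has this label
              operator: In
              values:
                - "true"
```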
Actual behaviour
What is happening? Are all the pieces created? Can you access the service?
The entire Redis cluster goes down when an unsatisfiable Affinity rule is specified, and even after reverting the rule, redis-operator is not able to recover the cluster.
Steps to reproduce the behaviour
Describe step by step what you've done to get to this point
Environment
How are the pieces configured?
Logs
Please, add the debugging logs. In order to be able to gather them, add
-debug
flag when running the operator.
operator.log
From the log, we found that redis-operator noticed there are 0 masters in the cluster, so it tries to assign the oldest pod as the master before updating the pods. However, since all pods are down, redis-operator always fails to establish a connection with the pod, so the master assignment always fails. And because redis-operator fails to assign a master, it does not roll out the updates to the pods, so the Redis pods never become ready. This creates a circular dependency that prevents redis-operator from recovering the cluster.
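The circular dependency described in the log analysis can be sketched as follows. This is a hypothetical simplification for illustration; the function and field names are invented and do not correspond to the operator's actual Go code:

```python
def reconcile(pods):
    """Sketch of the deadlocked reconcile loop: no master can be assigned
    while all pods are down, and pods stay down until updates roll out."""
    masters = [p for p in pods if p["role"] == "master" and p["ready"]]
    if not masters:
        # Step 1: try to promote the oldest pod to master.
        oldest = min(pods, key=lambda p: p["created"])
        if not oldest["ready"]:
            # Connection to the pod fails (it is stuck Pending on the
            # unsatisfiable affinity), so master assignment fails and
            # the reconcile cycle ends here -- every time.
            return "master assignment failed; updates not rolled out"
        oldest["role"] = "master"
    # Step 2: only reached once a master exists -- roll out the pod
    # updates (e.g. the reverted affinity rule) that would make the
    # pods schedulable and ready again.
    return "rolled out updates"

# All pods down: step 2 is never reached, so the cluster never recovers.
pods = [
    {"role": "slave", "ready": False, "created": 1},
    {"role": "slave", "ready": False, "created": 2},
]
print(reconcile(pods))
```

The sketch shows why reverting the CR alone does not help: the fix lives in step 2, but the loop exits at step 1 on every cycle.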