spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Unsatisfiable Affinity rule causes the entire redis cluster to be down, and redis-operator fails to recover even when the CR is reverted to correct Affinity rule #552

Closed hoyhbx closed 1 year ago

hoyhbx commented 1 year ago

Expected behaviour

redis-operator should avoid causing the entire redis cluster to be down when an unsatisfiable Affinity rule is specified.

After the users realize that the cluster is down and revert the Affinity rule back to a satisfiable one, redis-operator should successfully recover the cluster.

Actual behaviour

What is happening? Are all the pieces created? Can you access the service?

The entire redis cluster goes down when an unsatisfiable Affinity rule is specified, and even after reverting the rule, redis-operator is not able to recover the redis cluster.

Steps to reproduce the behaviour

Describe step by step what you have done to get to this point

  1. Deploy a redis cluster with the example CR

    apiVersion: databases.spotahome.com/v1
    kind: RedisFailover
    metadata:
      name: test-cluster
    spec:
      redis:
        customConfig:
        - maxclients 100
        - hz 50
        - timeout 60
        - tcp-keepalive 60
        - client-output-buffer-limit normal 0 0 0
        - client-output-buffer-limit slave 1000000000 1000000000 0
        - client-output-buffer-limit pubsub 33554432 8388608 60
        exporter:
          enabled: true
        hostNetwork: false
        imagePullPolicy: IfNotPresent
        replicas: 3
  2. Change the Affinity rule of redis to something unsatisfiable at the moment in the cluster

    apiVersion: databases.spotahome.com/v1
    kind: RedisFailover
    metadata:
      name: test-cluster
    spec:
      redis:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - kind-worker
        customConfig:
        - maxclients 100
        - hz 50
        - timeout 60
        - tcp-keepalive 60
        - client-output-buffer-limit normal 0 0 0
        - client-output-buffer-limit slave 1000000000 1000000000 0
        - client-output-buffer-limit pubsub 33554432 8388608 60
        exporter:
          enabled: true
        hostNetwork: false
        imagePullPolicy: IfNotPresent
        replicas: 3
  3. Observe that all three replicas of redis are down
  4. Revert the CR back to the one in step 1
  5. Observe that the redis cluster is still down

Environment

How are the pieces configured?

Logs

Please, add the debugging logs. In order to gather them, add the -debug flag when running the operator. operator.log

From the log, we found that redis-operator detects that there are no masters in the cluster, so it tries to assign the oldest pod as the master before updating the pods. However, since all pods are down, redis-operator can never establish a connection to that pod, so the master assignment always fails. And because the master assignment fails, the operator never rolls out the updated (reverted) spec to the pods, so the redis pods never become ready. This creates a circular dependency that prevents redis-operator from recovering the cluster.
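The deadlock described above can be sketched as a small simulation (a simplified model, not the operator's actual Go code; all names, including the `rfr-test-cluster-N` pod names, are illustrative):

```python
# Simplified model of the reconcile deadlock: master assignment requires a
# reachable pod, but pods only become reachable after the rollout that is
# gated on master assignment.

class Pod:
    def __init__(self, name, ready=False):
        self.name = name
        self.ready = ready  # unschedulable pods are never ready


def set_oldest_as_master(pods):
    # The operator must connect to redis inside a pod to promote it;
    # with every pod down, the connection always fails.
    if not any(p.ready for p in pods):
        raise ConnectionError("cannot reach any redis pod")
    return min(pods, key=lambda p: p.name)


def reconcile(pods, max_loops=5):
    """One reconcile attempt per loop; returns True if the update rolled out."""
    for _ in range(max_loops):
        try:
            set_oldest_as_master(pods)  # step 1: ensure a master exists
        except ConnectionError:
            continue                    # step 2 never runs: pods stay down
        for p in pods:                  # step 2: roll out the reverted spec
            p.ready = True
        return True
    return False


pods = [Pod(f"rfr-test-cluster-{i}") for i in range(3)]
print(reconcile(pods))  # -> False: each step blocks on the other forever
```

Under this model, reconciliation never converges no matter how many loops run, which matches the behavior seen in the attached log.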

samof76 commented 1 year ago

@hoyhbx I tried to simulate the case... but I always see only one pod taken down, and that pod stays in a Pending state because of the affinity; the operator has not really proceeded to do anything, which is what I expect. Am I missing something here?

hoyhbx commented 1 year ago

@samof76 , I tried to reproduce it with the latest version of the operator, and it seems that the latest version updates the pods one by one. This behavior prevents the entire cluster from going down.

However, I still observe that after changing the affinity rule back to the correct one, the operator still cannot recover: it is stuck with one pod Pending.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

hoyhbx commented 1 year ago

Hi @samof76 , I can still encounter this issue. The only difference between the newest version and the old one is that the Redis cluster does not become entirely unavailable; instead, just one replica is unavailable. The main issue is that even when I want to manually recover the cluster by changing the CR, the operator prevents the recovery.