spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.

Sentinel misconfigured after availability zone failure #601

Closed · vramperez closed this 1 year ago

vramperez commented 1 year ago

Expected behaviour

After the failure of the availability zone where the Redis master is located, and once the failover process completes, all sentinels should have the new Redis master correctly configured.

Actual behaviour

The new sentinel pods scheduled in the healthy availability zones, replacing the sentinel pods terminated in the failed zone, have the Redis master misconfigured as 127.0.0.1.

Steps to reproduce the behaviour

A Kubernetes cluster consisting of 1 master and 3 workers, with each worker in a different availability zone.

Redis deployment consisting of:

[Screenshot from 2023-05-16 11-35-32]

Simulate the downtime of the availability zone in which the redis master is located (e.g. shut down worker2 because the master is rfr-redis-0).

After the failover process, since there is still a quorum, a new master is elected and Redis itself keeps working as expected.

[Screenshot from 2023-05-16 11-43-00]

However, the new sentinel that is deployed to replace the sentinel lost in the downed zone has the master misconfigured as 127.0.0.1 instead of 10.42.76.3.

k exec -it rfs-redis-6c5758686f-bmbf2 -- redis-cli -p 26379
Defaulted container "sentinel" out of: sentinel, sentinel-config-copy (init)
127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"

Reading other issues, I understand that this sentinel reports 127.0.0.1 as the master because that is its initial bootstrap configuration. The problem then lies in the redis-operator, which never configures the real master on this sentinel (I think because it detects that the previous master is down and then does nothing else).

time="2023-05-16T09:36:39Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.42.244.15" src="checker.go:113"
time="2023-05-16T09:36:52Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.42.244.15" src="checker.go:189"
time="2023-05-16T09:36:52Z" level=info msg="Update pod label, namespace: internal-dev-ci, pod name: rfr-redis-0, labels: map[redisfailovers-role:slave]" service=k8s.pod src="check.go:101"
time="2023-05-16T09:37:04Z" level=error msg="error while getting masterIP : Failed to get info replication while querying redis instance 10.42.244.15" src="check.go:130"
time="2023-05-16T09:37:04Z" level=error msg="Get slave of master failed, maybe this node is not ready, pod ip: 10.42.244.15" src="checker.go:194"
time="2023-05-16T09:37:04Z" level=warning msg="Slave not associated to master: dial tcp 10.42.244.15:6379: connect: network is unreachable" namespace=internal-dev-ci redisfailover=redis src="handler.go:79"
time="2023-05-16T09:37:04Z" level=info msg="Making pod rfr-redis-0 slave of 10.42.76.3" namespace=internal-dev-ci redisfailover=redis service=redis.healer src="checker.go:198"

Is this a design decision of the redis-operator (if it cannot contact a replica, it does nothing else), or is it a bug in the redis-operator?

Although Redis itself still works correctly, if you use the sentinel service to ask for the master, it sometimes returns the correct master IP and sometimes 127.0.0.1, depending on which sentinel endpoint answers:

k exec -it rfr-redis-2 -- redis-cli -h rfs-redis -p 26379 
rfs-redis:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"

k exec -it rfr-redis-2 -- redis-cli -h rfs-redis -p 26379
rfs-redis:26379> sentinel get-master-addr-by-name mymaster
1) "10.42.76.3"
2) "6379"

NOTE: For the time being, I have worked around this behaviour by configuring a TopologySpreadConstraint with DoNotSchedule on the sentinel pods, which prevents new sentinel pods from being scheduled when an availability zone is lost.
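
For reference, the constraint boils down to the following fields, sketched here with the upstream corev1 types; the label selector and the way the constraint is attached to the sentinel pods are assumptions that depend on the RedisFailover spec in use:

package main

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sentinelSpread sketches the workaround: keep sentinel pods spread across
// zones and refuse to schedule replacements that would break the spread
// (DoNotSchedule), instead of letting them land in the surviving zones with a
// misconfigured 127.0.0.1 master. The labels below are an assumption and must
// match the sentinel pods in your cluster.
var sentinelSpread = corev1.TopologySpreadConstraint{
    MaxSkew:           1,
    TopologyKey:       "topology.kubernetes.io/zone",
    WhenUnsatisfiable: corev1.DoNotSchedule,
    LabelSelector: &metav1.LabelSelector{
        MatchLabels: map[string]string{"app.kubernetes.io/component": "sentinel"},
    },
}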

Environment

Logs

new_sentinel.txt redis-operator.txt sentinel1.txt sentinel2.txt

MrXianSen commented 1 year ago

I have the same issue, is there any progress?

ewener commented 1 year ago

version: v1.2.0-rc1

I added an if condition to the SetMasterOnAll method to avoid this problem. https://github.com/spotahome/redis-operator/blob/56e03270aaec9fd14e6d00988983c1dea387b342/operator/redisfailover/service/heal.go#LL166C4-L166C4

// Only reconfigure pods that are Running and not already being deleted.
if pod.Status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
    r.logger.Debugf("Making pod %s slave of %s", pod.Name, masterIP)
    if err := r.redisClient.MakeSlaveOfWithPort(pod.Status.PodIP, masterIP, port, password); err != nil {
        r.logger.Errorf("Make slave failed, slave ip: %s, master ip: %s, error: %v", pod.Status.PodIP, masterIP, err)
        return err
    }
}

steven0lisa commented 1 year ago

I have the same issue, is there any progress?

wl4g commented 1 year ago

I have the same issue, is there any progress?

vramperez commented 1 year ago

I have tried this solution, and it seems to work correctly.

More specifically, I added this condition inside the loop of the SetMasterOnAll method of heal.go:

if pod.Status.Phase != v1.PodRunning || pod.DeletionTimestamp != nil {
    r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Warnf("Ignoring pod %s because it is not healthy", pod.Name)
    continue
}

The complete method looks like this:

// SetMasterOnAll puts all redis nodes as a slave of a given master
func (r *RedisFailoverHealer) SetMasterOnAll(masterIP string, rf *redisfailoverv1.RedisFailover) error {
    ssp, err := r.k8sService.GetStatefulSetPods(rf.Namespace, GetRedisName(rf))
    if err != nil {
        return err
    }

    password, err := k8s.GetRedisPassword(r.k8sService, rf)
    if err != nil {
        return err
    }

    port := getRedisPort(rf.Spec.Redis.Port)
    for _, pod := range ssp.Items {
        // During this configuration process, if a new master has been selected, bail out.
        isMaster, err := r.redisClient.IsMaster(masterIP, port, password)
        if err != nil || !isMaster {
            r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Errorf("check master failed maybe this node is not ready(ip changed), or sentinel made a switch: %s", masterIP)
            return err
        } else {
            if pod.Status.Phase != v1.PodRunning || pod.DeletionTimestamp != nil {
                r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Warnf("Ignoring pod %s because it is not healthy", pod.Name)
                continue
            }

            if pod.Status.PodIP == masterIP {
                continue
            }
            r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Infof("Making pod %s slave of %s", pod.Name, masterIP)
            if err := r.redisClient.MakeSlaveOfWithPort(pod.Status.PodIP, masterIP, port, password); err != nil {
                r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Errorf("Make slave failed, slave ip: %s, master ip: %s, error: %v", pod.Status.PodIP, masterIP, err)
                return err
            }

            err = r.setSlaveLabelIfNecessary(rf.Namespace, pod)
            if err != nil {
                return err
            }
        }
    }
    return nil
}

With this change, the operator ignores the Redis pod that is stuck in Terminating and configures the new sentinel properly.

If anyone is interested I can do the PR but I need help with the tests.
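
For reference, a minimal, hypothetical sketch of a unit test covering just the skip condition; the helper podIsHealthy, the package name, and the extraction of the condition out of SetMasterOnAll are assumptions, not existing code in the repo:

package service

import (
    "testing"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podIsHealthy is a hypothetical helper holding the skip condition from the
// patched SetMasterOnAll, extracted so it can be unit tested in isolation.
func podIsHealthy(pod corev1.Pod) bool {
    return pod.Status.Phase == corev1.PodRunning && pod.DeletionTimestamp == nil
}

func TestPodIsHealthy(t *testing.T) {
    now := metav1.Now()
    tests := []struct {
        name string
        pod  corev1.Pod
        want bool
    }{
        {"running pod", corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}, true},
        {"pending pod", corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodPending}}, false},
        {"terminating pod", corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &now},
            Status:     corev1.PodStatus{Phase: corev1.PodRunning},
        }, false},
    }
    for _, tt := range tests {
        if got := podIsHealthy(tt.pod); got != tt.want {
            t.Errorf("%s: podIsHealthy() = %v, want %v", tt.name, got, tt.want)
        }
    }
}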

Mortega5 commented 1 year ago

Quoting vramperez's fix above (the skip condition added inside the loop of the SetMasterOnAll method of heal.go):

I have tested it in my cluster and it has worked.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

emahdij commented 1 year ago

I have the same issue, is there any progress?

vramperez commented 1 year ago

There has been no official response, and this issue was closed due to inactivity even though it is not fixed. I have not been able to check whether the last two releases fix it, but judging by the changelog, I think not. Can anyone confirm whether this issue is still present in the latest version?

In my case, we compiled our own image and fixed the bug as I indicated in my previous comment.