Closed vramperez closed 1 year ago
I have the same issue, is there any progress?
version: v1.2.0-rc1
I added an if condition to the SetMasterOnAll method to avoid this problem. https://github.com/spotahome/redis-operator/blob/56e03270aaec9fd14e6d00988983c1dea387b342/operator/redisfailover/service/heal.go#LL166C4-L166C4
```go
if pod.Status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
	r.logger.Debugf("Making pod %s slave of %s", pod.Name, masterIP)
	if err := r.redisClient.MakeSlaveOfWithPort(pod.Status.PodIP, masterIP, port, password); err != nil {
		r.logger.Errorf("Make slave failed, slave ip: %s, master ip: %s, error: %v", pod.Status.PodIP, masterIP, err)
		return err
	}
}
```
I have the same issue, is there any progress?
I have the same issue, is there any progress?
I have tried this solution, and it seems to work correctly. More specifically, I added this condition inside the loop of the SetMasterOnAll method in heal.go:

```go
if pod.Status.Phase != v1.PodRunning || pod.DeletionTimestamp != nil {
	r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Warnf("Ignoring pod %s because it is not healthy", pod.Name)
	continue
}
```
The complete method looks like this:
```go
// SetMasterOnAll puts all redis nodes as a slave of a given master
func (r *RedisFailoverHealer) SetMasterOnAll(masterIP string, rf *redisfailoverv1.RedisFailover) error {
	ssp, err := r.k8sService.GetStatefulSetPods(rf.Namespace, GetRedisName(rf))
	if err != nil {
		return err
	}
	password, err := k8s.GetRedisPassword(r.k8sService, rf)
	if err != nil {
		return err
	}
	port := getRedisPort(rf.Spec.Redis.Port)
	for _, pod := range ssp.Items {
		// If a new master is selected during this configuration process, bail out
		isMaster, err := r.redisClient.IsMaster(masterIP, port, password)
		if err != nil || !isMaster {
			r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Errorf("check master failed, maybe this node is not ready (ip changed), or sentinel made a switch: %s", masterIP)
			return err
		}

		// Skip pods that are not running or are being terminated
		if pod.Status.Phase != v1.PodRunning || pod.DeletionTimestamp != nil {
			r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Warnf("Ignoring pod %s because it is not healthy", pod.Name)
			continue
		}
		if pod.Status.PodIP == masterIP {
			continue
		}
		r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Infof("Making pod %s slave of %s", pod.Name, masterIP)
		if err := r.redisClient.MakeSlaveOfWithPort(pod.Status.PodIP, masterIP, port, password); err != nil {
			r.logger.WithField("redisfailover", rf.ObjectMeta.Name).WithField("namespace", rf.ObjectMeta.Namespace).Errorf("Make slave failed, slave ip: %s, master ip: %s, error: %v", pod.Status.PodIP, masterIP, err)
			return err
		}
		err = r.setSlaveLabelIfNecessary(rf.Namespace, pod)
		if err != nil {
			return err
		}
	}
	return nil
}
```
With this change, the operator ignores the redis pod that is stuck in Terminating and configures the new sentinel properly.
If anyone is interested I can open a PR, but I would need help with the tests.
I have tested it in my cluster and it has worked.
This issue is stale because it has been open for 45 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I have the same issue, is there any progress?
No official response, and this issue was closed due to inactivity, although it is not fixed. I have not been able to check whether the last two releases fix this, but judging by the changelog, I think not. Can anyone confirm whether this issue is still present in the latest version?
In my case, we compiled our own image and fixed the bug as I indicated in my previous comment.
Expected behaviour
After the failure of the availability zone hosting the redis master and the subsequent failover, all the sentinels should have the new redis master correctly configured.
Actual behaviour
The new sentinel pods, deployed in the healthy availability zones to replace the sentinel pods terminated in the failed zone, have the redis master misconfigured as 127.0.0.1.
Steps to reproduce the behaviour
A k8s cluster consisting of 1 master and 3 workers, each worker in a different availability zone:
Redis deployment consisting of:
Simulate the downtime of the availability zone in which the redis master is located (e.g. shut down worker2 because the master is rfr-redis-0).
After the failover process, as there is still a quorum, a new master is elected, and operation at the Redis level is as expected.
However, the new sentinel that is deployed, as a consequence of losing a sentinel in the downed zone, has the master misconfigured as 127.0.0.1, instead of 10.42.76.3.
Reading other issues, I understand that this sentinel has 127.0.0.1 as its master because that is the initial configuration. The problem then is in the redis-operator, which does not configure the master correctly in this sentinel (I think because it detects that the previous master is down and does nothing else).
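For context, a sentinel that has just started and has not yet been reconfigured typically carries a bootstrap monitor line like the following in its sentinel.conf (the master name, port, and quorum shown here are illustrative, not taken from the operator's actual template; only the 127.0.0.1 placeholder is the point):

```
sentinel monitor mymaster 127.0.0.1 6379 2
```

Until the operator (or another sentinel) rewrites this line with the real master IP, any client querying this sentinel gets 127.0.0.1 back.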
Is this a design decision of the redis-operator (if it cannot contact a replica, it does nothing else), or is it a bug in the redis-operator?
Although the redis cluster still works correctly, if you use the sentinel service to ask for the master, sometimes it gives you the correct IP of the master, and sometimes it returns 127.0.0.1, depending on which sentinel endpoint responds:
NOTE: For the time being, I have prevented this behaviour by configuring a TopologySpreadConstraint with DoNotSchedule in sentinel, to prevent new sentinel pods from being deployed when an availability zone is lost.

Environment
Logs
new_sentinel.txt redis-operator.txt sentinel1.txt sentinel2.txt