Environment
How are the pieces configured?
Redis Operator version: v1.2.4
Kubernetes version: v1.28.3+rke2r1
Kubernetes configuration used (eg: Is RBAC active?)
Logs
time="2023-11-15T01:50:23Z" level=info msg="Listening on :9710 for metrics exposure on URL /metrics" src="asm_amd64.s:1598"
2023-11-15T09:50:23.025713542+08:00 time="2023-11-15T01:50:23Z" level=info msg="running in leader election mode, waiting to acquire leadership..." leader-election-id=redis-operator/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="controller.go:231"
I1115 01:50:23.026872 1 leaderelection.go:245] attempting to acquire leader lease redis-operator/redis-failover-lease...
I1115 01:50:40.062229 1 leaderelection.go:255] successfully acquired lease redis-operator/redis-failover-lease
2023-11-15T09:50:40.062883098+08:00 time="2023-11-15T01:50:40Z" level=info msg="lead acquire, starting..." leader-election-id=redis-operator/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="asm_amd64.s:1598"
2023-11-15T09:50:40.062890471+08:00 time="2023-11-15T01:50:40Z" level=info msg="starting controller" controller-id=redisfailover operator=redisfailover service=kooper.controller src="controller.go:232"
time="2023-11-15T01:51:00Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.42.3.73" src="checker.go:113"
2023-11-15T09:51:00.373556047+08:00 time="2023-11-15T01:51:00Z" level=info msg="Update pod label, namespace: prod, pod name: rfr-redisfailover-persistent-1, labels: map[redisfailovers-role:slave]" service=k8s.pod src="check.go:102"
2023-11-15T09:51:20.478215616+08:00 time="2023-11-15T01:51:20Z" level=error msg="error while getting masterIP : Failed to get info replication while querying redis instance 10.42.3.73" src="check.go:131"
time="2023-11-15T01:51:20Z" level=error msg="Get slave of master failed, maybe this node is not ready, pod ip: 10.42.3.73" src="checker.go:194"
2023-11-15T09:51:20.478247383+08:00 time="2023-11-15T01:51:20Z" level=warning msg="Slave not associated to master: dial tcp 10.42.3.73:6379: i/o timeout" namespace=prod redisfailover=redisfailover-persistent src="handler.go:79"
time="2023-11-15T01:51:20Z" level=info msg="Making pod rfr-redisfailover-persistent-0 slave of 10.42.0.77" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
time="2023-11-15T01:51:20Z" level=info msg="Making pod rfr-redisfailover-persistent-1 slave of 10.42.0.77" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
2023-11-15T09:51:40.601993218+08:00 time="2023-11-15T01:51:40Z" level=error msg="Make slave failed, slave ip: 10.42.3.73, master ip: 10.42.0.77, error: dial tcp 10.42.3.73:6379: i/o timeout" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
2023-11-15T09:51:40.602040884+08:00 time="2023-11-15T01:51:40Z" level=error msg="error on object processing: dial tcp 10.42.3.73:6379: i/o timeout" controller-id=redisfailover object-key=prod/redisfailover-persistent operator=redisfailover service=kooper.controller src="controller.go:282"
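For context on the i/o timeout above: once Node1 is down, TCP connections to the old master's pod IP (10.42.3.73) hang until the dial deadline expires instead of failing fast. A minimal sketch that reproduces the same failure mode with go-redis — the timeout value is an assumption for illustration, not the operator's actual configuration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// 10.42.3.73 is the old master's pod IP from the logs above; with
	// its node shut down, the dial blocks until DialTimeout fires.
	client := redis.NewClient(&redis.Options{
		Addr:        "10.42.3.73:6379",
		DialTimeout: 5 * time.Second, // assumed timeout, for illustration
	})
	defer client.Close()

	// Same query the checker issues ("info replication"); against a
	// dead node this returns "dial tcp 10.42.3.73:6379: i/o timeout".
	info, err := client.Info(context.Background(), "replication").Result()
	if err != nil {
		fmt.Println("get info replication failed:", err)
		return
	}
	fmt.Println(info)
}
```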
Expected behaviour
The operator sets the new master's label correctly after the old master goes offline, so that the master service can forward requests to the new master.
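In Kubernetes terms, "set the new master's label" means patching the promoted pod's redisfailovers-role label to master so the master Service's selector matches it again. A minimal client-go sketch of that expected step — the namespace is taken from the logs, but the pod name is an assumed placeholder, and this is an illustration of the action, not the operator's actual code:

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Relabel the newly promoted pod so the master Service selector
	// (redisfailovers-role=master) routes traffic to it.
	patch := []byte(`{"metadata":{"labels":{"redisfailovers-role":"master"}}}`)
	_, err = clientset.CoreV1().
		Pods("prod"). // namespace from the logs
		Patch(context.Background(),
			"rfr-redisfailover-persistent-0", // assumed name of the promoted pod
			types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
}
```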
Actual behaviour
The operator never updates the new master's label: it keeps trying to make the old, unreachable master a slave of the new one, fails with an i/o timeout (see the logs above), and aborts the reconciliation, so the master service still points at the dead pod.
Steps to reproduce the behaviour
Prepare the environment:
Node1: operator1, master, sentinel1
Node2: operator2, slave1, sentinel2
Node3: slave2, sentinel3
Then shut down Node1 and the issue appears: although Sentinel has correctly elected a new master, the operator also tries to make the old master a slave of the new master. That step fails (with the i/o timeout shown in the logs above), and the operator never continues on to label the new master's pod, so the master service cannot forward requests to the new master (see the control-flow sketch below).
However, if instead of shutting down Node1 we simply delete the master pod, the operator works as expected.
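The failure pattern in the logs suggests the healing loop treats any per-node error as fatal for the whole reconcile pass. A hedged sketch of that control flow — setSlaveOf is a hypothetical stand-in, not the operator's real function — just to show why one unreachable old master blocks the relabeling:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// setSlaveOf is a hypothetical stand-in for the healer's "make this
// node a slave of the master" call; dialing a node whose host is down
// fails with an i/o timeout only after the deadline expires.
func setSlaveOf(slaveIP, masterIP string) error {
	conn, err := net.DialTimeout("tcp", net.JoinHostPort(slaveIP, "6379"), 5*time.Second)
	if err != nil {
		return err // e.g. dial tcp 10.42.3.73:6379: i/o timeout
	}
	defer conn.Close()
	// ...send SLAVEOF masterIP 6379 here...
	return nil
}

func main() {
	masterIP := "10.42.0.77"                                          // new master, from the logs
	slaves := []string{"10.42.3.73" /* old master, node is down */, "10.42.1.50" /* assumed */}

	for _, ip := range slaves {
		if err := setSlaveOf(ip, masterIP); err != nil {
			// Abort-on-first-error: the reconcile pass stops here, so
			// the step that would relabel the new master's pod is
			// never reached. Skipping or requeueing the dead node
			// instead would let the relabel proceed.
			fmt.Println("reconcile aborted:", err)
			return
		}
	}
	fmt.Println("all slaves configured; safe to relabel the master pod")
}
```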