spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Bug: Operator did not set the master's label correctly after the master went offline #674

Closed wusendong closed 5 months ago

wusendong commented 7 months ago

Expected behaviour

The operator sets the new master's label correctly after the master goes offline, so that the master service can forward requests to the new master.

Actual behaviour

The operator did not set the new master's label after the master went offline, so the master service could not forward requests to the new master.

Steps to reproduce the behaviour

Prepare the environment:

- Node1: operator1, master, sentinel1
- Node2: operator2, slave1, sentinel2
- Node3: slave2, sentinel3

Then shut down Node1 and the issue appears: although Sentinel correctly elects a new master, the operator also tries to make the old master a slave of the new master. That attempt fails (with a timeout, as the logs show), and the operator does not go on to label the new master's pod, so the master service cannot forward requests to the new master.

However, if we do not shut down Node1 and instead just delete the master pod, the operator works as expected.
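The behaviour described above is consistent with a fail-fast heal loop: when converting one replica fails, the reconcile returns early and the label-update step is never reached. A minimal sketch of that control flow (hypothetical function names, not the operator's actual code), with a tolerant variant that collects errors instead of aborting:

```go
package main

import "fmt"

// makeSlaveOf simulates pointing a pod at the new master; the old
// master's node is down, so dialing it times out.
func makeSlaveOf(podIP string, reachable map[string]bool) error {
	if !reachable[podIP] {
		return fmt.Errorf("dial tcp %s:6379: i/o timeout", podIP)
	}
	return nil
}

// healFailFast mirrors the reported behaviour: the first error aborts
// the reconcile, so the new master's pod label is never updated.
func healFailFast(pods []string, masterIP string, reachable map[string]bool) (labeled bool, err error) {
	for _, ip := range pods {
		if ip == masterIP {
			continue
		}
		if err := makeSlaveOf(ip, reachable); err != nil {
			return false, err // label step below is skipped
		}
	}
	// only reached when every replica was converted successfully
	return true, nil
}

// healTolerant sketches one possible fix: record the error for the
// unreachable pod but keep going, so the surviving master still gets
// its role label.
func healTolerant(pods []string, masterIP string, reachable map[string]bool) (labeled bool, errs []error) {
	for _, ip := range pods {
		if ip == masterIP {
			continue
		}
		if err := makeSlaveOf(ip, reachable); err != nil {
			errs = append(errs, err)
		}
	}
	return true, errs
}

func main() {
	// old master (node down), surviving slave, newly elected master
	pods := []string{"10.42.3.73", "10.42.2.50", "10.42.0.77"}
	reachable := map[string]bool{"10.42.2.50": true, "10.42.0.77": true}

	labeled, err := healFailFast(pods, "10.42.0.77", reachable)
	fmt.Printf("fail-fast: labeled=%v err=%v\n", labeled, err)

	labeled, errs := healTolerant(pods, "10.42.0.77", reachable)
	fmt.Printf("tolerant:  labeled=%v errs=%d\n", labeled, errs)
}
```

This also explains why deleting only the master pod works: the replacement pod is reachable, so no conversion fails and the loop reaches the labeling step.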

Environment

How are the pieces configured?

Logs

time="2023-11-15T01:50:23Z" level=info msg="Listening on :9710 for metrics exposure on URL /metrics" src="asm_amd64.s:1598"
2023-11-15T09:50:23.025713542+08:00 time="2023-11-15T01:50:23Z" level=info msg="running in leader election mode, waiting to acquire leadership..." leader-election-id=redis-operator/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="controller.go:231"
I1115 01:50:23.026872       1 leaderelection.go:245] attempting to acquire leader lease redis-operator/redis-failover-lease...
I1115 01:50:40.062229       1 leaderelection.go:255] successfully acquired lease redis-operator/redis-failover-lease
2023-11-15T09:50:40.062883098+08:00 time="2023-11-15T01:50:40Z" level=info msg="lead acquire, starting..." leader-election-id=redis-operator/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="asm_amd64.s:1598"
2023-11-15T09:50:40.062890471+08:00 time="2023-11-15T01:50:40Z" level=info msg="starting controller" controller-id=redisfailover operator=redisfailover service=kooper.controller src="controller.go:232"
time="2023-11-15T01:51:00Z" level=error msg="Get redis info failed, maybe this node is not ready, pod ip: 10.42.3.73" src="checker.go:113"
2023-11-15T09:51:00.373556047+08:00 time="2023-11-15T01:51:00Z" level=info msg="Update pod label, namespace: prod, pod name: rfr-redisfailover-persistent-1, labels: map[redisfailovers-role:slave]" service=k8s.pod src="check.go:102"
2023-11-15T09:51:20.478215616+08:00 time="2023-11-15T01:51:20Z" level=error msg="error while getting masterIP : Failed to get info replication while querying redis instance 10.42.3.73" src="check.go:131"
time="2023-11-15T01:51:20Z" level=error msg="Get slave of master failed, maybe this node is not ready, pod ip: 10.42.3.73" src="checker.go:194"
2023-11-15T09:51:20.478247383+08:00 time="2023-11-15T01:51:20Z" level=warning msg="Slave not associated to master: dial tcp 10.42.3.73:6379: i/o timeout" namespace=prod redisfailover=redisfailover-persistent src="handler.go:79"
time="2023-11-15T01:51:20Z" level=info msg="Making pod rfr-redisfailover-persistent-0 slave of 10.42.0.77" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
time="2023-11-15T01:51:20Z" level=info msg="Making pod rfr-redisfailover-persistent-1 slave of 10.42.0.77" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
2023-11-15T09:51:40.601993218+08:00 time="2023-11-15T01:51:40Z" level=error msg="Make slave failed, slave ip: 10.42.3.73, master ip: 10.42.0.77, error: dial tcp 10.42.3.73:6379: i/o timeout" namespace=prod redisfailover=redisfailover-persistent service=redis.healer src="checker.go:198"
2023-11-15T09:51:40.602040884+08:00 time="2023-11-15T01:51:40Z" level=error msg="error on object processing: dial tcp 10.42.3.73:6379: i/o timeout" controller-id=redisfailover object-key=prod/redisfailover-persistent operator=redisfailover service=kooper.controller src="controller.go:282"
github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.