spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Labels of slave rfr pod not changed during failover #608

Closed ouyangde closed 11 months ago

ouyangde commented 1 year ago

I use this label to select the master pod for serving traffic from outside of the cluster.
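
For context, selecting the master by this label looks roughly like the following (a minimal client-go sketch for illustration only; the namespace and the use of client-go are assumptions, and a Service with the same label selector achieves the same thing):

// Minimal sketch: find the pod currently labelled as the Redis master.
// The namespace "dtg-test" and the in-cluster config are illustrative assumptions.
package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // List pods carrying the label the operator maintains on the master.
    pods, err := client.CoreV1().Pods("dtg-test").List(context.TODO(), metav1.ListOptions{
        LabelSelector: "redisfailovers-role=master",
    })
    if err != nil {
        panic(err)
    }
    for _, p := range pods.Items {
        fmt.Printf("master pod: %s (%s)\n", p.Name, p.Status.PodIP)
    }
}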

Expected behaviour

When the master pod rfr-0 shuts down and the master role is switched to pod rfr-1, the label of rfr-1 should change to redisfailovers-role=master.

Actual behaviour

When the master pod rfr-0 shuts down and the master role is switched to pod rfr-1, the label of rfr-1 remains redisfailovers-role=slave.

Steps to reproduce the behaviour

  1. Deploy a redisfailover with 3 Sentinels and 2 Redis instances, with auth enabled.
  2. When the pods are ready, shut down the node that rfr-0 is assigned to.
  3. Wait until the master has switched.
  4. Then verify the labels of rfr-1.

Environment

How are the pieces configured?

Logs

NAME                                READY   STATUS        RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
rfr-redisfailover-0                 1/1     Terminating   0          32m   1.1.169.167   devnode6
rfr-redisfailover-1                 1/1     Running       0          32m   1.1.209.126   devnode4
rfs-redisfailover-7cd5d4998-6dk5l   1/1     Terminating   0          32m   1.1.169.165   devnode6
rfs-redisfailover-7cd5d4998-hgx8z   0/1     Running       0          38s   1.1.209.71    devnode4
rfs-redisfailover-7cd5d4998-hllfn   1/1     Running       0          32m   1.1.209.125   devnode4
rfs-redisfailover-7cd5d4998-nmmkb   1/1     Running       0          32m   1.1.162.94    devnode3

[root@devnode1 ~]# kubectl -n dtg-test logs --tail=10 -f rfs-redisfailover-7cd5d4998-hllfn
1:X 29 May 2023 09:57:58.347 # +set master mymaster 1.1.169.167 6379 down-after-milliseconds 5000
1:X 29 May 2023 09:57:58.351 # +set master mymaster 1.1.169.167 6379 failover-timeout 10000
1:X 29 May 2023 09:58:17.480 # +sdown master mymaster 1.1.169.167 6379
1:X 29 May 2023 09:58:17.480 # +sdown sentinel 6e481ec7294ffa53156303813529384034cca4d5 1.1.169.165 26379 @ mymaster 1.1.169.167 6379
1:X 29 May 2023 09:58:17.620 # +new-epoch 1
1:X 29 May 2023 09:58:17.623 # +vote-for-leader 6649e230b6770426a585a5f7af87729ccd0bfebf 1
1:X 29 May 2023 09:58:18.289 # +config-update-from sentinel 6649e230b6770426a585a5f7af87729ccd0bfebf 1.1.162.94 26379 @ mymaster 1.1.169.167 6379
1:X 29 May 2023 09:58:18.289 # +switch-master mymaster 1.1.169.167 6379 1.1.209.126 6379
1:X 29 May 2023 09:58:18.290 * +slave slave 1.1.169.167:6379 1.1.169.167 6379 @ mymaster 1.1.209.126 6379
1:X 29 May 2023 09:58:23.336 # +sdown slave 1.1.169.167:6379 1.1.169.167 6379 @ mymaster 1.1.209.126 6379

[root@devnode1 ~]# kubectl -n dtg-ops logs redis-operator-7cf586fd49-xpk4v
time="2023-05-29T10:00:59Z" level=info msg="Listening on :9710 for metrics exposure on URL /metrics" src="asm_amd64.s:1594"
time="2023-05-29T10:00:59Z" level=info msg="running in leader election mode, waiting to acquire leadership..." leader-election-id=dtg-ops/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="controller.go:228"
I0529 10:00:59.641145 1 leaderelection.go:248] attempting to acquire leader lease dtg-ops/redis-failover-lease...
I0529 10:01:17.178453 1 leaderelection.go:258] successfully acquired lease dtg-ops/redis-failover-lease
time="2023-05-29T10:01:17Z" level=info msg="lead acquire, starting..." leader-election-id=dtg-ops/redis-failover-lease operator=redisfailover source-service=kooper/leader-election src="asm_amd64.s:1594"
time="2023-05-29T10:01:17Z" level=info msg="starting controller" controller-id=redisfailover operator=redisfailover service=kooper.controller src="controller.go:229"
time="2023-05-29T10:17:47Z" level=warning msg="Slave not associated to master: slave 1.1.108.150 don't have the master 1.1.209.126, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:17:47Z" level=info msg="Making pod rfr-redisfailover-0 slave of 1.1.209.126" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:17:47Z" level=warning msg="Fixing sentinel not monitoring expected master: sentinel monitoring 127.0.0.1:6379 instead 1.1.209.126:6379" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:17:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:17:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-1, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:19:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-0, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"
time="2023-05-29T10:19:17Z" level=warning msg="Slave not associated to master: slave 1.1.169.158 don't have the master 1.1.108.150, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:19:17Z" level=info msg="Making pod rfr-redisfailover-1 slave of 1.1.108.150" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:19:47Z" level=warning msg="Slave not associated to master: slave 1.1.169.169 don't have the master 1.1.108.150, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:19:47Z" level=info msg="Making pod rfr-redisfailover-1 slave of 1.1.108.150" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:19:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:20:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:20:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:47:47Z" level=warning msg="Slave not associated to master: slave 1.1.108.157 don't have the master 1.1.169.169, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:47:47Z" level=info msg="Making pod rfr-redisfailover-0 slave of 1.1.169.169" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:48:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-1, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"
time="2023-05-29T10:48:17Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:48:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:48:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:49:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-0, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"
time="2023-05-29T10:49:17Z" level=warning msg="Slave not associated to master: slave 1.1.169.137 don't have the master 1.1.108.157, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:49:17Z" level=info msg="Making pod rfr-redisfailover-1 slave of 1.1.108.157" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:49:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:49:47Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:49:47Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"

github-actions[bot] commented 12 months ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

hashemi-soroush commented 11 months ago

Unfortunately, I won't have access to a Kubernetes cluster for a few weeks, and I doubt KIND can simulate a node shutdown well enough to reproduce this issue. I hope @mrtaalebi can help us reproduce it, since we rely heavily on the label-change mechanism for #627, which we are currently using in production.

I read the logs you provided very carefully and don't think what is happening has anything to do with the labels. In this comment, I will explain why, and in another, I will explain my guess as to what exactly is happening in the logs you sent.

The redisfailovers-role label has no effect on the working of Redis, Sentinel, or the operator, so if you set it wrong, or remove it altogether, nothing is affected. If you look for all occurrences of the redisfailovers-role label in the codebase, you will see that it's not used anywhere except in the methods that set it.
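
For illustration, "setting the label" amounts to nothing more than a pod-metadata patch like the one below (a hedged sketch, not the operator's actual code; the real implementation lives behind the k8s.pod service you see in the logs). The point is that the label is only ever written, never read back by the operator:

// Sketch only: roughly what updating the redisfailovers-role label boils down to.
// This is not the operator's code; nothing in the codebase reads the label back.
package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func setRoleLabel(client kubernetes.Interface, ns, pod, role string) error {
    // Strategic-merge patch that only touches the one label.
    patch := []byte(`{"metadata":{"labels":{"redisfailovers-role":"` + role + `"}}}`)
    _, err := client.CoreV1().Pods(ns).Patch(
        context.TODO(), pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
    return err
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    // Corresponds to a log line like "Update pod label ... rfr-redisfailover-1 ... master".
    if err := setRoleLabel(client, "dtg-test", "rfr-redisfailover-1", "master"); err != nil {
        panic(err)
    }
}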

ouyangde commented 11 months ago

To add a little more information, hoping it provides more clues: if we delete the pods (rfr-0 or rfr-1) alternately, the labels switch as expected. However, this is not the case when shutting down nodes.

hashemi-soroush commented 11 months ago

So, as I said, this is the comment where I will explain my guess as to what exactly is causing this issue.

This is a very long comment. To make it more readable, it is divided into two sections: the first explains what is happening according to the logs, and the second is my guess as to what is actually going on.

Let's read the logs more carefully

The first section, which I call the pod status log, shows that with the shutdown of node devnode6 the master Redis instance and one Sentinel instance are being terminated and replaced. The new Sentinel's IP is 1.1.209.71 and the surviving Redis instance's IP is 1.1.209.126 (we will need these IPs later).

Now, in the Sentinel log, we see that this Sentinel has marked both terminating instances, the master and the other Sentinel, as subjectively down (+sdown):

1:X 29 May 2023 09:58:17.480 # +sdown master mymaster 1.1.169.167 6379
1:X 29 May 2023 09:58:17.480 # +sdown sentinel 6e481ec7294ffa53156303813529384034cca4d5 1.1.169.165 26379 @ mymaster 1.1.169.167 6379

Since the current master is terminating, the Sentinels elect a leader among themselves and promote the other Redis instance to master before the old master actually terminates:

1:X 29 May 2023 09:58:17.620 # +new-epoch 1
1:X 29 May 2023 09:58:17.623 # +vote-for-leader 6649e230b6770426a585a5f7af87729ccd0bfebf 1
1:X 29 May 2023 09:58:18.289 # +config-update-from sentinel 6649e230b6770426a585a5f7af87729ccd0bfebf 1.1.162.94 26379 @ mymaster 1.1.169.167 6379
1:X 29 May 2023 09:58:18.289 # +switch-master mymaster 1.1.169.167 6379 1.1.209.126 6379
1:X 29 May 2023 09:58:18.290 * +slave slave 1.1.169.167:6379 1.1.169.167 6379 @ mymaster 1.1.209.126 6379

Finally, the old master, which is now a slave, goes down. Note that all of this happened before 09:59.

1:X 29 May 2023 09:58:23.336 # +sdown slave 1.1.169.167:6379 1.1.169.167 6379 @ mymaster 1.1.209.126 6379

Now, to the operator's log. The first few lines are about the operator initialization process and have nothing to do with the issue. Note that the operator finished initialization at 10:01, which is after the Sentinels have elected a new master. In my experience, this is fine and the operator works fine in this case too.

... (omitted lines)
time="2023-05-29T10:01:17Z" level=info msg="starting controller" controller-id=redisfailover operator=redisfailover service=kooper.controller src="controller.go:229"

At 10:17, 19 minutes after the Sentinels elected a new master (!!!), the operator changes that master's redisfailovers-role label to master and attaches the new Redis instance (IP: 1.1.108.150) to it as a slave. I wonder what happened during those 19 minutes. :thinking:

time="2023-05-29T10:17:47Z" level=warning msg="Slave not associated to master: slave 1.1.108.150 don't have the master 1.1.209.126, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:17:47Z" level=info msg="Making pod rfr-redisfailover-0 slave of 1.1.209.126" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
... (3 lines omitted, we will get back to them soon)
time="2023-05-29T10:18:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-1, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"

Right after, the operator configures the new Sentinel, connects it to the master, and configures the other Sentinels to expect a new Sentinel and a new slave to join.

time="2023-05-29T10:17:47Z" level=warning msg="Fixing sentinel not monitoring expected master: sentinel monitoring 127.0.0.1:6379 instead 1.1.209.126:6379" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:17:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:17:47Z" level=warning msg="Sentinel 1.1.209.71 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
... (1 line omitted, discussed in the previous code block)
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of sentinels in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.209.125 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"
time="2023-05-29T10:18:17Z" level=warning msg="Sentinel 1.1.162.94 mismatch number of expected slaves in memory. resetting" namespace=dtg-test redisfailover=redisfailover src="checker.go:230"

From here on, for some reason, at least one Redis instance goes down every 30 seconds. For example, here the instance with IP 1.1.108.150, which was the slave before, is now the master, and a new instance with IP 1.1.169.158 is being attached to it as a slave.

time="2023-05-29T10:19:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-0, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"
time="2023-05-29T10:19:17Z" level=warning msg="Slave not associated to master: slave 1.1.169.158 don't have the master 1.1.108.150, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:19:17Z" level=info msg="Making pod rfr-redisfailover-1 slave of 1.1.108.150" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"

And here, both instances from the previous log are gone and now we have two new instances with IPs 1.1.169.169 and 1.1.108.157.

time="2023-05-29T10:47:47Z" level=warning msg="Slave not associated to master: slave 1.1.108.157 don't have the master 1.1.169.169, has 127.0.0.1" namespace=dtg-test redisfailover=redisfailover src="handler.go:79"
time="2023-05-29T10:47:47Z" level=info msg="Making pod rfr-redisfailover-0 slave of 1.1.169.169" namespace=dtg-test redisfailover=redisfailover service=redis.healer src="checker.go:198"
time="2023-05-29T10:48:17Z" level=info msg="Update pod label, namespace: dtg-test, pod name: rfr-redisfailover-1, labels: map[redisfailovers-role:master]" service=k8s.pod src="check.go:92"

What is going on?

First question: Why are we seeing these logs every 30 seconds?

The Kooper framework, on which this operator is built, has a periodic check mechanism that collects all the watched CRs and hands them to the Handler, which contains essentially the entire logic of this operator. The default period is 30 seconds, which is why we see these logs every 30 seconds.
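
Conceptually, the loop looks like this (a simplified sketch of the idea, not kooper's actual API; the 30-second interval is the only value taken from the defaults mentioned above):

// Conceptual sketch of the periodic resync described above. This is not the
// kooper API; it only shows why the same checks (and therefore the same log
// lines) reappear every interval, regardless of cluster events.
package main

import (
    "context"
    "log"
    "time"
)

// reconcile stands in for the operator's Handler: check every RedisFailover,
// fix slaves, fix sentinels, update the redisfailovers-role labels.
func reconcile(ctx context.Context) {
    log.Println("running periodic check of all RedisFailover resources")
}

func main() {
    const resyncInterval = 30 * time.Second // the default period

    ctx := context.Background()
    ticker := time.NewTicker(resyncInterval)
    defer ticker.Stop()

    for {
        reconcile(ctx)
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}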

Why do Redis instances keep terminating?

A pod is either terminated by Kubernetes or shuts down on its own.

Does Kubernetes keep terminating Redis pods?

There are a few scenarios where Kubernetes terminates a pod, but only one of them happens this often in a healthy Kubernetes cluster: liveness probe failure. The default config of the liveness probe this operator sets on Redis instances is as follows (from operator/redisfailover/service/generator_test.go):

InitialDelaySeconds: 30
TimeoutSeconds:      5
FailureThreshold:    6
PeriodSeconds:       15

It says Kubernetes should wait 30 seconds after a pod starts before it even begins checking liveness, and then tolerate 6 failed checks spaced 15 seconds apart. Since we observed pods going down in less than 30 seconds, we can be sure that it's not Kubernetes doing it.
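
To make the arithmetic explicit: with these settings, the earliest a liveness-triggered restart can happen is 30 + 6 × 15 = 120 seconds after the pod starts, so restarts that recur within 30 seconds cannot be liveness kills. A small sketch using the same values (the corev1.Probe literal is for illustration; the real probe is built in the operator's generator code):

// Back-of-the-envelope check of the argument above, using the quoted probe values.
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func main() {
    probe := corev1.Probe{
        InitialDelaySeconds: 30,
        TimeoutSeconds:      5,
        FailureThreshold:    6,
        PeriodSeconds:       15,
    }

    // Earliest possible liveness-triggered restart: the initial delay plus
    // FailureThreshold consecutive failed checks, one per period.
    minSeconds := probe.InitialDelaySeconds + probe.FailureThreshold*probe.PeriodSeconds
    fmt.Printf("earliest liveness kill: %d seconds after the pod starts\n", minSeconds) // 120
}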

So, why do pods keep shutting themselves down?

The best way to find out is to check the logs of the pods that shut down, but since I won't have access to a real Kubernetes cluster anytime soon, I have to ask for your help. Could you please share the logs of a few terminated pods? My guess is that it has something to do with a corrupted AOF file from the Redis instance that was terminated by the node shutdown.

ouyangde commented 11 months ago

I don't think the issue is related to pods repeatedly shutting down. In fact, the missing "19 minutes" provides a clue. Please note that the operator finished initialization at 10:01, which was after the Redis pod had entered the Terminating status. This happened because the operator itself was on the same node as the Redis pod and got rescheduled when the Redis pod was terminating during node shutdown.

I tried placing the operator on a separate node, and everything worked fine. Additionally, when I manually scaled down the operator, shut down the node the master Redis pod was assigned to, waited for the pod's status to become Terminating, and then scaled the operator back up, the issue was reproduced.

ouyangde commented 11 months ago

Those logs only show that I waited at least 19 minutes before operating it again. Perhaps I restarted the shut-down node; I don't remember the details very precisely.

hashemi-soroush commented 11 months ago

Thank you for sharing additional details of your experiment.

As I said, the redisfailovers-role label does not affect the working of Redis, the Sentinels, or the operator, so it's not the issue. On the other hand, AOF corruption during a shutdown is inherently random (otherwise Redis would prevent or fix it), so it's possible that we are observing unrelated things that merely coincide with it. I believe the only way forward is to reproduce the issue on a real Kubernetes cluster and share the logs of the pods that are being repeatedly terminated. Unfortunately, I won't have access to a real Kubernetes cluster for a few weeks.

hashemi-soroush commented 11 months ago

I talked to @mrtaalebi today. He ran the experiment and gained more insight into the root cause. After the node shutdown, he read the logs of the Redis and Sentinel instances and realized a split brain had happened: there were two Redis masters, each with a set of Sentinels vouching for it. So the problem is more serious than this issue initially suggested. @ouyangde, I strongly recommend renaming the issue to something related to split brain to get more attention from both the maintainers and the community.

P.S.: since the operator sets the redisfailovers-role label of a pod according to the pod's own report of its role, a split brain leaves us with two pods labelled master, which causes problems for PR #627. The previous method of connecting clients to the Redis master relied on the Sentinels to tell which instance is the master; since the split brain affects the Sentinels too, both methods fail to consistently connect clients to the same master.

hashemi-soroush commented 11 months ago

Thank you, @ebuildy , for bringing this issue to my attention. It's a very serious issue.

ebuildy commented 11 months ago

We use HAProxy in front of all Redis instances because of that.

hashemi-soroush commented 11 months ago

I don't understand how using HAProxy helps with this issue. Can you elaborate, @ebuildy?

ebuildy commented 11 months ago

Well, the idea here is to have a service that talks to the master node.

HAProxy can determine, via a health check, which instance is the master and send traffic to it.

See https://www.haproxy.com/blog/haproxy-advanced-redis-health-check

This is a very common pattern.
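
In Go terms, the health check HAProxy performs on each backend boils down to something like this (a standalone sketch of the idea, not HAProxy configuration; the addresses are placeholders and AUTH is omitted, so a password-protected setup like the reporter's would need an AUTH step first):

// Sketch: ask each Redis backend for INFO replication and route only to the
// instance that reports role:master, which is what the HAProxy check does.
package main

import (
    "bufio"
    "fmt"
    "net"
    "strings"
    "time"
)

func isMaster(addr string) bool {
    conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    if err != nil {
        return false
    }
    defer conn.Close()

    // Redis accepts inline commands, so a plain-text request is enough here.
    fmt.Fprintf(conn, "INFO replication\r\n")
    conn.SetReadDeadline(time.Now().Add(2 * time.Second))

    reply := make([]byte, 4096)
    n, _ := bufio.NewReader(conn).Read(reply)
    return strings.Contains(string(reply[:n]), "role:master")
}

func main() {
    // Placeholder addresses; in practice these would be the Redis pod IPs.
    for _, addr := range []string{"1.1.209.126:6379", "1.1.108.150:6379"} {
        fmt.Printf("%s master=%v\n", addr, isMaster(addr))
    }
}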


hashemi-soroush commented 10 months ago

The issue here is that both the Sentinels and the Redis instances vouch for two masters simultaneously. If you ask each of those two masters whether it is the master, both say yes. If you ask all the slaves which instance is the master, you get two different answers. The same goes for the Sentinels. How could you possibly solve this with HAProxy, or any other load-balancing software for that matter?