Open pawpy opened 7 years ago
Hello, thank you for reporting this. I'll try to replicate it, but one thing is not clear: if no failover occurred, who reconfigured C to be a slave and to replicate from A? At every step, did you check the Sentinel logs to see whether any failover happened, whether the original configuration of the instances was correct, and whether C, when it restarts, for some reason loads some older configuration?
Hello,
Thank you for your response. No, failover happened to A as far as I can see. I'm posting the sentinel logs below:
A - 192.168.0.5, B - 192.168.0.6 and C - 192.168.0.7 (sentinel logs from host A)
5824:X 24 Feb 10:59:49.746 # Sentinel ID is d7471d19edd71ae3482de3285c9c1c01b9f7a31e
5824:X 24 Feb 10:59:49.746 # +monitor master mymaster 192.168.0.7 26380 quorum 2
5824:X 24 Feb 10:59:54.763 # +sdown slave 192.168.0.6:26380 192.168.0.6 26380 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 10:59:54.763 # +sdown sentinel 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 192.168.0.6 5000 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 11:00:15.263 # +sdown master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:00:15.263 # +sdown sentinel c4eaadc05b465dea435937a23f4225adcd822731 192.168.0.7 5000 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 11:01:55.244 # -sdown slave 192.168.0.6:26380 192.168.0.6 26380 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 11:01:55.244 # -sdown sentinel 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 192.168.0.6 5000 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 11:01:59.442 # +new-epoch 39
5824:X 24 Feb 11:01:59.445 # +vote-for-leader 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 39
5824:X 24 Feb 11:01:59.445 # +odown master mymaster 192.168.0.7 26380 #quorum 2/2
5824:X 24 Feb 11:01:59.445 # Next failover delay: I will not start a failover before Fri Feb 24 11:03:59 2017
5824:X 24 Feb 11:03:59.180 # +new-epoch 40
5824:X 24 Feb 11:03:59.183 # +vote-for-leader 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 40
5824:X 24 Feb 11:03:59.189 # Next failover delay: I will not start a failover before Fri Feb 24 11:06:00 2017
5824:X 24 Feb 11:05:25.200 * +reboot master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:05:25.275 # -sdown master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:05:25.275 # -odown master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:05:25.275 # -sdown sentinel c4eaadc05b465dea435937a23f4225adcd822731 192.168.0.7 5000 @ mymaster 192.168.0.7 26380
5824:X 24 Feb 11:05:50.282 # +sdown master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:05:50.685 # +new-epoch 41
5824:X 24 Feb 11:05:50.688 # +vote-for-leader c4eaadc05b465dea435937a23f4225adcd822731 41
5824:X 24 Feb 11:05:51.394 # +odown master mymaster 192.168.0.7 26380 #quorum 3/2
5824:X 24 Feb 11:05:51.394 # Next failover delay: I will not start a failover before Fri Feb 24 11:07:50 2017
5824:X 24 Feb 11:07:50.905 # +new-epoch 42
5824:X 24 Feb 11:07:50.905 # +try-failover master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:50.909 # +vote-for-leader d7471d19edd71ae3482de3285c9c1c01b9f7a31e 42
5824:X 24 Feb 11:07:50.944 # 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 voted for d7471d19edd71ae3482de3285c9c1c01b9f7a31e 42
5824:X 24 Feb 11:07:50.947 # c4eaadc05b465dea435937a23f4225adcd822731 voted for d7471d19edd71ae3482de3285c9c1c01b9f7a31e 42
5824:X 24 Feb 11:07:50.962 # +elected-leader master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:50.962 # +failover-state-select-slave master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:51.017 # -failover-abort-no-good-slave master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:51.083 # Next failover delay: I will not start a failover before Fri Feb 24 11:09:51 2017
5824:X 24 Feb 11:09:51.113 # +new-epoch 43
5824:X 24 Feb 11:09:51.113 # +try-failover master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:09:51.117 # +vote-for-leader d7471d19edd71ae3482de3285c9c1c01b9f7a31e 43
5824:X 24 Feb 11:09:51.149 # 7c38b3d83d0cc303b33454142bbe83d3dbf9f1c8 voted for d7471d19edd71ae3482de3285c9c1c01b9f7a31e 43
5824:X 24 Feb 11:09:51.160 # c4eaadc05b465dea435937a23f4225adcd822731 voted for c4eaadc05b465dea435937a23f4225adcd822731 43
5824:X 24 Feb 11:09:51.184 # +elected-leader master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:09:51.184 # +failover-state-select-slave master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:09:51.267 # -failover-abort-no-good-slave master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:09:51.329 # Next failover delay: I will not start a failover before Fri Feb 24 11:11:51 2017
The only thing I suspect of making C think A is the master is the slaveof A setting in the Redis server config.
Just adding that I have slaveof A set in the Redis configuration on B and C (added to initialize the master-slave replication setup).
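A log like the one above can be checked mechanically for whether a failover ever completed. The sketch below is only an illustration: it assumes the line layout seen in this log, counts Sentinel events (the tokens starting with + or -), and treats a +switch-master event as the sign of a completed failover. The function names and the sample lines chosen are picked for this example.

```python
import re
from collections import Counter

# Matches Sentinel log lines such as:
#   5824:X 24 Feb 11:07:50.905 # +try-failover master mymaster 192.168.0.7 26380
# Lines without a +event/-event token (e.g. "Next failover delay: ...") are skipped.
LINE_RE = re.compile(
    r"^\d+:X \d{1,2} \w{3} \d{2}:\d{2}:\d{2}\.\d{3} [#*] "
    r"(?P<event>[+-][a-z-]+)"
)

def summarize(lines):
    """Count how often each Sentinel event appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("event")] += 1
    return counts

def failover_completed(counts):
    """A completed failover is announced with a +switch-master event."""
    return counts["+switch-master"] > 0

# A few lines copied from the log posted above.
SAMPLE_LOG = """\
5824:X 24 Feb 11:01:59.445 # +odown master mymaster 192.168.0.7 26380 #quorum 2/2
5824:X 24 Feb 11:07:50.905 # +try-failover master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:51.017 # -failover-abort-no-good-slave master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:07:51.083 # Next failover delay: I will not start a failover before Fri Feb 24 11:09:51 2017
5824:X 24 Feb 11:09:51.113 # +try-failover master mymaster 192.168.0.7 26380
5824:X 24 Feb 11:09:51.267 # -failover-abort-no-good-slave master mymaster 192.168.0.7 26380
"""

counts = summarize(SAMPLE_LOG.splitlines())
for event, n in sorted(counts.items()):
    print(event, n)
print("failover completed:", failover_completed(counts))
```

On these lines the failover attempts are all aborted with -failover-abort-no-good-slave, and no +switch-master appears.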
The problem appears to be that Sentinel does not know that you changed the configuration without it performing a failover. If you want to switch master, you have to trigger a failover via Sentinel, using the manual failover Sentinel procedure. Then, when it performs a failover, it will update the configurations of the instances using CONFIG REWRITE and other precautions. Here instead the configuration of instances A, B, and C is different from what Sentinel believes: you start monitoring them when C is the master, but C's configuration says it is a slave of A.
So basically:
redis.conf in each instance.
At this point, every time there is a failover, Sentinel will make sure that all the configurations are in sync.
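The mismatch described here, where Sentinel monitors C as master while the instance configuration still says slaveof A, can be expressed as a small consistency check. This is a sketch only: the function name and the dictionary format are hypothetical, and the input is assumed to have been collected by hand from each instance's redis.conf (A's own configuration is not stated in the thread, so it is left out).

```python
def find_config_mismatches(sentinel_master, configured_slaveof):
    """Compare Sentinel's idea of the master with each instance's slaveof setting.

    sentinel_master: name of the instance Sentinel monitors as master.
    configured_slaveof: maps instance name -> slaveof target in its redis.conf,
        or None when the file has no slaveof line.
    Returns a list of human-readable problems (empty when consistent).
    """
    problems = []
    for instance, target in sorted(configured_slaveof.items()):
        if instance == sentinel_master:
            if target is not None:
                problems.append(
                    f"{instance}: monitored as master, but configured with slaveof {target}"
                )
        elif target != sentinel_master:
            problems.append(
                f"{instance}: expected slaveof {sentinel_master}, found slaveof {target}"
            )
    return problems

# The situation described above: Sentinel monitors C, B and C say slaveof A.
problems = find_config_mismatches("C", {"B": "A", "C": "A"})
for problem in problems:
    print(problem)
```

A check like this only reports the disagreement. Changing which instance is master still has to go through Sentinel, as explained above.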
@antirez no, the Redis configuration has not been changed to trigger a failover. I was only pointing out that slaveof A is the initial configuration on the B and C hosts (as in step (1) of your recommended usage), and perhaps this setting is making C think A is the master.
Strangely enough, B picked C as the master correctly even though it had the same slaveof A in its Redis configuration.
@pawpy sorry but I don't understand, in the original message you wrote:
A, B, C servers in 1 master, 2 slaves, 3 sentinels setup. A, B are slaves and C is master. The redis server conf has slaveof A on B and C for initialization purposes
How is it possible that C is now a master if the cluster was started with B and C configured with slaveof A?
@antirez sorry if my description of the problem is unclear.
The Sentinel setup was initialized with A as master and B and C as slaves (using slaveof A).
Weeks later, after possibly many failovers (all triggered by Sentinel), we have C as master with A and B as slaves. This is when two of our systems went down: first the slave B (step 1), then the master C (step 2), followed by the sequence of events in steps 3-5, which left no master up in the HA Sentinel setup.
As far as I can see, the initialization state doesn't matter. I was able to reproduce this in a couple of our HA test clusters. What I'm essentially doing is bringing down one of the slaves, then the master, then bringing the downed slave back up, followed by the master.
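The end state discussed in this thread, where C comes back replicating from A while A is still a slave of C, is a cycle in the "replicates from" relation. A minimal sketch of detecting such a loop follows. The function name and the dictionary format are made up for this example, and the input is assumed to have been gathered separately, for instance from each instance's replication info.

```python
def find_replication_loop(replica_of):
    """Return the instances forming a replication loop, or None if there is none.

    replica_of maps each instance to the instance it replicates from,
    or to None when the instance is a master.
    """
    for start in replica_of:
        seen = [start]
        node = replica_of.get(start)
        while node is not None:
            if node in seen:
                # Everything from the first visit of `node` onward is the loop.
                return seen[seen.index(node):]
            seen.append(node)
            node = replica_of.get(node)
    return None

# Healthy state: C is master, A and B replicate from it.
healthy = {"A": "C", "B": "C", "C": None}
# State after the restarts described above: C replicates from A, A from C.
looped = {"A": "C", "B": "C", "C": "A"}

print(find_replication_loop(healthy))
print(find_replication_loop(looped))
```

In the looped state no instance is a master, which matches the "no master up" outcome described above.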
Hello, have you had a chance to look at this further?
@pawpy Did you ever get this figured out? I'm thinking about using Sentinel, but not sure it's worth the trouble.
Hi all.
@kaizen1 Redis Sentinel is a must in an HA environment.
I have the same problem, I guess. I receive many emails from Redis Sentinel, which is "flipping" master-slave on many nodes, and after a while a new master is elected even though the VMs are still up. Consider this email flux:
When this happened, Redis held almost 8/10 million keys, and I'm pretty sure it is related to our wrong use of keys, which collapsed our Redis once we started using it for something more. We reached 40k ops/sec. When I removed all the keys usage, the problem disappeared, and the load dropped to a modest 5k ops/sec.
keys is blocking, so if Sentinel sends a PING and waits for the PONG response while a wrongly launched keys is blocking a huge keyspace, this could be treated as a failure, since the response could take 20-30 seconds to arrive.
For us, SLOWLOG GET helped a lot.
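One common way to avoid a long blocking keys call is to walk the keyspace incrementally with SCAN, which returns a cursor and a small batch per call so other commands (including Sentinel's PING) can be served in between. This is not something the poster says they did; it is a sketch under that assumption. The client is assumed to expose a scan(cursor, match, count) method returning a (next_cursor, keys) pair, as redis-py does; FakeClient is a stand-in invented here so the example runs without a server.

```python
from fnmatch import fnmatch

def iter_keys(client, pattern="*", count=1000):
    """Yield matching keys in batches via SCAN instead of one blocking KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # a returned cursor of 0 means the iteration is complete
            return

class FakeClient:
    """Hypothetical in-memory stand-in for a Redis client, for illustration only."""

    def __init__(self, keys, page=2):
        self._keys = list(keys)
        self._page = page

    def scan(self, cursor=0, match=None, count=None):
        # Return one page of keys plus the next cursor (0 when done).
        chunk = self._keys[cursor:cursor + self._page]
        next_cursor = cursor + self._page
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, [k for k in chunk if fnmatch(k, match or "*")]

client = FakeClient(["user:1", "user:2", "job:1", "user:3", "job:2"])
user_keys = list(iter_keys(client, pattern="user:*"))
print(user_keys)
```

Each SCAN call does a bounded amount of work, so the total scan takes longer but no single call holds the server for seconds at a time.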
OT: Thank you, Salvatore, for giving us Redis 👍
Redis server version: 3.2.4-1
Setup:
slaveof A on B and C for initialization purposes
2 in sentinel monitor in sentinel config
Issue:
We see a replication loop in the scenario below.