Failover not as expected - split brain ended with multiple masters

txabman42 commented 1 year ago

I would like to know if you have faced a similar problem and if it could be related to the redis-operator.

I'm running sentinel on K8S using spotahome/redis-operator, helm chart version: v3.2.5.

I have not been able to reproduce the problem; similar upgrades went well. It looks like a split-brain during the rollout process. The redis-operator asks for a manual fix, which is not reassuring for production environments.

Context

I have 3 redis and 3 sentinel instances using the image redis:6.2.6-alpine.

redis.conf

slaveof 127.0.0.1 6379
port 6379
tcp-keepalive 60
save 900 1
save 300 10
user pinger -@all +ping on >pingpass
masterauth pass
requirepass pass

sentinel.conf

sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 5000
sentinel parallel-syncs mymaster 2
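
To make the numbers in this config concrete, here is a minimal sketch of the two conditions Sentinel needs before a failover can proceed: at least `quorum` Sentinels must agree the master is down, and a majority of all known Sentinels must authorize the failover. The function name `failover_possible` is hypothetical, and the sketch assumes every reachable Sentinel agrees about the master's state.

```python
def failover_possible(total_sentinels: int, quorum: int, reachable: int) -> bool:
    """Return True if `reachable` Sentinels could start and authorize a failover.

    Assumption: all reachable Sentinels agree that the master is down.
    """
    majority = total_sentinels // 2 + 1  # votes needed to authorize a failover
    return reachable >= quorum and reachable >= majority


# 3 Sentinels and quorum 2, as in the sentinel.conf above.
for up in range(4):
    print(f"{up} sentinel(s) reachable -> failover possible: {failover_possible(3, 2, up)}")
```

With 3 Sentinels and a quorum of 2, a failover needs at least 2 Sentinels up at the same moment, so a rollout that leaves only one Sentinel reachable cannot complete one.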

K8S PDB:

NAME                        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
rfr-redis-routes-sentinels        2                N/A                 1                 
rfs-redis-routes-sentinels        2                N/A                 1

While updating the image to v6.2.12, I ran into the following situation:

Sentinel A --- restarted (13:22:25)
Redis-0 --- restarted (13:22:37)
Sentinel B --- restarted (13:23:08)
Sentinel C --- restarted (13:23:47)
Redis-1 --- restarted (13:23:41)
Redis-0 --- connection lost with master (Redis-2) --- became master (13:24:34)
- From Redis-0 logs: Connection with master lost --- Caching the disconnected master state --- Discarding previously cached master state
- From Redis-2 logs: Connection with replica 172.16.89.182:6379 lost
- From sentinel logs: Executing user requested FAILOVER of 'mymaster' --- +new-epoch 7 --- ... --- +elected-leader master mymaster 172.16.55.123 6379 (Redis-2)
  - Note that here the failover process doesn't finish with failover-end
Redis-1 --- became slave of Redis-0 (13:25:04)
Redis-1 --- connection lost with master (Redis-0) --- became master (13:25:10)
Redis-2 --- restarted (13:25:15)
Redis-2 can't start correctly
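
As a reading aid for the timeline, here is a small sketch that sorts the timestamped events and measures how long the Sentinel restarts took. The event labels are paraphrased from the list above; the variable and function names are only illustrative.

```python
from datetime import datetime

# Events and timestamps copied from the timeline above (labels paraphrased).
events = [
    ("Sentinel A restarted", "13:22:25"),
    ("Redis-0 restarted", "13:22:37"),
    ("Sentinel B restarted", "13:23:08"),
    ("Sentinel C restarted", "13:23:47"),
    ("Redis-1 restarted", "13:23:41"),
    ("Redis-0 became master", "13:24:34"),
    ("Redis-1 became slave of Redis-0", "13:25:04"),
    ("Redis-1 became master", "13:25:10"),
    ("Redis-2 restarted", "13:25:15"),
]


def to_time(stamp: str) -> datetime:
    """Parse an HH:MM:SS timestamp (the date part is irrelevant here)."""
    return datetime.strptime(stamp, "%H:%M:%S")


# Chronological view of the rollout.
ordered = sorted(events, key=lambda event: to_time(event[1]))

# Seconds between the first and the last Sentinel restart.
sentinel_times = [to_time(stamp) for name, stamp in events if name.startswith("Sentinel")]
sentinel_window = (max(sentinel_times) - min(sentinel_times)).total_seconds()

for name, stamp in ordered:
    print(stamp, name)
print("Sentinel restart window (s):", sentinel_window)
```

All three Sentinels restarted within about a minute and a half, and Redis-1 restarted in the same window as the last Sentinel.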

There are some errors related to the redis-operator:

13:22:34 --- Error while getting masterIP : Failed to get info replication while querying redis instance " src="check.go:125"
13:22:34 --- Get slave of master failed, maybe this node is not ready, pod ip: " src="checker.go:163"
13:22:35 --- Make slave failed, slave ip: , master ip: 172.16.55.123 (Redis-2), error: dial tcp :6379: connect: connection refused" src="checker.go:167"
13:22:35 --- "error on object processing: dial tcp :6379: connect: connection refused" controller-id=redisfailover object-key=routes/redis-routes-sentinels operator=redisfailover service=kooper.controller src="controller.go:279"
...
13:25:35 --- "error on object processing: More than one master, fix manually"
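
For anyone who wants to confirm the "More than one master" state by hand, here is a minimal sketch that counts masters from the raw text of `INFO replication` collected from each pod. It is not the operator's own check; the helper names and the sample outputs are hypothetical, and only the `role:` field of the real `INFO replication` output is relied on.

```python
def parse_role(info_replication: str) -> str:
    """Extract the value of the `role:` field from INFO replication output."""
    for line in info_replication.splitlines():
        if line.startswith("role:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def find_masters(info_by_pod: dict[str, str]) -> list[str]:
    """Return the pods whose INFO replication output reports role:master."""
    return sorted(pod for pod, info in info_by_pod.items() if parse_role(info) == "master")


# Hypothetical outputs resembling the end state described in this issue.
sample = {
    "redis-0": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-1": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-2": "",  # pod not answering
}

masters = find_masters(sample)
if len(masters) > 1:
    print("More than one master:", ", ".join(masters))
```

The text for each pod could be gathered with `redis-cli INFO replication` run inside the pod, using the credentials from redis.conf.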

Here are the full logs: sentinels-logs.txt, redis-0-logs.txt, redis-1-logs.txt, redis-2-logs.txt, redis-operator-error-logs.txt

Scaling down replicas to 0 and scaling up again solved the issue.

Is there any way to avoid this? I wouldn't expect a rollout to require a manual fix, especially in a production environment.

txabman42 commented 1 year ago

I suggest having the operator re-create the Redis failover in this situation, similar to what this other Redis operator already does.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

spotahome / redis-operator

Failover not as expected - split brain ended with multiple masters #600
