spotahome / redis-operator

Redis Operator creates/configures/manages high availability redis with sentinel automatic failover atop Kubernetes.
Apache License 2.0

Failover not as expected - split brain ended with multiple masters #600

Closed txabman42 closed 11 months ago

txabman42 commented 1 year ago

I would like to know if you have faced a similar problem and if it could be related to the redis-operator.

I'm running Redis with Sentinel on K8S using spotahome/redis-operator, Helm chart version v3.2.5.

I have not been able to reproduce the problem, as similar upgrades went well. It looks like a split-brain during the rollout process. Redis-operator suggests a manual fix, which is not reassuring for production environments.

Context

I have 3 redis and 3 sentinel instances using the image redis:6.2.6-alpine.

redis.conf

slaveof 127.0.0.1 6379
port 6379
tcp-keepalive 60
save 900 1
save 300 10
user pinger -@all +ping on >pingpass
masterauth pass
requirepass pass

sentinel.conf

sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 5000
sentinel parallel-syncs mymaster 2

K8S PDB:

NAME                         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS
rfr-redis-routes-sentinels   2               N/A               1
rfs-redis-routes-sentinels   2               N/A               1
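
To make the numbers above concrete, here is a minimal sketch of the Sentinel arithmetic for this setup (3 sentinels, quorum 2, PDB min available 2). The function name and its parameters are made up for illustration. It only checks whether the sentinels that stay up during a voluntary disruption can still flag the master as down and authorize a failover. It does not model the split-brain itself.

```python
def sentinel_thresholds(total_sentinels: int, quorum: int, available: int) -> dict:
    """Check whether `available` sentinels can detect and authorize a failover.

    Hypothetical helper for illustration only, not part of redis-operator.
    """
    # A failover must be authorized by a majority of all known sentinels.
    majority = total_sentinels // 2 + 1
    return {
        "majority": majority,
        # The quorum is the number of sentinels that must agree the master is down.
        "can_flag_master_down": available >= quorum,
        # If the quorum is larger than the majority, the larger number applies.
        "can_authorize_failover": available >= max(quorum, majority),
    }


# Values from the sentinel.conf and PDB above: 3 sentinels, quorum 2,
# and at least 2 sentinels kept available during a rollout.
during_rollout = sentinel_thresholds(total_sentinels=3, quorum=2, available=2)
```

With these values both checks pass, so on paper the PDB leaves enough sentinels to fail over while one pod is being replaced.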

During an update of the image to v6.2.12, I ran into the following situation:

There are some errors related to the redis-operator:

Here are the full logs: sentinels-logs.txt, redis-0-logs.txt, redis-1-logs.txt, redis-2-logs.txt, redis-operator-error-logs.txt

Scaling down replicas to 0 and scaling up again solved the issue.

Is there any way to avoid this? I wouldn't expect to need a manual fix during a rollout, especially in a production environment.
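
For anyone who wants to check for the same state, here is a rough sketch of how the multiple-master condition could be detected from the output of `INFO replication` on each redis pod. The function names and the pod names `redis-0`, `redis-1`, `redis-2` are assumptions for illustration. The text is assumed to have been collected separately (for example with `redis-cli`), and this is not how the redis-operator itself performs the check.

```python
def parse_role(info_text: str) -> str:
    """Return the value of the `role:` field from INFO replication output."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return "unknown"


def find_masters(info_by_pod: dict[str, str]) -> list[str]:
    """Return the sorted names of the pods that report themselves as master."""
    return sorted(
        pod for pod, info in info_by_pod.items() if parse_role(info) == "master"
    )


# Hypothetical sample input: two pods claim the master role at the same time.
sample = {
    "redis-0": "# Replication\r\nrole:master\r\nconnected_slaves:0\r\n",
    "redis-1": "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.5\r\n",
    "redis-2": "# Replication\r\nrole:master\r\nconnected_slaves:1\r\n",
}
masters = find_masters(sample)
split_brain = len(masters) > 1
```

A healthy failover should yield exactly one entry in `masters`. More than one matches the state described in this issue.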

txabman42 commented 1 year ago

I suggest re-creating the redis failover in this case, similar to what this other redis operator already does.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 45 days with no activity.

github-actions[bot] commented 11 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.