Status: Closed — txabman42 closed this issue 11 months ago
I suggest re-creating the Redis failover here, similar to what this other redis operator is already doing.
This issue is stale because it has been open for 45 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I would like to know if you have faced a similar problem and whether it could be related to the redis-operator. I'm running Sentinel on K8S using spotahome/redis-operator, helm chart version v3.2.5. I have not been able to reproduce the problem, as similar upgrades went well. It looks like a split-brain during the rollout process. The redis-operator suggests a manual fix, which doesn't inspire much confidence for production environments.
Context

I have 3 Redis and 3 Sentinel instances using the image redis:6.2.6-alpine.

redis.conf
sentinel.conf

K8S PDB:
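The original PodDisruptionBudget manifest is not reproduced in this thread. Purely as an illustration (the resource name and selector labels below are assumptions, not the reporter's actual manifest), a PDB for a 3-replica Redis StatefulSet typically looks like:

```yaml
# Illustrative only: names and labels are assumed, not taken from the
# reporter's cluster.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
spec:
  minAvailable: 2        # with 3 replicas, at most one pod down at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: redis
```

With `minAvailable: 2`, voluntary disruptions (e.g. node drains) can only take down one of the three pods at a time; note that a PDB does not constrain how the operator itself sequences a rolling image update.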
During an update of the image to v6.2.12, I suffered the following situation:

```
Connection with master lost.
Caching the disconnected master state.
Discarding previously cached master state.
Connection with replica 172.16.89.182:6379 lost
Executing user requested FAILOVER of 'mymaster'
+new-epoch 7
...
+elected-leader master mymaster 172.16.55.123 6379
(redis-2) failover-end
```
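A split-brain like this shows up as the sentinels disagreeing about which address is the master. A minimal sketch of that check, assuming you have already collected each sentinel's view (e.g. by running `redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster` against each sentinel pod — the pod names and addresses below are illustrative, not from the logs above):

```python
# Sketch: flag sentinel disagreement from each sentinel's reported master.
# The sample data is illustrative; collect real views with redis-cli.

def reported_masters(master_views: dict[str, str]) -> set[str]:
    """Return the distinct master addresses reported by the sentinels.

    A healthy deployment yields exactly one address; more than one
    means the sentinels disagree about who the master is.
    """
    return set(master_views.values())


views = {
    "sentinel-0": "172.16.55.123:6379",
    "sentinel-1": "172.16.55.123:6379",
    "sentinel-2": "172.16.89.182:6379",  # hypothetical stale view
}

masters = reported_masters(views)
if len(masters) > 1:
    print(f"split-brain suspected, masters reported: {sorted(masters)}")
```

This only detects the disagreement; resolving it still requires a failover or, as described below, recycling the replicas.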
There are some errors related to the redis-operator. Here are the full logs: sentinels-logs.txt, redis-0-logs.txt, redis-1-logs.txt, redis-2-logs.txt, redis-operator-error-logs.txt
Scaling down replicas to 0 and scaling up again solved the issue.
Is there any way to avoid this? I wouldn't expect to need a manual fix during a rollout, especially in a production environment.