Funkydream opened this issue 11 months ago
Are you using containers? If so, maybe this fix will resolve your issue.
No, all Redis instances and Sentinels run on virtual machines. I reproduced this problem with redis-sentinel version 6.2.14.
I have a few ideas; please help me confirm them:

In sentinelVoteLeader(), if the following voting results are obtained in the same epoch, do all Sentinels have to wait for a newer epoch, or wait for the next round of voting after the election timeout (and then 2*failover-timeout), in the following situations?

Scenario 1 (vote cycle, all in epoch 45):
A voted for B, B voted for C, C voted for D, D voted for E, E voted for A

Scenario 2 (every Sentinel votes for itself, all in epoch 45):
A voted for A, B voted for B, C voted for C, D voted for D, E voted for E

For reference, here is sentinelVoteLeader():
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}
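To make the two scenarios above concrete, here is a small, self-contained sketch (not Redis source; the Sentinel names and vote tables are hypothetical) that counts same-epoch votes and checks them against the majority of voters (voters/2 + 1):

/* Standalone illustration only: five hypothetical Sentinels A..E vote in
 * the same epoch; check whether anyone reaches the majority (3 of 5). */
#include <stdio.h>
#include <string.h>

#define NUM_SENTINELS 5

static const char *ids[NUM_SENTINELS] = {"A", "B", "C", "D", "E"};

/* votes[i] is the candidate that Sentinel ids[i] voted for. */
static const char *countVotes(const char *votes[NUM_SENTINELS]) {
    int counters[NUM_SENTINELS] = {0};
    int quorum = NUM_SENTINELS / 2 + 1;   /* 3 out of 5 */
    int max_votes = 0, winner = 0;

    for (int i = 0; i < NUM_SENTINELS; i++)
        for (int j = 0; j < NUM_SENTINELS; j++)
            if (strcmp(votes[i], ids[j]) == 0) counters[j]++;

    for (int j = 0; j < NUM_SENTINELS; j++)
        if (counters[j] > max_votes) { max_votes = counters[j]; winner = j; }

    return max_votes >= quorum ? ids[winner] : NULL;
}

int main(void) {
    /* Scenario 1: vote cycle, all in epoch 45. */
    const char *cycle[NUM_SENTINELS] = {"B", "C", "D", "E", "A"};
    /* Scenario 2: every Sentinel votes for itself, all in epoch 45. */
    const char *self_votes[NUM_SENTINELS] = {"A", "B", "C", "D", "E"};

    const char *w1 = countVotes(cycle);
    const char *w2 = countVotes(self_votes);
    printf("cycle winner: %s\n", w1 ? w1 : "none");
    printf("self-vote winner: %s\n", w2 ? w2 : "none");
    return 0;
}

In both patterns every candidate ends up with exactly one vote, so nobody reaches the 3-vote majority in epoch 45; as far as I can tell, the election in that epoch simply fails and the Sentinels have to retry in a newer epoch after the timeout.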
Another scenario, on Sentinel E: the master's failover epoch is 45 but current_epoch has already grown to 100, and A, B, C and D all voted for E in epoch 45. Here is the vote-counting loop in sentinelGetLeader():
/* Count other sentinels votes */
di = dictGetIterator(master->sentinels);
while((de = dictNext(di)) != NULL) {
    sentinelRedisInstance *ri = dictGetVal(de);

    if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
        sentinelLeaderIncr(counters,ri->leader);
}
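And to show what I think happens in that scenario, here is a second standalone sketch (again not Redis source; the vote data are hypothetical) that applies the same leader_epoch == current_epoch condition as the loop above: A, B, C and D all voted for E, but the votes were cast in epoch 45 while E's current_epoch has already grown to 100, so none of them are counted.

/* Standalone illustration only: mimic the counting condition above with
 * the epoch values from the scenario. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct peer {
    const char *leader;     /* who this Sentinel voted for */
    uint64_t leader_epoch;  /* epoch in which the vote was cast */
};

int main(void) {
    uint64_t current_epoch = 100;   /* E's epoch has already grown to 100 */
    struct peer peers[] = {         /* A, B, C, D all voted for E... */
        {"E", 45}, {"E", 45}, {"E", 45}, {"E", 45},   /* ...back in epoch 45 */
    };
    int votes_for_e = 0;

    for (size_t i = 0; i < sizeof(peers)/sizeof(peers[0]); i++) {
        /* Same filter as the quoted loop: only votes cast in the epoch we
         * are counting for are taken into account. */
        if (peers[i].leader != NULL && peers[i].leader_epoch == current_epoch)
            votes_for_e += strcmp(peers[i].leader, "E") == 0;
    }

    /* Prints 0: a 4-of-5 majority exists, but none of it is counted, so E
     * would end up reporting -failover-abort-not-elected. */
    printf("current_epoch=%llu, counted votes for E: %d\n",
           (unsigned long long) current_epoch, votes_for_e);
    return 0;
}

If that reading is correct, a majority gathered in an older epoch is effectively invisible to the counting loop.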
Describe the bug
I've deployed 100 Redis groups (version 6.0.9) across 2 servers, each group with 1 master and 1 slave, monitored by 5 Sentinels. When all the masters suddenly lost network connectivity (they are deployed in the same DC), the Sentinel group started trying to fail over. I configured the following parameters for the Sentinels:
When I noticed that the recovery time was longer than expected, I checked the Sentinel logs. I found that some sentinels got a majority of votes, but still reported "-failover-abort-not-elected". Here is one sample.
To reproduce
I actively disconnected the master DC's network several times and found that this problem still reproduces.
Expected behavior
I can accept 2-3 failover attempts before a successful recovery, but a Sentinel getting enough votes and still declaring the election failed confuses me. The failover process should start once a Sentinel has won a majority.
Additional information
I suspect this is due to rapidly growing epochs over a short period of time and I'm looking for evidence in the code.