overvenus opened 1 year ago
Maybe not relevant, just for references, region 1965 received 1 vote from the dead store 1
[2023/07/28 07:54:59.375 +00:00] [INFO] [raft.rs:2230] ["received votes response"] [term=9] [type=MsgRequestVoteResponse] [approvals=2] [rejections=0] [from=1967] [vote=true] [raft_id=1968] [peer_id=1968] [region_id=1965]
Members:
region_epoch { conf_ver: 59 version: 109 } peers { id: 1967 store_id: 1 } peers { id: 1968 store_id: 216 } peers { id: 2783 store_id: 45 }"] [legacy=false] [changes="[change_type: AddLearnerNode peer { id: 2990 store_id: 4 role: Learner }]"] [peer_id=1968] [region_id=1965]
I can't find any clue from the log.
I think the snapshot-related stuff was "ok" in this case. The key is to find out why PD decided to tombstone 1965 on store 216 (tikv-3); this only happens when another, newer region covers the range of 1965, but from the log I could not find any such region.
@overvenus I suggest we add an info log in PD that prints any overlapping regions while building the range tree, and wait for this problem to occur again?
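The suggested log could look something like the sketch below. This is a minimal illustration with hypothetical `Region` fields and a plain sorted-slice scan, not PD's actual range-tree code: while inserting regions in start-key order, any adjacent pair whose ranges intersect gets logged.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Region is a simplified stand-in for PD's region metadata
// (hypothetical fields, for illustration only).
type Region struct {
	ID       uint64
	StartKey []byte
	EndKey   []byte // empty means +infinity
}

// overlaps reports whether two key ranges intersect.
func overlaps(a, b Region) bool {
	aEndsBeforeB := len(a.EndKey) > 0 && bytes.Compare(a.EndKey, b.StartKey) <= 0
	bEndsBeforeA := len(b.EndKey) > 0 && bytes.Compare(b.EndKey, a.StartKey) <= 0
	return !aEndsBeforeB && !bEndsBeforeA
}

// logOverlaps sorts regions by start key and logs every adjacent pair
// whose ranges intersect, mimicking the proposed info log while
// building the range tree.
func logOverlaps(regions []Region) {
	sort.Slice(regions, func(i, j int) bool {
		return bytes.Compare(regions[i].StartKey, regions[j].StartKey) < 0
	})
	for i := 1; i < len(regions); i++ {
		prev, cur := regions[i-1], regions[i]
		if overlaps(prev, cur) {
			fmt.Printf("[INFO] overlapping regions: %d and %d\n", prev.ID, cur.ID)
		}
	}
}

func main() {
	// Hypothetical keys; region IDs borrowed from this report.
	logOverlaps([]Region{
		{ID: 1965, StartKey: []byte("a"), EndKey: []byte("m")},
		{ID: 2991, StartKey: []byte("h"), EndKey: []byte("z")},
	})
}
```

If 2991 really did cover part of 1965's range, a log like this would have recorded the pair at range-tree build time.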
Besides adding logs, can we check whether all regions have a quorum of replicas alive before exiting unsafe recovery?
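A minimal sketch of such a check, using hypothetical types rather than PD's actual region/store structures: count how many of each region's peers sit on alive stores, and refuse to declare recovery finished while any region lacks a majority.

```go
package main

import "fmt"

// Peer and RegionInfo are simplified stand-ins for PD's metadata
// (hypothetical, for illustration only).
type Peer struct {
	ID      uint64
	StoreID uint64
}

type RegionInfo struct {
	ID    uint64
	Peers []Peer
}

// hasQuorum reports whether a majority of the region's replicas
// live on stores that are still up.
func hasQuorum(r RegionInfo, aliveStores map[uint64]bool) bool {
	alive := 0
	for _, p := range r.Peers {
		if aliveStores[p.StoreID] {
			alive++
		}
	}
	return alive > len(r.Peers)/2
}

// regionsWithoutQuorum returns the IDs of regions that would be left
// without a Raft quorum; unsafe recovery should not report success
// while this list is non-empty.
func regionsWithoutQuorum(regions []RegionInfo, aliveStores map[uint64]bool) []uint64 {
	var bad []uint64
	for _, r := range regions {
		if !hasQuorum(r, aliveStores) {
			bad = append(bad, r.ID)
		}
	}
	return bad
}

func main() {
	// Peer/store IDs borrowed from the membership log in this report;
	// stores 45 and 216 are assumed alive.
	alive := map[uint64]bool{45: true, 216: true}
	regions := []RegionInfo{
		{ID: 1965, Peers: []Peer{{1967, 1}, {1968, 216}, {2783, 45}}},
	}
	fmt.Println(regionsWithoutQuorum(regions, alive)) // prints "[]"
}
```

With such a gate, the recovery command would keep reporting the offending region instead of exiting while 1965/2991 was still uncreatable.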
Bug Report
On a 4-node TiKV cluster, we stopped two nodes and then started unsafe recovery using pd-ctl. After unsafe recovery, we found lots of PD server timeouts, and it turned out there was a region that failed to be created.
Failed TiKV: tikv-0 and tikv-1
Alive TiKV: tikv-2 and tikv-3
Original region ID: 1965
New region ID: 2991
Timeline:
There are actually two questions:
Note: the issue was found on a multi-rocksdb cluster, but I think it may affect single-rocksdb clusters too.
Log:
What did you do?
See above.
What version of PD are you using (`pd-server -V`)?
v7.1.0