tikv / pd

Placement driver for TiKV
Apache License 2.0

Unsafe recovery partially fills key range hole #6859

Open overvenus opened 1 year ago

overvenus commented 1 year ago

Bug Report

On a 4-node TiKV cluster, we stopped two nodes and then started unsafe recovery using pd-ctl. After unsafe recovery, we found lots of PD server timeout errors, and it turned out that a region had failed to be created.

Failed TiKV: tikv-0 and tikv-1
Alive TiKV: tikv-2 and tikv-3
Original region ID: 1965
New region ID: 2991

Timeline:

  1. 1965 on tikv-3 sends a snapshot to tikv-2.
  2. Starts unsafe recovery.
  3. Snapshot sent.
  4. 1965 on tikv-3 becomes tombstone.
  5. A peer of 1965 is created on tikv-2.
  6. PD asks tikv-2 to create 2991 to cover the key range of 1965.
  7. 2991 fails to be created because 1965 has already been created on tikv-2.
  8. PD considers unsafe recovery finished.

There are actually two questions:

  1. Why does PD finish unsafe recovery while there is a key range hole?
  2. Why does PD tombstone 1965 in the first place? Stopping two nodes of a four-node cluster should not lose any replica's data completely.
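Regarding question 1: a defensive check before PD declares recovery finished would be to scan the reported regions for gaps in key-space coverage. The sketch below is purely illustrative (the `Region` type and `findHoles` are not PD's actual code); it assumes TiKV's convention that an empty end key means "up to +infinity":

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Region is a simplified stand-in for PD's region metadata;
// field names are illustrative, not PD's real types.
type Region struct {
	StartKey, EndKey []byte // empty EndKey means +infinity
}

// findHoles sorts regions by StartKey and returns every gap in
// key-space coverage. A gap like 1965's old range is exactly what
// a pre-exit check in unsafe recovery could flag.
func findHoles(regions []Region) [][2][]byte {
	sort.Slice(regions, func(i, j int) bool {
		return bytes.Compare(regions[i].StartKey, regions[j].StartKey) < 0
	})
	var holes [][2][]byte
	cur := []byte("") // coverage must start at the minimum (empty) key
	for _, r := range regions {
		if bytes.Compare(r.StartKey, cur) > 0 {
			holes = append(holes, [2][]byte{cur, r.StartKey})
		}
		if len(r.EndKey) == 0 {
			return holes // this region covers everything up to +infinity
		}
		if bytes.Compare(r.EndKey, cur) > 0 {
			cur = r.EndKey
		}
	}
	return append(holes, [2][]byte{cur, nil}) // tail is not covered
}

func main() {
	regions := []Region{
		{StartKey: []byte(""), EndKey: []byte("b")},
		{StartKey: []byte("c"), EndKey: []byte("")}, // nothing covers ["b","c")
	}
	for _, h := range findHoles(regions) {
		fmt.Printf("hole: %q -> %q\n", h[0], h[1])
	}
}
```

If such a scan found any hole, PD could refuse to mark the recovery plan as finished instead of reporting success.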

Note: the issue was found on a multi-RocksDB cluster, but I think it may affect single-RocksDB clusters too.

Log:

What did you do?

See above.

What version of PD are you using (pd-server -V)?

v7.1.0

v01dstar commented 1 year ago

Maybe not relevant, just for reference: region 1965 received 1 vote from the dead store 1.

[2023/07/28 07:54:59.375 +00:00] [INFO] [raft.rs:2230] ["received votes response"] [term=9] [type=MsgRequestVoteResponse] [approvals=2] [rejections=0] [from=1967] [vote=true] [raft_id=1968] [peer_id=1968] [region_id=1965]

Members:

region_epoch { conf_ver: 59 version: 109 } peers { id: 1967 store_id: 1 } peers { id: 1968 store_id: 216 } peers { id: 2783 store_id: 45 }"] [legacy=false] [changes="[change_type: AddLearnerNode peer { id: 2990 store_id: 4 role: Learner }]"] [peer_id=1968] [region_id=1965]
v01dstar commented 1 year ago

I can't find any clue from the log.

I think the snapshot-related stuff was "ok" in this case; the key is to find out why PD decided to tombstone 1965 on store 216 (tikv-3). This only happens when another, newer region covers the range of 1965, but I could not find such a region in the log.

@overvenus I suggest we add some info logs in PD that print out any overlapping regions while building the range tree, and wait for this problem to occur again?
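Something like the following, perhaps. This is only a sketch of the diagnostic, not PD's actual range-tree code (the real tree lives in PD's core package and is more involved); `Region`, `overlaps`, and `insertWithLog` are made-up names:

```go
package main

import (
	"bytes"
	"log"
)

// Region is a simplified stand-in for PD's region metadata.
type Region struct {
	ID               uint64
	StartKey, EndKey []byte // empty EndKey means +infinity
}

// overlaps reports whether a and b share any keys, treating an empty
// EndKey as +infinity, mirroring TiKV's key-range convention.
func overlaps(a, b Region) bool {
	aBeforeB := len(a.EndKey) > 0 && bytes.Compare(a.EndKey, b.StartKey) <= 0
	bBeforeA := len(b.EndKey) > 0 && bytes.Compare(b.EndKey, a.StartKey) <= 0
	return !aBeforeB && !bBeforeA
}

// insertWithLog adds r to a flat, unoptimized region set, logging and
// evicting every region the newcomer overlaps. The log line is the
// diagnostic proposed above: it would show which "newer" region
// displaced 1965 during the range-tree build.
func insertWithLog(set []Region, r Region) []Region {
	var kept []Region
	for _, old := range set {
		if overlaps(old, r) {
			log.Printf("region %d overlaps incoming region %d; replacing", old.ID, r.ID)
			continue
		}
		kept = append(kept, old)
	}
	return append(kept, r)
}

func main() {
	set := insertWithLog(nil, Region{ID: 1965, StartKey: []byte("b"), EndKey: []byte("c")})
	set = insertWithLog(set, Region{ID: 2991, StartKey: []byte("b"), EndKey: []byte("c")})
	log.Printf("regions in tree: %d", len(set))
}
```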

overvenus commented 1 year ago

Besides adding logs, can we check whether all regions have a quorum of replicas alive before exiting unsafe recovery?
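The exit gate being suggested could look roughly like this. All types and names here are illustrative, not PD's real API; the point is just that recovery should not be declared finished while any region lacks a live majority:

```go
package main

import "fmt"

// Peer and Region are simplified stand-ins for PD's metadata types.
type Peer struct{ StoreID uint64 }
type Region struct {
	ID    uint64
	Peers []Peer
}

// hasQuorum reports whether a strict majority of the region's peers
// sit on stores that are still alive.
func hasQuorum(r Region, alive map[uint64]bool) bool {
	up := 0
	for _, p := range r.Peers {
		if alive[p.StoreID] {
			up++
		}
	}
	return up*2 > len(r.Peers)
}

// safeToFinish is the proposed gate: refuse to declare unsafe recovery
// done while any region lacks a live quorum.
func safeToFinish(regions []Region, alive map[uint64]bool) bool {
	for _, r := range regions {
		if !hasQuorum(r, alive) {
			return false
		}
	}
	return true
}

func main() {
	// Survivor stores, in the spirit of the report: 216 (tikv-3) and 45.
	alive := map[uint64]bool{216: true, 45: true}
	r := Region{ID: 1965, Peers: []Peer{{1}, {216}, {45}}}
	fmt.Println(safeToFinish([]Region{r}, alive)) // 2 of 3 peers alive
}
```

With a check like this, the scenario above (a region whose range ends up uncovered after tombstoning) would keep recovery in a pending state rather than reporting success.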