tikv / pd

Placement driver for TiKV
Apache License 2.0

After setting placement rules, two stores take more than 30 min to transfer their leaders #4439

Open mayjiang0203 opened 2 years ago

mayjiang0203 commented 2 years ago

Bug Report

What did you do?

1. Deploy a 3-zone cluster with 2 stores in each zone.
2. Use BR to import 200 GB of data.
3. Set placement rules so that zone1 and zone2 hold voters and zone3 holds followers.
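
For illustration, step 3 could be expressed with rules shaped roughly like the ones built below. This is only a sketch: the field names follow PD's placement rule JSON format, but the rule IDs, the replica counts, and the `zone` label key are assumptions, not the exact rules used in this test.

```python
import json


def zone_rule(rule_id, role, count, zones):
    """Build one PD-style placement rule as a dict.

    rule_id, count, and the "zone" label key are illustrative assumptions.
    """
    return {
        "group_id": "pd",
        "id": rule_id,
        "start_key": "",
        "end_key": "",
        "role": role,    # "voter" peers may become leader, "follower" peers may not
        "count": count,  # number of replicas this rule asks for (assumed value)
        "label_constraints": [
            {"key": "zone", "op": "in", "values": zones},
        ],
    }


# zone1/zone2 hold voters, zone3 holds a follower (counts are hypothetical).
rules = [
    zone_rule("voters-zone1-zone2", "voter", 2, ["zone1", "zone2"]),
    zone_rule("follower-zone3", "follower", 1, ["zone3"]),
]

print(json.dumps(rules, indent=2))
```

Because a `follower` peer is not allowed to be leader, any region leader still sitting in zone3 has to move once rules like these take effect.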

What did you expect to see?

After the placement rules are set successfully, the stores in zone3 should evict their region leaders within a very short time.

What did you see instead?

image

What version of PD are you using (pd-server -V)?

/ # /pd-ctl -V
Release Version: v5.3.0
Edition: Community
Git Commit Hash: fe6fab9268d2d6fd34cd22edd1cf31a302e8dc5c
Git Branch: heads/refs/tags/v5.3.0
UTC Build Time: 2021-11-22 10:50:47

Logs and monitoring data can be retrieved from MinIO for testbed testbed-oltp-hm-qhv25.

bufferflies commented 2 years ago

This needs to be tested again to rule out an incorrect placement rule.

mayjiang0203 commented 2 years ago

/severity moderate

bufferflies commented 2 years ago

Leader scheduling became slow because the operators need to occupy store limit. In the beginning, the checker only creates transfer-leader operators, which don't occupy any store limit. After about one minute, however, the rule checker finds another solution that combines transfer leader and move peer into a single operator to decrease the operator count. This combined operator needs to occupy some store limit, so it is slow. In the end, the store will still go offline successfully. Perhaps we should consider giving such stores some priority so that they can go offline more quickly.
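
To make the explanation above concrete, here is a rough back-of-the-envelope model. It is not PD's actual implementation, only a sketch: the function name, the per-minute budget, the unthrottled operator rate, and the leader count are all made-up numbers for illustration.

```python
import math


def minutes_to_drain(leader_count, ops_per_minute, cost_per_op, limit_per_minute):
    """Estimate how many minutes it takes to move all leaders off a store.

    leader_count:     leaders that must leave the store (hypothetical).
    ops_per_minute:   operators the checker could issue per minute if nothing
                      throttled it (hypothetical).
    cost_per_op:      store-limit units one operator consumes; 0 models a pure
                      transfer-leader operator, 1 models a merged operator
                      that also moves a peer.
    limit_per_minute: store-limit budget per minute (hypothetical).
    """
    if cost_per_op == 0:
        # Pure transfer-leader operators bypass the store limit entirely.
        per_minute = ops_per_minute
    else:
        # Merged operators are capped by the store-limit budget.
        per_minute = min(ops_per_minute, limit_per_minute // cost_per_op)
    if per_minute <= 0:
        raise ValueError("no operator can be scheduled with this budget")
    return math.ceil(leader_count / per_minute)


# Same workload, two kinds of operators (all numbers are illustrative).
pure = minutes_to_drain(3000, ops_per_minute=500, cost_per_op=0, limit_per_minute=15)
merged = minutes_to_drain(3000, ops_per_minute=500, cost_per_op=1, limit_per_minute=15)
print(pure, merged)
```

With these assumed numbers the pure transfer-leader path finishes in a few minutes, while the merged operators are bound by the store-limit budget and take far longer, which matches the behaviour described here.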

bufferflies commented 2 years ago

I think this should be classified as a feature enhancement rather than a bug: the rule checker merges some steps to minimize the operator count, and the store limit is already fully utilized.