Open Lily2025 opened 5 months ago
/assign JmPotato
/type enhancement
According to the logs, after detecting repeated elections, the PD leader attempts to resign as etcd leader. However, the IO hang stalls etcd's internal state machine, so the resignation cannot complete. As a result, a PD leader cannot be properly elected, and the etcd leader cannot be transferred to another healthy node. The current workaround is to kill the unhealthy node directly to force a re-election.
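To illustrate the failure mode and the workaround described above, here is a minimal, generic sketch of a "resign with a deadline, otherwise escalate" pattern. It is not PD or etcd code: `resign_or_escalate`, `resign`, `escalate`, and `timeout_s` are hypothetical names introduced only for this example, and the hung resignation is simulated with a blocking wait.

```python
import threading
from typing import Callable


def resign_or_escalate(resign: Callable[[], object],
                       escalate: Callable[[], object],
                       timeout_s: float) -> str:
    """Run resign() in a background thread and wait at most timeout_s seconds.

    Returns "resigned" if resign() finished without raising in time.
    Otherwise calls escalate() (for example, terminating the process so the
    remaining members can elect a new leader) and returns "escalated".
    """
    done = threading.Event()
    errors = []

    def worker() -> None:
        try:
            resign()
        except Exception as exc:  # a failed resignation is treated like a hang
            errors.append(exc)
        finally:
            done.set()

    # Daemon thread: a resign() that never returns must not block shutdown.
    threading.Thread(target=worker, daemon=True).start()

    if done.wait(timeout_s) and not errors:
        return "resigned"
    escalate()
    return "escalated"


if __name__ == "__main__":
    # Simulate a resignation stuck behind hung disk IO: this wait never returns.
    hung_resign = threading.Event().wait
    # Hypothetical escalation hook; a real one might exit the process.
    result = resign_or_escalate(hung_resign,
                                lambda: print("escalating: would kill the node"),
                                timeout_s=0.1)
    print(result)
```

The point of the sketch is that the deadline is enforced outside the call that may hang, so the fallback still runs when the resignation itself is blocked on IO.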
Bug Report
What did you do?
1. Run TPC-C.
2. Inject an IO hang on the PD leader, lasting 5 minutes.
What did you expect to see?
QPS recovers within 2 minutes.
What did you see instead?
QPS drops to zero while the IO hang is injected on the PD leader.
IO delay: 500ms
What version of PD are you using (pd-server -V)?
./pd-server -V
Release Version: v8.1.0
Edition: Community
Git Commit Hash: fca469ca33eb5d8b5e0891b507c87709a00b0e81
Git Branch: HEAD
UTC Build Time: 2024-05-09 02:15:45
2024-05-17T01:42:48.440+0800