QPS drops to zero, and PITR/CDC lag is also affected, when an IO hang or IO delay (500ms or 1s) is injected on the PD leader, because transferring the etcd leader fails

tikv / pd

Placement driver for TiKV


#8204

Open Lily2025 opened 5 months ago

Lily2025 commented 5 months ago

Bug Report

What did you do?

1. Run TPC-C.
2. Inject an IO hang on the PD leader, lasting 5 minutes.

What did you expect to see?

QPS recovers within 2 minutes.

What did you see instead?

QPS drops to zero while the IO hang is injected on the PD leader.


io delay 500ms

What version of PD are you using (`pd-server -V`)?

./pd-server -V
Release Version: v8.1.0
Edition: Community
Git Commit Hash: fca469ca33eb5d8b5e0891b507c87709a00b0e81
Git Branch: HEAD
UTC Build Time: 2024-05-09 02:15:45
2024-05-17T01:42:48.440+0800

Lily2025 commented 5 months ago

/assign JmPotato
/type enhancement

JmPotato commented 5 months ago

According to the logs, after detecting repeated elections, the PD leader attempts to resign as etcd leader. However, because the IO hang affects etcd's internal state machine, this operation cannot complete. As a result, a PD leader cannot be properly elected, and the etcd leader cannot be transferred to another healthy node either. The current workaround is to kill the unhealthy node directly to force a re-election.

Lily2025 commented 3 months ago

(screenshot attached)