pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

CDC lag up to 7min when injecting ha-pdleader-io-delay-1s-last-for-5m, though pd leader transferred soon #11569

Open fubinzh opened 2 months ago

fubinzh commented 2 months ago

What did you do?

  1. TiDB cluster with CDC changefeed running normally
  2. Inject ha-pdleader-io-delay-1s-last-for-5m (from 2024-09-04 12:37:58 to 2024-09-04 12:42:58)
  3. Check cluster status and CDC lag

What did you expect to see?

CDC lag should be <2min

What did you see instead?

The PD leader transferred shortly after the chaos injection, but CDC had no leader for ~5min, and CDC lag rose to ~7min.

2024-09-04 12:38:01 
{"container":"pd","log":"[raft.go:771] [\"646d794e12a46726 became leader at term 4\"]","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","pod":"upstream-pd-0"}

2024-09-04 12:38:26 
{"container":"pd","log":"[server.go:1804] [\"PD leader is ready to serve\"] [leader-name=upstream-pd-0]","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","pod":"upstream-pd-0"}

2024-09-04 12:38:26 
{"container":"pd","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-0]","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","pod":"upstream-pd-0"}

2024-09-04 12:38:26 
{"container":"pd","log":"[server.go:1704] [\"start to campaign PD leader\"] [campaign-leader-name=upstream-pd-0]","namespace":"uds-cdc-br-scenario-tps-7624385-1-510","level":"INFO","pod":"upstream-pd-0"}
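To make the timing easier to read, the log timestamps above can be lined up against the injection window from step 2. The script below is only a sketch and not part of the original report: `RECORDS`, `timeline`, and the constant names are illustrative, and the records are copied from the log lines above. It suggests the Raft election finished about 3 s after the injection started and PD reported ready about 28 s after it, well inside the 5-minute window.

```python
import json
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Chaos window as reported in step 2 of the issue.
INJECT_START = datetime.strptime("2024-09-04 12:37:58", FMT)
INJECT_END = datetime.strptime("2024-09-04 12:42:58", FMT)

NS = "uds-cdc-br-scenario-tps-7624385-1-510"

# (timestamp, raw JSON record) pairs copied from the issue, in the order shown there.
# Raw strings keep the \" escapes intact so json.loads can parse them.
RECORDS = [
    ("2024-09-04 12:38:01",
     r'{"container":"pd","log":"[raft.go:771] [\"646d794e12a46726 became leader at term 4\"]","namespace":"' + NS + r'","level":"INFO","pod":"upstream-pd-0"}'),
    ("2024-09-04 12:38:26",
     r'{"container":"pd","log":"[server.go:1804] [\"PD leader is ready to serve\"] [leader-name=upstream-pd-0]","namespace":"' + NS + r'","level":"INFO","pod":"upstream-pd-0"}'),
    ("2024-09-04 12:38:26",
     r'{"container":"pd","log":"[server.go:1730] [\"campaign PD leader ok\"] [campaign-leader-name=upstream-pd-0]","namespace":"' + NS + r'","level":"INFO","pod":"upstream-pd-0"}'),
    ("2024-09-04 12:38:26",
     r'{"container":"pd","log":"[server.go:1704] [\"start to campaign PD leader\"] [campaign-leader-name=upstream-pd-0]","namespace":"' + NS + r'","level":"INFO","pod":"upstream-pd-0"}'),
]


def timeline(records, start):
    """Return (seconds since start, pod, message) tuples ordered by time."""
    rows = []
    for ts, raw in records:
        entry = json.loads(raw)
        # The message is the first ["..."] group inside the "log" field.
        match = re.search(r'\["([^"]*)"\]', entry["log"])
        message = match.group(1) if match else entry["log"]
        offset = (datetime.strptime(ts, FMT) - start).total_seconds()
        rows.append((int(offset), entry["pod"], message))
    # sorted() is stable, so records with equal timestamps keep their input order.
    return sorted(rows, key=lambda row: row[0])


if __name__ == "__main__":
    window = int((INJECT_END - INJECT_START).total_seconds())
    print(f"injection window: {window}s")
    for offset, pod, message in timeline(RECORDS, INJECT_START):
        print(f"+{offset:>3}s  {pod}  {message}")
```

The remaining gap (CDC without a leader for ~5min) is not visible in these PD logs; it would need the corresponding CDC logs, which the report does not include.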


Versions of the cluster

/cdc version
Release Version: v8.2.0
Git Commit Hash: 498e3d3fd1cda4817e70ea50d27dcb157956349d
Git Branch: HEAD
UTC Build Time: 2024-07-03 02:52:36
Go Version: go version go1.21.10 linux/amd64
Failpoint Build: false

flowbehappy commented 1 day ago

We will investigate this issue further in the new-architecture TiCDC (https://github.com/pingcap/ticdc). It won't be fixed in the current repo.