pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

Checkpoint stuck for over 1.5h, resulting in changefeed failure #11335

Open fubinzh opened 1 week ago

fubinzh commented 1 week ago

What did you do?

  1. Five changefeeds (Kafka sink, simple protocol) were running, each replicating a subset of the tables.
  2. At first, workloads were running for all changefeeds, and the CDC lag was not stable.
  3. Later, a workload was running for only two changefeeds: 1k-odd and 1k-even.
  4. Wait and check the lag status.
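
For anyone reproducing step 4, here is a minimal sketch of how checkpoint lag could be derived from a checkpoint TSO. It relies on the TiDB TSO layout (physical Unix milliseconds in the high bits, an 18-bit logical counter in the low bits). The function names, the sample timestamps, and the 90-minute threshold are hypothetical and chosen only for illustration. This is not TiCDC's own code.

```go
package main

import (
	"fmt"
	"time"
)

// physicalShiftBits is the number of low bits a TiDB TSO reserves for the
// logical counter; the remaining high bits hold Unix milliseconds.
const physicalShiftBits = 18

// tsoPhysicalMillis extracts the physical part (Unix ms) of a TSO.
func tsoPhysicalMillis(tso uint64) int64 {
	return int64(tso >> physicalShiftBits)
}

// checkpointLag returns how far the checkpoint is behind the given wall-clock time.
func checkpointLag(checkpointTso uint64, now time.Time) time.Duration {
	return now.Sub(time.UnixMilli(tsoPhysicalMillis(checkpointTso)))
}

// isStuck reports whether the lag exceeds a caller-chosen threshold.
// The threshold is an assumption for this sketch, not a TiCDC setting.
func isStuck(lag, threshold time.Duration) bool {
	return lag > threshold
}

func main() {
	// Hypothetical values: a checkpoint taken at 12:00 UTC, observed 95 minutes later.
	ckptTime := time.Date(2024, 6, 14, 12, 0, 0, 0, time.UTC)
	checkpointTso := uint64(ckptTime.UnixMilli()) << physicalShiftBits
	now := ckptTime.Add(95 * time.Minute)

	lag := checkpointLag(checkpointTso, now)
	fmt.Printf("lag=%s stuck=%v\n", lag, isStuck(lag, 90*time.Minute))
}
```

In practice the checkpoint TSO would come from whatever the changefeed status reports. The sketch only covers the arithmetic.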

What did you expect to see?

  1. Lag for all changefeeds should be normal. At the least, for changefeeds with a running workload, the lag should return to normal after the incremental scan finishes.

What did you see instead?

Changefeed xxx5k did not return to normal. Its checkpoint was stuck for more than 1.5 hours, and the changefeed finally entered the failed state.

Versions of the cluster

Release Version: v8.2.0-alpha
Git Commit Hash: 90da67d1af284ff7408140969b7d5fb2ecc43bea
Git Branch: heads/refs/tags/v8.2.0-alpha
UTC Build Time: 2024-06-14 11:38:03
Go Version: go version go1.21.10 linux/amd64
Failpoint Build: false

fubinzh commented 5 days ago

/severity moderate

The throughput was up to 300MB/s when the issue happened.