pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB).

stability (cdc) improve the stability of TiCDC #10343

Open zhangjinpeng87 opened 6 months ago

zhangjinpeng87 commented 6 months ago

How do we define the stability of TiCDC?

As a distributed system, TiCDC should continuously provide service with predictable replication lag under any reasonable failure scenario: a single TiCDC node failure, a single upstream TiKV node failure, a single upstream PD node failure, a planned rolling upgrade/restart of the TiCDC or upstream TiDB cluster, a temporary network partition between one TiCDC node and the other TiCDC nodes, etc. TiCDC should recover the replication lag by itself quickly and tolerate these different resilience cases.
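For concreteness, "replication lag" below can be read as the gap between the current wall clock and the physical time encoded in a changefeed's checkpoint ts. Here is a minimal sketch in Go, assuming the standard TiDB TSO layout (a physical millisecond timestamp in the bits above an 18-bit logical counter); the sample TSO value and the wiring to fetch it are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// physicalTimeFromTSO extracts the wall-clock component of a TiDB TSO:
// the bits above the low 18 carry a Unix timestamp in milliseconds,
// the low 18 bits a logical counter.
func physicalTimeFromTSO(tso uint64) time.Time {
	return time.UnixMilli(int64(tso >> 18))
}

// replicationLag is the gap between the current wall clock and the
// changefeed's checkpoint ts. (Hypothetical wiring: the checkpoint TSO
// would come from e.g. the output of `cdc cli changefeed query`.)
func replicationLag(checkpointTSO uint64) time.Duration {
	return time.Since(physicalTimeFromTSO(checkpointTSO))
}

func main() {
	// Hypothetical checkpoint TSO, purely for illustration.
	var checkpointTSO uint64 = 445370906075729921
	fmt.Printf("replication lag: %s\n", replicationLag(checkpointTSO))
}
```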

Expected replication lag SLO under different cases

| Category | Case Description | Expected Behavior |
| --- | --- | --- |
| Planned Operations | Rolling upgrade/restart of TiCDC | replication lag < 5s |
| Planned Operations | Scale-in/scale-out of TiCDC | replication lag < 5s |
| Planned Operations | Rolling upgrade/restart of upstream PD | replication lag < 5s |
| Planned Operations | Rolling upgrade/restart of upstream TiKV | replication lag < 10s |
| Planned Operations | Scale-in/scale-out of upstream TiDB | replication lag < 5s |
| Planned Operations | Rolling upgrade/restart of downstream Kafka brokers | begin to sink ASAP once Kafka resumes |
| Planned Operations | Rolling upgrade/restart of downstream MySQL/TiDB | begin to sink ASAP once MySQL/TiDB resumes |
| Unplanned Failures | Single TiCDC node (a random one) permanent failure | replication lag < 1min |
| Unplanned Failures | Single TiCDC node temporary failure for 5 minutes | replication lag < 1min |
| Unplanned Failures | PD leader permanent failure, or temporary failure for 5 minutes | replication lag < 5s |
| Unplanned Failures | Network partition between one TiCDC node and the PD leader for 5 minutes | replication lag < 5s |
| Unplanned Failures | Network partition between one TiCDC node and other TiCDC nodes | replication lag < 5s |
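Each of these targets can be checked by a small fault-injection harness that polls the lag for the duration of the fault plus a recovery budget, and flags the first violation. A minimal sketch; `getLag` is a hypothetical callback that would wrap whatever lag source the test uses (e.g. the checkpoint-based computation above):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// assertLagSLO polls replication lag once per second for the lifetime
// of ctx (the fault-injection window plus a recovery budget) and
// returns an error the first time the lag exceeds the SLO threshold.
func assertLagSLO(ctx context.Context, getLag func() (time.Duration, error), slo time.Duration) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil // window elapsed with no violation
		case <-ticker.C:
			lag, err := getLag()
			if err != nil {
				return fmt.Errorf("query lag: %w", err)
			}
			if lag > slo {
				return fmt.Errorf("SLO violated: lag %s exceeds %s", lag, slo)
			}
		}
	}
}

func main() {
	// Example: watch for 30s while a rolling restart is in progress.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	getLag := func() (time.Duration, error) { return 2 * time.Second, nil } // stub
	if err := assertLagSLO(ctx, getLag, 5*time.Second); err != nil {
		fmt.Println(err)
	}
}
```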

Principles for prioritizing TiCDC stability issues

We handle TiCDC stability issues according to the following priorities:

Task Tracking

flowbehappy commented 6 months ago

https://github.com/pingcap/tiflow/issues/10157 @zhangjinpeng1987 resolved ts gets stuck issue
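For illustration only (this is not the fix tracked in #10157): a stuck resolved ts can be surfaced by a watchdog that flags when the value stops advancing for longer than some tolerance. A minimal sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"time"
)

// stuckDetector tracks the last resolved ts observed and how long it
// has gone without advancing.
type stuckDetector struct {
	lastTs      uint64
	lastAdvance time.Time
	tolerance   time.Duration
}

// observe records a resolved-ts sample and reports whether the value
// has now been stuck for longer than the tolerance.
func (d *stuckDetector) observe(ts uint64, now time.Time) bool {
	if ts > d.lastTs {
		d.lastTs = ts
		d.lastAdvance = now
		return false
	}
	return now.Sub(d.lastAdvance) > d.tolerance
}

func main() {
	start := time.Now()
	d := &stuckDetector{tolerance: 30 * time.Second, lastAdvance: start}
	fmt.Println(d.observe(100, start))                     // false: ts advanced
	fmt.Println(d.observe(100, start.Add(10*time.Second))) // false: within tolerance
	fmt.Println(d.observe(100, start.Add(40*time.Second))) // true: stuck for >30s
}
```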