pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

TiCDC cluster suffers a round robin owner election during rolling update #3529

Closed amyangfei closed 2 years ago

amyangfei commented 3 years ago

What did you do?

  1. Create a TiCDC cluster with multiple nodes, such as 7 nodes.
  2. Perform a rolling update of the TiCDC cluster.

What did you expect to see?

Replication continues normally while the TiCDC cluster is being rolling-updated.

What did you see instead?

Suppose the owner is restarted first. Ownership then passes to each following TiCDC node in turn, and each newly elected owner is itself restarted soon afterwards by the rolling update. (This is caused by the way election works in etcd: it simply selects the election key with the smallest revision as the campaign winner.)
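The election chain described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not TiCDC or etcd code: `Node`, `elect`, and `rolling_update` are hypothetical names, each node is assumed to rejoin immediately with a fresh (larger) revision after its restart, and the restart order is assumed to match the order of the election keys, which is the worst case.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    revision: int  # revision of this node's election key (toy model)


def elect(candidates):
    """Pick the campaign winner: the key with the smallest revision."""
    return min(candidates, key=lambda n: n.revision)


def rolling_update(nodes, order):
    """Restart nodes in `order`; return every owner elected along the way."""
    live = {n.name: n for n in nodes}
    next_rev = max(n.revision for n in nodes) + 1
    owner = elect(live.values())
    history = [owner.name]
    for name in order:
        # Assumption: a restarted node loses its old key and campaigns
        # again with a new key, which gets a larger revision.
        live[name] = Node(name, next_rev)
        next_rev += 1
        new_owner = elect(live.values())
        if new_owner.name != owner.name:
            owner = new_owner
            history.append(owner.name)
    return history


# Seven nodes; cdc-1 holds the smallest revision and is the initial owner.
nodes = [Node(f"cdc-{i}", i) for i in range(1, 8)]
order = [n.name for n in nodes]  # owner first, then in key order
history = rolling_update(nodes, order)
print(history)
print("owner changes:", len(history) - 1)
```

In this model every restart hits the current owner, so a 7-node cluster goes through 7 owner changes during a single rolling update.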

The initialization phase of a TiCDC owner can take a long time. It involves many steps, including initializing every existing changefeed. Initializing a changefeed creates a downstream sink; imagine creating a Kafka sink and running some verification jobs, which is heavy work.

As a result, a lot of time is wasted on owner initialization on each TiCDC node. Worse, it is possible that no owner finishes initialization before it is restarted, so the replication checkpoint can stall during the rolling update, and the longer the rolling update takes, the larger the replication lag can grow.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.3.0

TiCDC version (execute cdc version):

master@https://github.com/pingcap/ticdc/commit/fe92b89a0ea05a066a61a94d96440f38504d170e

Brainstorming

amyangfei commented 3 years ago

After discussion, we decided to make this a feature, for two reasons

overvenus commented 2 years ago

Since this issue doesn't break the TiCDC boundary, changing to severity/moderate.

3AceShowHand commented 2 years ago

https://github.com/pingcap/tiup/pull/1972

This is solved by the PR above, which adds support for a strategy that upgrades the owner last.
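To see why upgrading the owner last helps, here is a minimal sketch that compares the two restart orders in a toy election model. It is not the tiup implementation: `count_owner_changes` and `owner_last` are hypothetical helpers, and the model assumes that the smallest election-key revision wins and that a restarted node campaigns again with a newer, larger revision.

```python
def count_owner_changes(revisions, order):
    """Count owner changes while restarting nodes in `order`.

    `revisions` maps node name -> election key revision (toy model);
    the node with the smallest revision is the owner.
    """
    live = dict(revisions)
    next_rev = max(live.values()) + 1
    owner = min(live, key=live.get)
    changes = 0
    for name in order:
        # Assumption: a restarted node campaigns again with a newer key.
        live[name] = next_rev
        next_rev += 1
        new_owner = min(live, key=live.get)
        if new_owner != owner:
            owner = new_owner
            changes += 1
    return changes


def owner_last(names, owner):
    """Move the current owner to the end of the restart order."""
    return [n for n in names if n != owner] + [owner]


revisions = {f"cdc-{i}": i for i in range(1, 8)}
names = list(revisions)
owner = min(revisions, key=revisions.get)

naive_changes = count_owner_changes(revisions, names)
owner_last_changes = count_owner_changes(revisions, owner_last(names, owner))
print(naive_changes, owner_last_changes)
```

With the owner restarted last, the other nodes have already been upgraded by the time ownership moves, so in this model the owner changes once instead of once per node.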