pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

mysql changefeed blocked by abnormal kafka changefeed #4241

Closed Tammyxia closed 2 years ago

Tammyxia commented 2 years ago

What did you do?

What did you expect to see?

The MySQL changefeed should still work normally.

What did you see instead?

The cdc log has the expected ERROR, plus a WARN:

[ERROR] [changefeed.go:118] ["an error occurred in Owner"] [changefeedID=kafka-task-3] [error="[CDC:ErrKafkaNewSaramaProducer]new sarama producer: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"]

[WARN] [client.go:226] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=20m39.000245874s]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

5.4.0

TiCDC version (execute cdc version):

5.4.0
Tammyxia commented 2 years ago

Another thing is that, in this situation, the open files count is continuously increasing: [screenshot]

3AceShowHand commented 2 years ago

When the brokers go offline, cdc tries to close the producer.

[Screenshot: Screen Shot 2022-01-06 at 7 01 27 PM]

As shown in the picture above, closing the syncClient is quite time-consuming.

This can block the owner for about 1 minute.

3AceShowHand commented 2 years ago

Maybe we should close the syncClient in an asynchronous way?

3AceShowHand commented 2 years ago

This is constrained by the following: [screenshot]

3AceShowHand commented 2 years ago

For protocols that send checkpoint ts, this will happen.

At most, it blocks the owner for about 1 minute.

amyangfei commented 2 years ago

> When the brokers go offline, cdc tries to close the producer.
>
> As shown in the picture above, closing the syncClient is quite time-consuming.
>
> This can block the owner for about 1 minute.

Good catch. If there are multiple Kafka changefeeds and they are closed one by one, the total time cost will be even larger.

3AceShowHand commented 2 years ago

This problem causes the owner and processor to be blocked for a period of time, but it only happens when all Kafka brokers are shut down. In a real-world production Kafka cluster, it would be rare for all brokers to be shut down at the same time.

To solve this problem:

3AceShowHand commented 2 years ago

After #4359 is merged, the blocking won't last too long (no more than 2 minutes), so the severity is changed to minor.

nongfushanquan commented 2 years ago

//label affects-5.3

nongfushanquan commented 2 years ago

/label affects-5.3