pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

mysql changefeed blocked by abnormal kafka changefeed #4241

Closed Tammyxia closed 2 years ago

Tammyxia commented 2 years ago

What did you do?

What did you expect to see?

The MySQL changefeed should still work normally.

What did you see instead?

The cdc log has the expected ERROR, plus a WARN:

[ERROR] [changefeed.go:118] ["an error occurred in Owner"] [changefeedID=kafka-task-3] [error="[CDC:ErrKafkaNewSaramaProducer]new sarama producer: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)"]

[WARN] [client.go:226] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=20m39.000245874s]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

5.4.0

TiCDC version (execute cdc version):

5.4.0
Tammyxia commented 2 years ago

Another thing is that, in this situation, the open files count is continuously increasing: [screenshot]

3AceShowHand commented 2 years ago

When the brokers go offline, cdc tries to close the producer.

[Screenshot: Screen Shot 2022-01-06 at 7 01 27 PM]

As shown in the picture above, closing the syncClient is quite time-consuming.

This can block the owner for about 1 minute.

3AceShowHand commented 2 years ago

Maybe we should close the syncClient in an asynchronous way?

3AceShowHand commented 2 years ago

This is constrained by the following: [screenshot]

3AceShowHand commented 2 years ago

For protocols that send checkpoint ts, this will happen.

At most, it blocks the owner for about 1 minute.

amyangfei commented 2 years ago

> When the brokers go offline, cdc tries to close the producer.
>
> As shown in the picture above, closing the syncClient is quite time-consuming.
>
> This can block the owner for about 1 minute.

Good catch. If there are multiple Kafka changefeeds and they are closed one by one, the total time cost will be even larger.

3AceShowHand commented 2 years ago

This problem causes the owner and processor to be blocked for a period of time, but it only happens when all Kafka brokers are shut down. In a real-world production Kafka cluster, it would be rare for all brokers to be shut down at the same time.

To solve this problem:

3AceShowHand commented 2 years ago

After #4359 is merged, the blocking won't last too long (no more than 2 minutes), so the severity is changed to minor.

nongfushanquan commented 2 years ago

//label affects-5.3

nongfushanquan commented 2 years ago

/label affects-5.3