pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB).
Apache License 2.0

etcd worker tick stuck up to 2 minutes periodically when cdc can't connect to Kafka server #11340

Closed: asddongmen closed this issue 2 months ago

asddongmen commented 3 months ago

What did you do?

  1. Create a changefeed that sinks to Kafka.
  2. Inject a network partition between CDC and the Kafka server.

What did you expect to see?

The changefeed gets stuck, but the processor does not get stuck.

What did you see instead?

[2024/06/21 19:03:54.617 +08:00] [WARN] [etcd_worker.go:293] ["EtcdWorker reactor tick took too long"] [duration=50.822188739s] [role=processor]
[2024/06/21 20:04:52.975 +08:00] [WARN] [processor.go:540] ["processor tick took too long"] [namespace=default] [changefeed=kafka-task] [capture=4556c672-6e93-4054-a705-c598fcf64b84] [duration=2m1.979526934s]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

(paste TiKV version here)

TiCDC version (execute cdc version):

master
asddongmen commented 3 months ago

The severity of this issue is set to major because, if a CDC instance runs multiple changefeeds, the other changefeeds are delayed as well, resulting in a periodic lag of about 2 minutes.

The root cause of this issue is that when a Kafka sink encounters an error, it retries inside the sinkManager and calls initSinkFactory (https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/sinkmanager/manager.go#L262) to reconnect to Kafka. This function holds m.sinkFactory.Lock() until it returns. When Kafka cannot be reached, the function blocks for about 2 minutes, until the underlying call fails with "kafka: client has run out of available brokers to talk to: dial tcp 10.99.219.92:9092: i/o timeout", and only then exits.
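A minimal, self-contained Go sketch of the blocking pattern described above, not the actual tiflow code: the names sinkManager, initSinkFactory, needsStuckCheck, and m.sinkFactory mirror the source, while dialKafka and the shortened timeout are simplifications assumed for the demo.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// sinkManager is a simplified stand-in for the sinkManager in
// cdc/processor/sinkmanager; only the locking behaviour relevant to this
// issue is modelled here.
type sinkManager struct {
	sinkFactory struct {
		sync.Mutex
		f *sinkFactory // nil until a (re)connect succeeds
	}
}

type sinkFactory struct{}

// dialKafka is a hypothetical stand-in for the Kafka client setup. When the
// broker is unreachable it blocks until the dial times out (about 2 minutes
// in the report) and then fails; the sleep below is shortened for the demo.
func dialKafka() (*sinkFactory, error) {
	time.Sleep(5 * time.Second)
	return nil, errors.New("kafka: client has run out of available brokers to talk to")
}

// initSinkFactory mirrors the pattern described above: the factory lock is
// held for the whole reconnect attempt, including the blocking dial.
func (m *sinkManager) initSinkFactory() error {
	m.sinkFactory.Lock()
	defer m.sinkFactory.Unlock() // released only after dialKafka returns

	f, err := dialKafka()
	if err != nil {
		return err
	}
	m.sinkFactory.f = f
	return nil
}

// needsStuckCheck also takes the factory lock, so while initSinkFactory is
// stuck inside dialKafka this call (and the processor Tick that issued it)
// blocks as well.
func (m *sinkManager) needsStuckCheck() bool {
	m.sinkFactory.Lock()
	defer m.sinkFactory.Unlock()
	return m.sinkFactory.f != nil
}

func main() {
	m := &sinkManager{}

	go func() { _ = m.initSinkFactory() }() // the sink manager's retry path

	time.Sleep(100 * time.Millisecond) // let the reconnect grab the lock first

	start := time.Now()
	m.needsStuckCheck() // the processor Tick path; stalls until the dial fails
	fmt.Println("needsStuckCheck returned after", time.Since(start))
}
```

Running this prints a duration close to the dial time, showing that the caller of needsStuckCheck is held for as long as the reconnect attempt blocks.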

Meanwhile, in another goroutine, the processor calls GetTableSpanStatus on every Tick (https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/processor.go#L346) to collect table progress information and advance the overall task progress.

When the downstream of the changefeed is Kafka, GetTableSpanStatus internally calls the sinkManager's needsStuckCheck function (https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/sinkmanager/manager.go#L318). This function also tries to acquire m.sinkFactory.Lock(), but since that lock is already held by initSinkFactory, the processor Tick stays blocked inside needsStuckCheck until the lock is released. A relatively simple fix is therefore to stop needsStuckCheck from acquiring m.sinkFactory.Lock(), so that no blocking occurs; one possible shape of that change is sketched below.
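Continuing the toy sketch above (same package, with "sync/atomic" added to its imports), a minimal illustration of that direction; the atomic flag and its name are assumptions for illustration, not the actual patch.

```go
// sinkManagerFixed shows one possible shape of the fix: needsStuckCheck no
// longer touches m.sinkFactory.Lock(), so it cannot be blocked by a reconnect
// that is stuck dialing Kafka.
type sinkManagerFixed struct {
	sinkFactory struct {
		sync.Mutex
		f *sinkFactory
	}
	// sinkIsMQ is set once when the changefeed is created and never changes
	// afterwards, so it can be read without taking the factory lock.
	// (Illustrative assumption; the real check is simplified here.)
	sinkIsMQ atomic.Bool
}

func (m *sinkManagerFixed) needsStuckCheck() bool {
	// No m.sinkFactory.Lock() here: even while initSinkFactory is blocked for
	// ~2 minutes reconnecting to Kafka, the processor Tick is not held up.
	return m.sinkIsMQ.Load()
}
```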

asddongmen commented 3 months ago

Based on advice from @hicqu: once https://github.com/pingcap/tiflow/pull/9852 is merged, the needsStuckCheck mechanism can be removed entirely, which also resolves this issue.

seiya-annie commented 3 months ago

/found customer