Closed asddongmen closed 2 months ago
The severity of this issue is set to major because, when there are multiple changefeeds in cdc, the other changefeeds are delayed as well, resulting in a cyclical lag of about 2 minutes.
The root cause of this issue is that when a Kafka sink encounters an error, it is retried within the sinkManager, which then calls initSinkFactory to attempt to reconnect to Kafka: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/sinkmanager/manager.go#L262
Internally, this function acquires m.sinkFactory.Lock() and holds the lock until the function exits.
When Kafka cannot be reached, the function blocks for about 2 minutes and exits only after the underlying call fails with "kafka: client has run out of available brokers to talk to: dial tcp 10.99.219.92:9092: i/o timeout".
Meanwhile, in another goroutine, the Processor calls GetTableSpanStatus at each Tick to collect table progress information and advance the overall task progress: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/processor.go#L346
When the downstream of the changefeed is Kafka, GetTableSpanStatus calls the needsStuckCheck function of sinkManager: https://github.com/pingcap/tiflow/blob/1252979421fc83ffa2a1548d981e505f7fc0b909/cdc/processor/sinkmanager/manager.go#L318
This function tries to acquire m.sinkFactory.Lock(), but since the lock is already held by initSinkFactory, the Processor Tick blocks in needsStuckCheck until the lock is released.
Therefore, a relatively simple solution is to make the needsStuckCheck function no longer acquire m.sinkFactory.Lock(), so that no blocking occurs.
Based on advice from @hicqu, after the merge of https://github.com/pingcap/tiflow/pull/9852, the needsStuckCheck mechanism can be removed, which can also resolve this issue.
/found customer
What did you do?
What did you expect to see?
The changefeed may get stuck, but the processor should not be stuck.
What did you see instead?
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
Upstream TiKV version (execute tikv-server --version):
TiCDC version (execute cdc version):