Changefeed lag reached 1398s when an I/O hang lasting 20 minutes was injected on the TiCDC owner

Lily2025 commented 5 months ago

What did you do?

1. Run sysbench with subType: "oltp_read_write", tableNum: 64, tableSize: 70000000, threads: 32.
2. Inject an I/O hang on the TiCDC owner lasting 20 minutes.
   Chaos start time: 2024-04-02 04:22:11
   Chaos end time: 2024-04-02 04:42:11

What did you expect to see?

Changefeed lag stays below 5 minutes.

What did you see instead?

Changefeed lag reached 1398s while the 20-minute I/O hang was injected on the TiCDC owner.
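To put the reported numbers side by side, here is a minimal sketch that derives the chaos window from the timestamps above and compares the observed lag with the expected bound. The variable names are illustrative only; the values are taken directly from this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Chaos window as reported in the reproduction steps.
chaos_start = datetime.strptime("2024-04-02 04:22:11", FMT)
chaos_end = datetime.strptime("2024-04-02 04:42:11", FMT)
chaos_seconds = int((chaos_end - chaos_start).total_seconds())

expected_max_lag = 5 * 60   # expectation from this report: lag below 5 minutes
observed_lag = 1398         # peak changefeed lag reported, in seconds

# How far the observed lag overshoots the expectation, and how much of it
# extends beyond the length of the injected hang itself.
excess_over_expectation = observed_lag - expected_max_lag
lag_beyond_chaos_window = observed_lag - chaos_seconds

print(chaos_seconds, excess_over_expectation, lag_beyond_chaos_window)
```

The peak lag is roughly the length of the hang plus a few minutes, which fits a node that made no progress for the whole chaos window.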

Versions of the cluster

./cdc version
Release Version: v8.1.0-alpha
Git Commit Hash: 4024a44f010bc230fa4814826851e925370d88ae
Git Branch: heads/refs/tags/v8.1.0-alpha
UTC Build Time: 2024-04-01 13:11:50
Go Version: go version go1.21.6 linux/amd64
Failpoint Build: false

Lily2025 commented 5 months ago

/remove-area dm
/area ticdc

flowbehappy commented 5 months ago

By design. I suggest we address it in the long term.

fubinzh commented 5 months ago

/severity moderate

asddongmen commented 4 months ago

TiCDC uses disk to sort the data received from upstream TiKV. If a CDC server experiences an IO hang or slow IO issue, it cannot process the data in a timely manner. This leads to an increase in changefeed lag. Perhaps in the future, we might be able to detect disk IO issues and schedule the tables on this CDC node to other nodes to resolve the problem. But for now, it is by design and not a bug to be solved. cc @fubinzh @Lily2025

Workaround: If an I/O issue occurs on a CDC node, you can shut down that node. The tables on it will then be transferred to another CDC node for replication.
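Since TiCDC does not detect disk I/O problems itself (per the comment above), an operator applying this workaround needs some external signal that a node's disk is hung. The following is a rough sketch of one way to get that signal, not TiCDC functionality: it times a small fsync'd write in a given directory and treats a missed deadline as a hang. The function names `probe_write_latency` and `io_looks_hung`, the 5-second timeout, and the choice of directory to probe are all assumptions made for illustration.

```python
import os
import tempfile
import threading
import time


def probe_write_latency(directory, timeout_s=5.0, payload=b"x" * 4096):
    """Return the latency in seconds of a small fsync'd write in `directory`.

    Returns None if the write did not complete within `timeout_s`, or if it
    failed with an error (both are treated as "unhealthy" by the caller).
    """
    result = {}

    def _write():
        start = time.monotonic()
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, payload)
                os.fsync(fd)  # force the data to disk so a hang is visible
            finally:
                os.close(fd)
                os.unlink(path)
        except OSError:
            return  # leave `result` empty: reported as unhealthy
        result["latency"] = time.monotonic() - start

    # Run the write in a daemon thread so a hung disk cannot block the caller.
    worker = threading.Thread(target=_write, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result.get("latency")


def io_looks_hung(directory, timeout_s=5.0):
    """Hypothetical health check: True if the probe timed out or failed."""
    return probe_write_latency(directory, timeout_s) is None


if __name__ == "__main__":
    # Assumption: probe the system temp directory; in practice this would be
    # the disk that the CDC node uses for sorting.
    target = tempfile.gettempdir()
    print("hung" if io_looks_hung(target) else "healthy")
```

If the check reports a hang on the node's sort disk, the operator can shut that node down as described in the workaround so its tables move to another node.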

pingcap / tiflow