pingcap / tiflow

This repo maintains DM (a data migration platform) and TiCDC (change data capture for TiDB)
Apache License 2.0

add index can not sync to downstream after simulate network partition between upstream and downstream #11502

Open Lily2025 opened 3 months ago

Lily2025 commented 3 months ago

What did you do?

1. Restore data for primary and secondary.
2. Create changefeed and set bdr role for primary and secondary.
3. Run sysbench on primary and secondary.
4. Add index a on primary.
5. Simulate network partition between upstream and downstream:

faultType: network_partition
selector: tc-ticdc(all)_to_cdc-downstream-tc-tidb(all)
warmUpTime: 1m
period: "@every 5m"
faultDuration: 3m
faultTotalRunTime: 30m

log: ticdc-0.zip

What did you expect to see?

After the fault recovers, the add index DDL should be synced to downstream.

What did you see instead?

After the fault recovers, the add index DDL is not synced to downstream.

upstream: img_v3_02dv_81a75d54-84f7-4e5c-bd9d-1be59944696g

downstream: img_v3_02dv_d08e0164-c428-4181-9197-5163da7b86ag

img_v3_02dv_be3ba049-65f3-4947-b874-23c2e09a2e5g

logs: [2024/08/20 13:55:42.098 +08:00] [INFO] [ddl_sink.go:258] ["Execute DDL succeeded"] [namespace=default] [changefeed=ticdc-task1] [DDL="{\"StartTs\":451971109864079467,\"CommitTs\":451971110139330738,\"Query\":\"ALTER TABLE sbtest5 ADD INDEX index_test_1724132823395(c)\",\"TableInfo\":{\"id\":325,\"name\":{\"O\":\"sbtest5\",\"L\":\"sbtest5\"},\"charset\":\"utf8mb4\",\"collate\":\"utf8mb4_bin\",\"cols\":[{\"id\":1,\"name\":{\"O\":\"id\",\"L\":\"id\"},\

[2024/08/20 13:56:35.529 +08:00] [WARN] [mysql_ddl_sink.go:152] ["Execute DDL with error, retry later"] [startTs=451971109864079467] [ddl="ALTER TABLE sbtest5 ADD INDEX index_test_1724132823395(c)"] [namespace=default] [changefeed=ticdc-task1] [error="dial tcp 10.101.73.218:4000: operation was canceled"]

Versions of the cluster

./cdc version
Release Version: v8.3.0-alpha
Git Commit Hash: e3c75b756bedcaa39285ddd6d370b3877d7433d3
Git Branch: heads/refs/tags/v8.3.0-alpha
UTC Build Time: 2024-08-19 11:36:49
Go Version: go1.21.10
Failpoint Build: false
2024-08-20T10:25:02.344+0800

current status of DM cluster (execute query-status <task-name> in dmctl)

No response

Lily2025 commented 3 months ago

/remove-area dm /area ticdc

Lily2025 commented 3 months ago

/assign sdojjy

sdojjy commented 3 months ago

As the logs below show, TiCDC executes the add index DDL asynchronously and waits 10s. The background execution had not finished after 10s, so TiCDC advanced the checkpoint ts anyway, but the DDL was never submitted to the downstream TiDB because of the network partition.

[2024/08/20 13:55:32.098 +08:00] [INFO] [async_ddl.go:51] ["async exec add index ddl start"] [changefeedID=default/ticdc-task1] [commitTs=451971110139330738] [ddl="ALTER TABLE `sbtest5` ADD INDEX `index_test_1724132823395`(`c`)"]

[2024/08/20 13:55:42.098 +08:00] [INFO] [async_ddl.go:87] ["async add index ddl is still running"] [changefeedID=default/ticdc-task1] [commitTs=451971110139330738] [ddl="ALTER TABLE `sbtest5` ADD INDEX `index_test_1724132823395`(`c`)"]

[2024/08/20 13:56:35.530 +08:00] [ERROR] [async_ddl.go:57] ["async exec add index ddl failed"] [changefeedID=default/ticdc-task1] [commitTs=451971110139330738] [ddl="ALTER TABLE `sbtest5` ADD INDEX `index_test_1724132823395`(`c`)"]

fubinzh commented 3 months ago

/severity major

flowbehappy commented 1 week ago

We will investigate this issue further on the new-architecture TiCDC (https://github.com/pingcap/ticdc). It won't be fixed in the current repo.