matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0
1.79k stars 277 forks source link

[Bug]:mo_cdc: Data consistency error #18964

Closed heni02 closed 4 weeks ago

heni02 commented 1 month ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

a77a7350e

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

测试场景:上游mo运行tpcc100仓1000并发测试15分钟,测试停止后1个小时验证下游mysql数据是否和上游数据一致 测试数据:tpcc 100仓一共10个表 测试现象:使用tpcc验证一致性的几个sql验证下游mysql数据返回结果和mo上游不一致 测试环境:上游mo minio 3cn,下游mysql单机

MO验证sql返回结果:(以下是预期结果)

企业微信截图_6a6532d5-00a6-4fcc-bc3d-16be48877dfd

mysql验证sql返回结果:(预期应该返回0行)

企业微信截图_d6bc35de-4bc3-4ab0-bd41-65fc9cd09ada 企业微信截图_0dada306-090a-4283-8f63-0cb8066372cb 企业微信截图_d69773c4-a18d-4108-b2e1-5d266b57d7db

同步任务信息:

企业微信截图_1c9a5c48-f5f3-4d28-8138-8394f092be73

mo log: http://10.222.6.1/explore?panes=%7B%22IjV%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-cdc-test%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221727150587995%22,%22to%22:%221727157760260%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1.mo使用mo-load-data工具加载tpcc100仓数据
 a.修改mo-load-data/access.yml文件,更新三个参数,具体值联系我
 b.执行命令加载数据:./load.sh -h 10.222.6.6 -c cases/02_from_cos/tpcc_100 -r -m
2.创建下游数据tpcc的10个表,ddl:https://github.com/matrixorigin/mo-load-data/tree/main/ddl/tpcc_no_partition
3.创建同步任务:
./mo_cdc task create --task-name "cdc_tpcc" --source-uri="mysql://dump:111@10.222.6.6:6001" --sink-type="mysql" --sink-uri="mysql://dump:111@10.222.1.129:3306"    --tables='tpcc_100.bmsql_config:test_cdc_db.bmsql_config,tpcc_100.bmsql_customer:test_cdc_db.bmsql_customer,tpcc_100.bmsql_district:test_cdc_db.bmsql_district,tpcc_100.bmsql_history:test_cdc_db.bmsql_history,tpcc_100.bmsql_item:test_cdc_db.bmsql_item,tpcc_100.bmsql_new_order:test_cdc_db.bmsql_new_order,tpcc_100.bmsql_oorder:test_cdc_db.bmsql_oorder,tpcc_100.bmsql_order_line:test_cdc_db.bmsql_order_line,tpcc_100.bmsql_stock:test_cdc_db.bmsql_stock,tpcc_100.bmsql_warehouse:test_cdc_db.bmsql_warehouse' --level="account"  --account="sys"
4.mo端使用mo-tpcc工具进行测试
a.修改mo-tpcc/props.mo里的conn,warehouses=100,terminals=1000,runMins=15值
b.执行测试:./runBenchmark.sh props.mo
5.tpcc测试结束后等待一段时间执行以下验证sql
(Select w_id, w_ytd from bmsql_warehouse) except (select d_w_id, sum(d_ytd) from bmsql_district group by d_w_id);

(Select d_w_id, d_id, D_NEXT_O_ID - 1 from bmsql_district)  except (select o_w_id, o_d_id, max(o_id) from bmsql_oorder group by  o_w_id, o_d_id);

(Select d_w_id, d_id, D_NEXT_O_ID - 1 from bmsql_district)  except (select no_w_id, no_d_id, max(no_o_id) from bmsql_new_order group by no_w_id, no_d_id);

(select o_w_id, o_d_id, sum(o_ol_cnt) from bmsql_oorder  group by o_w_id, o_d_id) except (select ol_w_id, ol_d_id, count(ol_o_id) from bmsql_order_line group by ol_w_id, ol_d_id);

(select d_w_id, sum(d_ytd) from bmsql_district group by d_w_id)  except(Select w_id, w_ytd from bmsql_warehouse);

Additional information

No response

ck89119 commented 1 month ago

从截图上看,有两张表没有同步任何数据

image

看起来应该是同步卡住或者报错了,我本地复现下试试

ck89119 commented 1 month ago

照着issue中的操作顺序本地跑了一遍,发现bmsql_stock表数据条数不一致,需要增加log排查

MO:

image

下游mysql:

image
ck89119 commented 1 month ago

加log排查,对一张只有update操作的表,发现读取某个时间段的insert和delete条数不一致

{"level":"ERROR","time":"2024/09/26 16:48:03.802503 +0800","caller":"cdc/reader.go:250","msg":"tableReader(bmsql_district)[1727340075084572000-1, 1727340476266060000-1] addTailEndStatistics: insert record count=4856, delete record count=4879"}
ck89119 commented 1 month ago

经过与存储同事的沟通,是接收数据排序上两边不一致,修改中

heni02 commented 4 weeks ago

confirm,closed commit:48898692b

企业微信截图_be97befa-c55a-4676-bef3-ed2d99d3f12c 企业微信截图_dd9a465e-2e88-46d8-b22d-647084d598ec 企业微信截图_beef5b46-3a03-4219-bfd4-86d2486ce330