matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0
1.71k stars 265 forks source link

[Bug]: cn crashed by FATAL "update incr record returns invalid affected rows" during stability test on distributed mode #16982

Open aressu1985 opened 2 weeks ago

aressu1985 commented 2 weeks ago

Is there an existing issue for the same bug?

Branch Name

1.2-dev

Commit ID

20f9de6

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
3*PROXY 3C 7G
- OS type:
- Others:

Actual Behavior

cn crashed by FATAL "update incr record returns invalid affected rows" during stability test on distributed mode. [github@mo-srv-128 9567375395_nightly-20f9de6]$ kubectl -n mo-nightly-20f9de6-202406182258 logs -p stability-regression-dis-tp-cn-b4978 | grep FATAL {"level":"FATAL","time":"2024/06/18 18:26:37.398981 +0000","name":"cn-service.incrservice","caller":"incrservice/store_sql.go:225","msg":"BUG: update incr record returns invalid affected rows","uuid":"32653035-3834-3564-3430-656339346131","update-sql":"update mo_increment_columns set offset = 38135697 \n\t\t\t\twhere table_id = 272500 and col_name = 'mo_fake_pk_col' and offset = 38125697","account":0,"table":272500,"col":"mo_fake_pk_col","affected-rows":0,"select-all":"Cannot find tableID 272500 in table mo_increment_columns, accountid 0, txn: 01902c90ac257aafaf915b7d4e37989a/Active/S:1718734593997874275-1\tcol_name: mo_fake_pk_col, table_id: 272499\n\tcol_name: mo_fake_pk_col, table_id: 272412\n\tcol_name: mo_fake_pk_col, table_id: 272500\n\tcol_name: task_id, table_id: 272394\n\tcol_name: id, table_id: 272402\n\tcol_name: id, table_id: 272403\n\tcol_name: id, table_id: 272407\n\tcol_name: id, table_id: 272408\n\tcol_name: id, table_id: 272409\n\tcol_name: mo_fake_pk_col, table_id: 272411\n\tcol_name: user_id, table_id: 272415\n\tcol_name: account_id, table_id: 272417\n\tcol_name: version, table_id: 272417\n\tcol_name: role_id, table_id: 272419\n\tcol_name: function_id, table_id: 272424\n\tcol_name: configuration_id, table_id: 272426\n\tcol_name: proc_id, table_id: 272430\n\tcol_name: stage_id, table_id: 272432\n\tcol_name: mo_fake_pk_col, table_id: 272450\n\tcol_name: __mo_fake_pk_col, table_id: 272452\n\tcol_name: mo_fake_pk_col, table_id: 272454\n\tcol_name: mo_fake_pk_col, table_id: 272456\n\tcol_name: __mo_fake_pk_col, table_id: 272457\n\tcol_name: mo_fake_pk_col, table_id: 272463\n\tcol_name: mo_fake_pk_col, table_id: 272464\n\tcol_name: __mo_fake_pk_col, table_id: 272465\n\tcol_name: mo_fake_pk_col, table_id: 272466\n\tcol_name: mo_fake_pk_col, table_id: 272467\n\tcol_name: __mo_fake_pk_col, table_id: 272468\n\tcol_name: mo_fake_pk_col, table_id: 272469\n\tcol_name: mo_fake_pk_col, table_id: 272470\n\tcol_name: __mo_fake_pk_col, table_id: 272471\n\tcol_name: mo_fake_pk_col, table_id: 272472\n\tcol_name: mo_fake_pk_col, table_id: 272473\n\tcol_name: __mo_fake_pk_col, table_id: 272405\n\tcol_name: mo_fake_pk_col, table_id: 272406\n\tcol_name: __mo_fake_pk_col, table_id: 272410\n\tcol_name: cron_task_id, table_id: 272392\n\tcol_name: mo_fake_pk_col, table_id: 272476\n\tcol_name: task_id, table_id: 272390\n\tcol_name: mo_fake_pk_col, table_id: 272475\n","cost":"10m0.121435844s","ctx-done":true,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:225\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:136\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:160\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

mo log link: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22mY0%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-20f9de6-202406182258%5C%22%7D%20%7C%3D%20%60update%20incr%20record%20returns%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718733455244%22,%22to%22:%221718761993073%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. run a mo cluster with config in this issue
2. run tpch 10G loop test processes in one independant tenant
3. run tpcc 10 warehouse and 10 ternimals longrunnig test processes in one independant tenant, prepare mode
4. run sysbench mixed cases(insert/delete/update/select) longrunnig test processes with 75 terminals in one independant tenant,non-prepare mode
5. run another sysbench mixed cases(insert/delete/update/select) longrunnig test processe with  75 terminals in one independant tenant,non-prepare mode

Additional information

No response

ouyuanning commented 2 weeks ago

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22bg_%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-20f9de6-202406182258%5C%22%7D%20%7C%3D%20%60where%20table_id%20%3D%20272500%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718734199000%22,%22to%22:%221718734634000%22%7D%7D%7D&schemaVersion=1&orgId=1

image

从这里表面上看是: 1、01902c9066a377b183214f3e745d6c17 这个事务,在 【2024-06-19 02:16:19.590】先拿到了 38135697 这个offset。并且在【2024-06-19 02:16:26.715】先完成了update 01902c90ac257aafaf915b7d4e37989a 这个事务,在【2024-06-19 02:16:37.277】又拿到了 38135697 这个offset,但是update失败报错了

ouyuanning commented 2 weeks ago

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22bg_%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-20f9de6-202406182258%5C%22%7D%20%7C%3D%20%6001902c9066a377b183214f3e745d6c17%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718734199000%22,%22to%22:%221718734634000%22%7D%7D%7D&schemaVersion=1&orgId=1

实际上 01902c9066a377b183214f3e745d6c17 这个事务并没有执行成功 image

zhangxu19830126 commented 2 weeks ago

这个事务的select for update的锁被孤儿事务检测给释放了,所以update返回的行数是0,因为数据被其他事务修改了。如果遇到孤儿事务,需要重试。

zhangxu19830126 commented 2 weeks ago

fixed

aressu1985 commented 2 weeks ago

未解决 commitid: f27a435

mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22eRp%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-f27a435-202406210006%5C%22%7D%20%7C%3D%20%60update%20incr%20record%20retur%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718962572528%22,%22to%22:%221718963459801%22%7D%7D%7D&schemaVersion=1&orgId=1

zhangxu19830126 commented 1 week ago

这个不是孤儿事务的问题了,是update返回影响行数是0

ouyuanning commented 1 week ago

还没时间看

ouyuanning commented 1 week ago

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22eRp%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-f27a435-202406210006%5C%22%7D%20%7C%3D%20%60set%20offset%20%3D%20198750705%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718961612000%22,%22to%22:%221718963039000%22%7D%7D%7D&schemaVersion=1&orgId=1

image

表现为: 1、pod: hqsdj
2024/06/21 09:32:06.836913 事务01903a2366b37d2381c2290220763ef6 已经更新过一次了。

2、同一个pod 2024/06/21 09:32:10.601954 事务01903a23949c7a018fef05d88aa856f7 再去更新失败。但是不知道为啥10分钟之后才log error

ouyuanning commented 1 week ago

辛苦俊洪和罗飞帮忙看看

iamlinjunhong commented 6 days ago

在看其他 s-1

iamlinjunhong commented 5 days ago

在看其他 s-1

iamlinjunhong commented 4 days ago

在看其他 s-1

iamlinjunhong commented 3 days ago

在看其他 s-1

aressu1985 commented 3 days ago

update on 7.04: [github@mo-srv-128 mo-tpcc]$ kubectl -n mo-e4d45ce-202407032138 logs -p stability-regression-dis-tp-cn-kd47b | grep FATAL {"level":"FATAL","time":"2024/07/04 04:55:46.106871 +0000","name":"cn-service.incrservice","caller":"incrservice/store_sql.go:193","msg":"BUG: update incr record returns invalid affected rows","uuid":"31343732-3665-6432-3538-383965363232","update-sql":"update mo_increment_columns set offset = 174042324 \n\t\t\t\twhere table_id = 272500 and col_name = '__mo_fake_pk_col' and offset = 174032324","account":0,"table":272500,"col":"__mo_fake_pk_col","affected-rows":0,"cost":"1.910313124s","ctx-done":false,"stacktrace":"github.com/matrixorigin/matrixone/pkg/incrservice.(sqlStore).Allocate.func2\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:193\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:136\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:127\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}

https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22KrG%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-e4d45ce-202407032138%5C%22%7D%20%7C%3D%20%60update%20incr%20record%20returns%20invalid%20affected%20rows%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221720068567652%22,%22to%22:%221720069408320%22%7D%7D%7D&schemaVersion=1&orgId=1

iamlinjunhong commented 2 days ago

复现的结果是读不到 mo_increment_columns 数据,需要 @triump2020 一起看看

截屏2024-07-05 18 37 00