[Bug]: restore 2T data from snapshot report 'table does not exist'.

Ariznawlll commented 1 week ago

Is there an existing issue for the same bug?

[X] I have checked the existing issues.

Branch Name

main

Commit ID

cf5296b

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

After restoring for about 5 hours, the restore failed with "table does not exist":

mysql> restore account sys from snapshot sp01;

ERROR 1064 (HY000): SQL parser error: table "table_with_pk_index_for_write_1b" does not exist

Logs: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%223wf%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241016%5C%22%7D%20%7C%3D%20%60sp01%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729137600000%22,%22to%22:%221729155600000%22%7D%7D%7D&schemaVersion=1&orgId=1

A snapshot read can still read the data:

Expected Behavior

No response

Steps to Reproduce

Steps:
create snapshot sp01 for account sys ;
drop database big_data_test;
restore account sys from snapshot sp01;

big_data_test contains 28 tables holding about 2T of data in total. Schema of the table that could not be found, table_with_pk_index_for_write_1b:
create table if not exists big_data_test.table_with_pk_index_for_write_1B( id bigint primary key, col1 tinyint, col2 smallint, col3 int, col4 bigint, col5 tinyint unsigned, col6 smallint unsigned, col7 int unsigned, col8 bigint unsigned, col9 float, col10 double, col11 varchar(255), col12 Date, col13 DateTime, col14 timestamp, col15 bool, col16 decimal(16,6), col17 text, col18 json, col19 blob, col20 binary(255), col21 varbinary(255), col22 vecf32(3), col23 vecf32(3), col24 vecf64(3), col25 vecf64(3));

Additional information

No response

YANGGMM commented 1 week ago

triump2020 commented 1 week ago

The cause has been roughly located.

Ariznawlll commented 1 week ago

Today's restore hit the same problem. The steps were essentially the same as those described in this issue.

mysql> select git_version();
+---------------+
| git_version() |
+---------------+
| 29a0c5d       |
+---------------+
1 row in set (0.00 sec)

[Screenshot: 企业微信截图_31fc874c-78f7-469c-a56b-0ecef8a03476]

A snapshot read can still read the data:

Logs: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22-Ur%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241017%5C%22%7D%20%7C%3D%20%60txn%20is%20stale%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729256400000%22,%22to%22:%221729260000000%22%7D%7D%7D&schemaVersion=1&orgId=1

triump2020 commented 6 days ago

PR is on the way!

triump2020 commented 3 days ago

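The PR may have fixed only one of the causes of this problem. If it still occurs with high probability, there are probably other causes; we need to add more logging and reproduce it again.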

Ariznawlll commented 3 days ago

This afternoon, restoring 2T of data from a PITR reported the same error.

commit: 8d7e7b8
SQL executed for the restore: restore from pitr p01 "2024-10-22 03:24:18.547701"

Logs: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22MBn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241021%5C%22%7D%20%7C%3D%20%60txn%20is%20stale%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221729587600000%22,%22to%22:%221729591200000%22%7D%7D%7D&schemaVersion=1&orgId=1
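The Grafana link above packs its Loki query and time range into the URL-encoded `panes` parameter, which is hard to read by eye. The following is a minimal sketch, using only the Python standard library, of decoding it to see which log filter and time window the link points at. The function name `decode_explore_url` is hypothetical, and the sketch assumes the link holds a single pane with a single query, as this one does.

```python
import json
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

# The log link from the comment above, copied verbatim.
URL = (
    "https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22MBn%22:%7B%22datasource%22:%22loki%22,"
    "%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-big-data-20241021"
    "%5C%22%7D%20%7C%3D%20%60txn%20is%20stale%60%22,%22queryType%22:%22range%22,%22datasource%22:"
    "%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,"
    "%22range%22:%7B%22from%22:%221729587600000%22,%22to%22:%221729591200000%22%7D%7D%7D"
    "&schemaVersion=1&orgId=1"
)


def _ms_to_utc(ms: str) -> datetime:
    """Convert an epoch-milliseconds string to an aware UTC datetime."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


def decode_explore_url(url: str) -> dict:
    """Extract the query expression and UTC time range from an Explore link."""
    # parse_qs percent-decodes the value, leaving a JSON document.
    panes = json.loads(parse_qs(urlparse(url).query)["panes"][0])
    pane = next(iter(panes.values()))  # assumption: exactly one pane
    return {
        "expr": pane["queries"][0]["expr"],  # assumption: exactly one query
        "from_utc": _ms_to_utc(pane["range"]["from"]),
        "to_utc": _ms_to_utc(pane["range"]["to"]),
    }


if __name__ == "__main__":
    info = decode_explore_url(URL)
    print(info["expr"])
    print(info["from_utc"].isoformat(), "to", info["to_utc"].isoformat())
```

Decoded this way, the link is a search for the string `txn is stale` in the `mo-big-data-20241021` namespace over a one-hour window.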

triump2020 commented 2 days ago

Improved the logging again; we are reproducing the problem both online and offline. Some other cause is probably responsible for it.

triump2020 commented 1 day ago

Steps to reproduce:

Modify the following configuration:
[tn.Ckp]
flush-interval = "5s"
min-count = 1
scan-interval = "5s"
incremental-interval = "10s"
global-min-count = 3

Modify the program:
gcPartitionStateTicker = 5 * time.Second
gcPartitionStateTimer = 90 * time.Second

[Image: 1729753427198]

Run mo-service.
Run the following SQL:
1> create table tpcc_1000.bmsql_order_line

2>load data url s3option {'endpoint'='http://cos.ap-guangzhou.myqcloud.com','access_key_id'='AKIDUtG3skpK1hK7BSoClmsDVegirATitKiD','secret_access_key'='pXGubPAxolknvyzsqEoRBteLzmbSH3pb','bucket'='mo-load-guangzhou-1308875761','filepath'='tpcc_1000/order-line.csv', 'compression'=''} into table tpcc_1000.bmsql_order_line fields terminated by ',' lines terminated by '\n' parallel 'true';

3> create snapshot 1;
4> drop database tpcc_1000;
5> restore account sys from snapshot sp01;

triump2020 commented 1 day ago

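The cause has been located and is awaiting a fix. A "txn is stale" error is what leads to the table-not-found error. The "txn is stale" error is in turn caused by the partition state's minTs, start, and end values being inconsistent with each other.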
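To illustrate the failure chain described above, here is a toy model in Python. It is not MatrixOne's code: the class `PartitionState`, the invariant `min_ts <= start <= end`, and the place where the stale error gets masked are all assumptions made for illustration. Only the field names and the two error messages come from this thread.

```python
from dataclasses import dataclass


class TxnIsStale(Exception):
    """The snapshot timestamp is older than what the state can still serve."""


@dataclass
class PartitionState:
    # Hypothetical fields, named after the ones mentioned in the comment.
    min_ts: int  # oldest timestamp still readable (advanced by GC)
    start: int   # lower bound of the range this state covers
    end: int     # upper bound of the range this state covers

    def is_consistent(self) -> bool:
        # Toy invariant: the GC watermark never passes the start of the range.
        return self.min_ts <= self.start <= self.end


def read_catalog(state: PartitionState, tables: set, name: str, snapshot_ts: int) -> bool:
    """Return whether `name` is visible at `snapshot_ts`, or raise TxnIsStale."""
    if snapshot_ts < state.min_ts:
        raise TxnIsStale(f"txn is stale: ts {snapshot_ts} < min_ts {state.min_ts}")
    return state.start <= snapshot_ts <= state.end and name in tables


def resolve_table(state: PartitionState, tables: set, name: str, snapshot_ts: int) -> str:
    """Resolve a table name; a stale read is reported as a missing table."""
    try:
        visible = read_catalog(state, tables, name, snapshot_ts)
    except TxnIsStale:
        visible = False  # assumption: the stale error is masked at this point
    if not visible:
        raise LookupError(f'table "{name}" does not exist')
    return name


if __name__ == "__main__":
    tables = {"table_with_pk_index_for_write_1b"}

    good = PartitionState(min_ts=10, start=10, end=100)
    print(resolve_table(good, tables, "table_with_pk_index_for_write_1b", 50))

    # min_ts was advanced without start/end being updated along with it.
    bad = PartitionState(min_ts=80, start=10, end=100)
    print(bad.is_consistent())  # False
    try:
        resolve_table(bad, tables, "table_with_pk_index_for_write_1b", 50)
    except LookupError as err:
        print(err)
```

In this sketch the table exists the whole time. The snapshot read is rejected only because `min_ts` has moved past a timestamp that `start` and `end` still claim to cover, and the caller then sees a misleading "does not exist" message.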

triump2020 commented 1 day ago

> Improved the logging again; we are reproducing the problem both online and offline. Some other cause is probably responsible for it.

After reproducing it offline, the cause turned out to be the one the first PR was meant to fix; that fix simply did not work.

triump2020 commented 1 day ago

The table-not-found problem caused by "txn is stale" should now be fixed; I have tested it offline many times. @Ariznawlll please test.

triump2020 commented 4 hours ago

Waiting for the PR to be merged.

matrixorigin / matrixone