matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: lots of error response " internal error: panic runtime error: slice bounds out of range [5927556:5925133]:" in sysbench mixed cased during stability test on distributed mode #16884

Closed aressu1985 closed 5 months ago

aressu1985 commented 5 months ago

Is there an existing issue for the same bug?

Branch Name

1.2-dev

Commit ID

d807366

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

There are many error responses of the form "internal error: panic runtime error: slice bounds out of range [5927556:5925133]:" in the sysbench mixed case during the stability test in distributed mode.
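For readers unfamiliar with this message: in Go, a slice expression `buf[low:high]` panics with `slice bounds out of range [low:high]` when `low` is greater than `high`, which matches the two numbers reported here (5927556 > 5925133). The snippet below is a minimal sketch that reproduces the same class of panic with small numbers; `sliceRange` is a hypothetical helper for illustration only, not MatrixOne code.

```go
package main

import "fmt"

// sliceRange returns buf[low:high]. It converts the runtime panic that Go
// raises for invalid bounds into an ordinary error so the message can be
// inspected. (Hypothetical helper, for illustration only.)
func sliceRange(buf []byte, low, high int) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic: %v", r)
		}
	}()
	return buf[low:high], nil
}

func main() {
	buf := make([]byte, 8)

	// low > high triggers the same kind of panic as in the issue.
	_, err := sliceRange(buf, 5, 3)
	fmt.Println(err) // prints: panic: runtime error: slice bounds out of range [5:3]
}
```

A start offset larger than the end offset usually means the offsets were read from metadata that does not match the bytes actually being sliced.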


mo-log: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22hJL%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-d807366-20240612223759%5C%22%7D%20%7C%3D%20%60slice%20bounds%20out%20of%20range%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718242973718%22,%22to%22:%221718246821491%22%7D%7D%7D&schemaVersion=1&orgId=1

sysbench mixed case SQL:

session1: select c from sbtest{tbx} where id = {id};

session2: delete from sbtest{tbx} where id = {id}; insert into sbtest{tbx} values({id},4993,'83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330','67847967377-48000963322-62604785301-91415491898-96926520291');

session3: UPDATE sbtest{tbx} SET k=k+100 WHERE id = {id};

Expected Behavior

No response

Steps to Reproduce

1. Run a MO cluster with the config in this issue.
2. Run TPC-H 10G loop test processes in one independent tenant.
3. Run TPC-C long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.

Additional information

No response

LeftHandCold commented 5 months ago

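This problem occurs on only one CN, and the file being read is a delete file. The DN reads the same file normally and does not panic; only this CN panics. The panics also occurred only between 9:59 AM and 10:20 AM. During that window the diskcache logged a large number of errors, and afterwards everything returned to normal. My guess is that by then the diskcache had evicted the problematic files, which is why it recovered. So I suspect that the files in this CN's diskcache were corrupted at that time.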

LeftHandCold commented 5 months ago

image

LeftHandCold commented 5 months ago

filelist.log Looking at the current contents of this CN's diskcache, the earliest file is from after 2 PM. The time range of the CN anomaly is as follows: https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22hJL%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-d807366-20240612223759%5C%22,%20pod%3D%5C%22stability-regression-dis-tp-cn-g4b4b%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718235835503%22,%22to%22:%221718251466136%22%7D%7D%7D&schemaVersion=1&orgId=1
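The `from` and `to` values in the log URL above are Unix timestamps in milliseconds. As a small sketch, they can be decoded as follows; the UTC+8 offset is an assumption based on the cluster being hosted in Shanghai, and `msToLocal` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// msToLocal converts a Unix timestamp in milliseconds to a time in UTC+8.
// The fixed offset is an assumption about where the cluster runs.
func msToLocal(ms int64) time.Time {
	zone := time.FixedZone("UTC+8", 8*60*60)
	return time.UnixMilli(ms).In(zone)
}

func main() {
	const layout = "2006-01-02 15:04:05"
	from := msToLocal(1718235835503) // "from" value in the URL
	to := msToLocal(1718251466136)   // "to" value in the URL
	fmt.Println(from.Format(layout), "to", to.Format(layout))
	// prints: 2024-06-13 07:43:55 to 2024-06-13 12:04:26
}
```

Under that assumption the query window spans the morning of 2024-06-13, which covers the 9:59 to 10:20 period in which the panics were observed.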

aressu1985 commented 5 months ago

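This needs further investigation: after the cache write succeeds but the S3 write fails, is there a problem with the subsequent handling logic?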
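To make the concern concrete: if data is written to a local cache first and the remote write then fails, the cache entry has to be discarded, otherwise later reads may be served bytes that were never durably stored. The sketch below illustrates that ordering with an in-memory stand-in; `store`, `writeThrough`, and the `failPut` switch are all hypothetical and are not taken from MatrixOne's fileservice.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical in-memory stand-in for either the local cache
// or the remote object storage.
type store struct {
	objects map[string][]byte
	failPut bool // when true, put simulates a failed write
}

func newStore() *store { return &store{objects: map[string][]byte{}} }

func (s *store) put(key string, val []byte) error {
	if s.failPut {
		return errors.New("put failed")
	}
	s.objects[key] = val
	return nil
}

func (s *store) remove(key string) { delete(s.objects, key) }

// writeThrough writes to the cache first, then to remote storage. If the
// remote write fails, it removes the cache entry so no stale data remains.
func writeThrough(cache, remote *store, key string, val []byte) error {
	if err := cache.put(key, val); err != nil {
		return err
	}
	if err := remote.put(key, val); err != nil {
		cache.remove(key)
		return fmt.Errorf("remote write failed, cache entry dropped: %w", err)
	}
	return nil
}

func main() {
	cache, remote := newStore(), newStore()
	remote.failPut = true

	err := writeThrough(cache, remote, "obj-1", []byte("payload"))
	_, stillCached := cache.objects["obj-1"]
	fmt.Println(err, "| still cached:", stillCached)
}
```

Whether the real code path already behaves this way is exactly what this comment asks to be checked.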

LeftHandCold commented 5 months ago

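system1.log This case was caused by a system-level fault. At 09:59, exactly when the problem appeared, the host kernel hit a filemap_get_page exception, which caused page cache anomalies, along with timer_interrupt. This continued until the first soft lockup at 10:17.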
