matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: [date 5.29]tke regression: sysbench 1000w delete test reported panic error runtime.goPanicIndex #16502

Closed heni02 closed 4 months ago

heni02 commented 6 months ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

8b019d284

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9288081660/job/25582868306

FATAL: mysql_stmt_execute() returned error 20101 (internal error: panic runtime error: index out of range [-1]:
runtime.goPanicIndex
/usr/local/go/src/runtime/panic.go:114
github.com/matrixorigin/matrixone/pkg/pb/pipeline.encodeVarintPipeline
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9785
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).MarshalToSizedBuffer
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9683
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).Marshal
/go/) for query 'DELETE FROM sbtest10 WHERE id=?'

FATAL: `thread_run' function failed: oltp_delete.lua:33: SQL error, errno = 20101, state = 'HY000': internal error: panic runtime error: index out of range [-1]:
runtime.goPanicIndex
/usr/local/go/src/runtime/panic.go:114
github.com/matrixorigin/matrixone/pkg/pb/pipeline.encodeVarintPipeline
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9785
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).MarshalToSizedBuffer
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9683
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).Marshal
/go/

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22jV9%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240529%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221717035752021%22,%22to%22:%221717039571624%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

TKE sysbench 1000w (10 million rows) delete test, 100/1000 threads

Additional information

No response

triump2020 commented 6 months ago

The stack: [image: 1717051811664] [image: 1717051855952]

@m-schen Could you kindly take a look?

m-schen commented 6 months ago

The logs contain the following two panics:

1.

{"level":"ERROR","time":"2024/05/30 03:21:56.727881 +0000","name":"cn-service.txn","caller":"compile/scope.go:321","msg":"panic in scope run","uuid":"30323939-3435-3164-3962-363262646239","sql":"","error":"internal error: panic runtime error: index out of range [1] with length 1:
runtime.goPanicIndex
/usr/local/go/src/runtime/panic.go:114
github.com/matrixorigin/matrixone/pkg/sql/colexec.GetExprZoneMap
/go/src/github.com/matrixorigin/matrixone/pkg/sql/colexec/evalExpression.go:1077
github.com/matrixorigin/matrixone/pkg/sql/compile.ApplyRuntimeFilters
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/runtime_filter.go:126
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).handleRuntimeFilter
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:798
github.com/matrixorigin/matrixone/pkg/sql/compile.buildScanParallelRun
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:469
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).ParallelRun
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:346
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).RemoteRun
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:291
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).MergeRun.func1
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:223
github.com/panjf2000/ants/v2.(*goWorker).run.func1
/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:67
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1650","session_id":"018fc784-6ada-732e-82c4-12ced49bc196","statement_id":"018fc784-707c-7ab5-8e05-0d9c157fc1c5","txn_id":"018fc784707c7af48ec1b14c4ab3b114","span":{"trace_id":"67b508b0-11b3-eb9e-1300-c1a62fe291ad","span_id":"5525b5a2f91ad0df","kind":"remote"}}

2.

{"level":"ERROR","time":"2024/05/30 02:25:42.948595 +0000","name":"cn-service.txn","caller":"compile/scopeRemoteRun.go:317","msg":"panic in scope remoteRun","uuid":"38613532-6130-3030-3066-343035353639","sql":"execute __mo_stmt_id_10","error":"internal error: panic runtime error: index out of range [-1]:
runtime.goPanicIndex
/usr/local/go/src/runtime/panic.go:114
github.com/matrixorigin/matrixone/pkg/pb/pipeline.encodeVarintPipeline
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9785
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).MarshalToSizedBuffer
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9683
github.com/matrixorigin/matrixone/pkg/pb/pipeline.(*Pipeline).Marshal
/go/src/github.com/matrixorigin/matrixone/pkg/pb/pipeline/pipeline.pb.go:9528
github.com/matrixorigin/matrixone/pkg/sql/compile.encodeScope
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scopeRemoteRun.go:377
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).remoteRun
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scopeRemoteRun.go:341
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).RemoteRun
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:300
github.com/matrixorigin/matrixone/pkg/sql/compile.(*Scope).MergeRun.func1
/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/scope.go:223
github.com/panjf2000/ants/v2.(*goWorker).run.func1
/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:67
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1650","session_id":"018fc750-c1fe-76f2-8f18-658db02f1545","statement_id":"018fc751-4416-7946-b9da-2541cc9566d4","txn_id":"018fc7514416798aae29bd01cace9054","span":{"trace_id":"9021d205-1586-da8f-4e06-801bd634c006","span_id":"43a90171766e219a"}}

Both look like problems that only surfaced after a defer was added to catch panics; these errors probably existed before as well.
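To illustrate why adding a defer can make an existing panic visible as a SQL error, here is a minimal sketch of the defer/recover pattern. `runScope` is a hypothetical stand-in, not the actual code in compile/scope.go; the sketch only assumes that a recovered panic is turned into an "internal error: panic ..." error carrying the stack.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// runScope is a hypothetical wrapper (not MatrixOne's real function).
// Any panic raised inside fn is recovered and returned as an error
// that includes the panic value and the goroutine's stack.
func runScope(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("internal error: panic %v: %s", r, debug.Stack())
		}
	}()
	fn()
	return nil
}

func main() {
	err := runScope(func() {
		args := []int{42}
		i := 1
		_ = args[i] // out-of-range access: index 1, length 1
	})
	fmt.Println(err != nil)
}
```

With a wrapper like this, the panic is reported to the client through the returned error and written to the log, which matches the shape of the two log entries above.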

m-schen commented 6 months ago

@badboynt1 Please help confirm the first panic: is it possible for the runtime filter to push down a faulty expression?

m-schen commented 6 months ago

This error seems like it can only be caused by a race: some field was modified during serialization.
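The following sketch shows how a field changing between the size calculation and the backward write can produce `index out of range [-1]` in a varint encoder. It is a simplified imitation of the size-then-fill-from-the-end layout used by generated protobuf marshalers, not the real pipeline.pb.go. The `msg` type and all function names are hypothetical, and the `mutate` callback stands in for a concurrent writer so that the failure is deterministic.

```go
package main

import "fmt"

// msg is a hypothetical two-field message, not pipeline.Pipeline.
type msg struct {
	Name  []byte // field 1, length-delimited
	Count uint64 // field 2, varint
}

// sov returns the number of bytes needed to varint-encode v.
func sov(v uint64) int {
	n := 1
	for v >= 1<<7 {
		v >>= 7
		n++
	}
	return n
}

// size returns the encoded size of m at the moment it is called.
func (m *msg) size() int {
	n := 0
	if len(m.Name) > 0 {
		n += 1 + sov(uint64(len(m.Name))) + len(m.Name)
	}
	if m.Count != 0 {
		n += 1 + sov(m.Count)
	}
	return n
}

// encodeVarint writes v so that it ends just before offset
// and returns the index where the varint starts.
func encodeVarint(buf []byte, offset int, v uint64) int {
	offset -= sov(v)
	base := offset
	for v >= 1<<7 {
		buf[offset] = uint8(v&0x7f | 0x80)
		v >>= 7
		offset++
	}
	buf[offset] = uint8(v)
	return base
}

// marshalToSizedBuffer fills buf from the end towards the front.
func (m *msg) marshalToSizedBuffer(buf []byte) int {
	i := len(buf)
	if m.Count != 0 {
		i = encodeVarint(buf, i, m.Count)
		i--
		buf[i] = 0x10 // tag: field 2, wire type 0
	}
	if len(m.Name) > 0 {
		i -= len(m.Name)
		copy(buf[i:], m.Name)
		i = encodeVarint(buf, i, uint64(len(m.Name)))
		i--
		buf[i] = 0x0a // tag: field 1, wire type 2
	}
	return len(buf) - i
}

// marshalWithMutation sizes the buffer, runs mutate (standing in for
// a concurrent writer), then serializes into the now-stale buffer.
func marshalWithMutation(m *msg, mutate func()) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic %v", r)
		}
	}()
	buf := make([]byte, m.size())
	mutate()
	n := m.marshalToSizedBuffer(buf)
	return buf[len(buf)-n:], nil
}

func main() {
	m := &msg{Name: []byte("ab"), Count: 5}
	_, err := marshalWithMutation(m, func() { m.Name = []byte("abcd") })
	fmt.Println(err) // panic runtime error: index out of range [-1]
}
```

In this sketch the buffer is sized for a 2-byte `Name`; once `Name` grows to 4 bytes, the write position reaches 0 before the length prefix is written, so `encodeVarint` indexes at -1. Whether the real panic followed this exact path is not established in this thread.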


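My current guess is that several operators in the pipeline share the same Argument object. One side no longer needed that memory for some reason (parallel run, for example) and released it, so another operator picked it up and reused it, and the object was therefore modified during serialization.

Take the following pipeline as an example: -> merge order -> output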

  1. scan -> projection -> order -> send to merge
  2. scan -> projection -> order -> send to merge

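Here 2 is a remote run, while 1 is expanded and executed locally. During the expansion the projection may no longer be needed, so it was released.

This is still only a guess and needs to be confirmed.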
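The guessed scenario can be sketched as follows. This is purely illustrative: `argument`, `operator`, and `pool` are hypothetical types, and the single-threaded free list only imitates an object pool so that the aliasing is deterministic. It does not reproduce MatrixOne's real Argument handling.

```go
package main

import "fmt"

// argument is a hypothetical operator argument.
type argument struct {
	exprs []string
}

// pool is a minimal single-threaded free list standing in for an object pool.
type pool struct{ free []*argument }

// get returns a recycled argument if one is available, else a new one.
func (p *pool) get() *argument {
	if n := len(p.free); n > 0 {
		a := p.free[n-1]
		p.free = p.free[:n-1]
		return a
	}
	return &argument{}
}

// put resets the argument and makes it available for reuse.
func (p *pool) put(a *argument) {
	a.exprs = a.exprs[:0]
	p.free = append(p.free, a)
}

// operator holds a pointer to its argument, so two operators built
// from the same pointer alias the same object.
type operator struct {
	name string
	arg  *argument
}

// reuseAfterRelease reports whether the remote operator ends up sharing
// its argument with an unrelated operator, and what it now sees.
func reuseAfterRelease() (aliased bool, seen []string) {
	p := &pool{}
	proj := p.get()
	proj.exprs = append(proj.exprs, "a+1")

	local := operator{name: "projection", arg: proj}  // pipeline 1, expanded locally
	remote := operator{name: "projection", arg: proj} // pipeline 2, waiting to be serialized

	p.put(local.arg) // pipeline 1 no longer needs the projection and releases it
	other := p.get() // an unrelated operator picks up the same object
	other.exprs = append(other.exprs, "x", "y", "z")

	return remote.arg == other, remote.arg.exprs
}

func main() {
	fmt.Println(reuseAfterRelease()) // true [x y z]
}
```

If the release and reuse happen on another goroutine while pipeline 2 is being serialized, the serializer can observe the argument changing underneath it, which is the kind of modification the hypothesis describes.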

m-schen commented 5 months ago

Not fixed yet; I have not located where exactly the race is. Will continue tomorrow.

m-schen commented 5 months ago

Shelving this for now. I can't find where the race is, damn it.

m-schen commented 5 months ago

No progress on this today. I spent the morning on the mpool OOM issue, and my computer broke down in the afternoon.

ouyuanning commented 5 months ago

Will analyze this together once Mingsong (明松) is back.

ouyuanning commented 5 months ago

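For panic 1: if the scope has a data race that modified the contents of the expr, that would explain it. For panic 2: I have not found any scenario in which a data race could lead to this.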

ouyuanning commented 4 months ago

Currently working on prepare.

ouyuanning commented 4 months ago

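Based on yesterday's discussion with Mingsong (明松), the likely cause is a race on the attributes of the operators inside the scope. If so, it should be similar to the earlier expr race in the filter operator, and may even be the same problem. When I have time I will do another sweep to check whether anything else modifies scope attributes at run time.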

ouyuanning commented 4 months ago

Have not had time to do the sweep yet.

ouyuanning commented 4 months ago

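We judge this to be an intermittent data race. The scope and pipeline lifecycle handling has since been heavily refactored, so the issue has probably been addressed already. We can keep observing.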

heni02 commented 4 months ago

It has not recurred. Closing.