matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: mo panic by internal error: driver info: retry time out during stability test on standalone mode #12221

Open aressu1985 opened 11 months ago

aressu1985 commented 11 months ago

Is there an existing issue for the same bug?

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): fb5bed90c374a32f793629a002ff97a2b39ed597
- Hardware parameters:
- OS type:
- Others:

Actual Behavior

The panic log:

{"level":"ERROR","time":"2023/10/20 01:36:06.449989 +0800","name":"hakeeper-client-backend","caller":"morpc/backend.go:545","msg":"read loop stopped","remote":"127.0.0.1:32001","backend-id":"d52bceb7-0b5e-4328-9d7b-f5079de2ead7"}
{"level":"INFO","time":"2023/10/20 01:36:06.442867 +0800","name":"rpc-client[hakeeper-client([connectToHAKeeper])]","caller":"morpc/client.go:343","msg":"gc idle backends task started"}
{"level":"INFO","time":"2023/10/20 01:36:06.431010 +0800","caller":"disttae/txn.go:668","msg":"transaction commit: 1cf17d9075444cd6ae64da4386842d1a/Active/S:1697730556552701185-1\n"}
{"level":"WARN","time":"2023/10/20 01:36:06.371792 +0800","name":"gossip","caller":"registry/gossip_logger.go:44","msg":"memberlist: Failed to push local state: write tcp 127.0.0.1:32002->127.0.0.1:46712: i/o timeout from=127.0.0.1:46712"}
{"level":"WARN","time":"2023/10/20 01:36:06.478544 +0800","name":"dragonboat","caller":"v4@v4.0.0-20230426084722-d189534f8004/node.go:1398","msg":"[00000:31072] had 12 LocalTick msgs in one batch"}
panic: internal error: driver info: retry time out [recovered]
    panic: internal error: driver info: retry time out

goroutine 5103827 [running]:
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.NewLogServiceDriver.func1({0x31376a0?, 0xc1ec735010?})
    /mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver/driver.go:89 +0x25
github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
    /home/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:54 +0x75
panic({0x31376a0, 0xc1ec735010})
    /usr/local/go/src/runtime/panic.go:884 +0x213
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x5899e80?, {0x0?, 0x0?, 0xc0a36e80e0?})
    /home/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc331846680, {0x0, 0x0, 0x0})
    /home/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc2a48a9400?, {0xc00064cab0?, 0x0?}, {0x0, 0x0, 0x0})
    /home/go/pkg/mod/go.uber.org/zap@v1.24.0/logger.go:258 +0x59
github.com/matrixorigin/matrixone/pkg/logutil.Panic({0xc00064cab0?, 0x21?}, {0x0?, 0x1?, 0x1?})
    /mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/logutil/api.go:41 +0x8b
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.(*driverAppender).append(0xc183b05e80, 0xc00bd80fa8?, 0x2540be400)
    /mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver/appender.go:101 +0x7e5
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.(*LogServiceDriver).onAppendQueue.func1()
    /mnt/datadisk0/actions-runner/_work/mo-nightly-regression/mo-nightly-regression/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver/append.go:67 +0x2d
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
    /home/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:67 +0x97
created by github.com/panjf2000/ants/v2.(*goWorker).run
    /home/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:48 +0x65
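
For readers unfamiliar with this failure mode, here is a minimal Go sketch of a retry-until-deadline append loop that gives up with the same message. It is not MatrixOne's code: `appendWithRetry`, `errRetryTimeout`, `traceTimeout` and the demo durations are hypothetical names chosen for illustration. The one detail taken from the trace is the third argument of `append`, `0x2540be400`, which equals 10,000,000,000; reading it as a `time.Duration` in nanoseconds (an assumption) would make the retry budget 10 seconds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRetryTimeout mirrors the message in the panic. The name is hypothetical.
var errRetryTimeout = errors.New("driver info: retry time out")

// traceTimeout is the third argument of append in the stack trace,
// interpreted as nanoseconds (an assumption, not confirmed by the source).
const traceTimeout time.Duration = 0x2540be400

// Short durations so the demo finishes quickly.
const (
	demoTimeout  = 50 * time.Millisecond
	demoInterval = 10 * time.Millisecond
)

// appendWithRetry calls try until it succeeds or the timeout elapses.
// Hypothetical helper: it returns an error where the real driver panics.
func appendWithRetry(try func() error, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if err := try(); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return errRetryTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	fmt.Println(traceTimeout) // prints 10s

	// A backend that never accepts the write, standing in for a log
	// service stalled on disk I/O.
	stalled := func() error { return errors.New("i/o timeout") }

	err := appendWithRetry(stalled, demoTimeout, demoInterval)
	fmt.Println(err) // prints the retry time out error
}
```

The sketch returns the error to the caller, whereas the trace shows the real driver escalating it through `logutil.Panic`, which is why the whole process goes down once the backend stays unresponsive for longer than the retry budget.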

the whole log: mo-service-panic.tar.gz

Expected Behavior

No response

Steps to Reproduce

1. Run a mo server in standalone mode.
2. Run the bvt test loop.
3. Run the tpch_10 test loop.
4. Run the sysbench mixed test.
5. Run tpcc with 10 warehouses and 10 terminals.

or

1. Run a mo server in standalone mode.
2. git clone https://github.com/matrixorigin/mo-nightly-regression.git
3. ./stb_test.sh -c /bvtcasepath/

Linux only.

Additional information

No response

volgariver6 commented 11 months ago

In standalone mode, this does happen because of disk I/O contention.

volgariver6 commented 11 months ago

No progress.

volgariver6 commented 11 months ago

No progress.

heni02 commented 9 months ago

The standalone regression run on 12.18 reproduced this problem. Job: https://github.com/matrixorigin/mo-nightly-regression/actions/runs/7249177604/job/19746773876

[Screenshot: 企业微信截图_0d1fd30a-63fc-437a-86bb-d811afa7a2bb]

mo log:

[Screenshot: 企业微信截图_7f44148d-0a4c-4296-92b0-988f9202c25f]
[Screenshot: 企业微信截图_545cc0e2-8f07-4401-ab3f-c2b6d34dc12f]

The log is too large, so I will send it privately. CPU and memory usage at the time (memory was almost fully used):

[Screenshot: 企业微信截图_76ced25b-84e1-4928-a48a-9da1e4f23822]

profile: profile (3).tar.gz

volgariver6 commented 8 months ago

Changing this to s1 for now; I will deal with the issue when I have time.

volgariver6 commented 7 months ago

Changed to s0, because some users are running the standalone version.

volgariver6 commented 6 months ago

No progress.

volgariver6 commented 6 months ago

No progress.

volgariver6 commented 6 months ago

No progress.

volgariver6 commented 6 months ago

No progress.

volgariver6 commented 6 months ago

No progress.

volgariver6 commented 5 months ago

No progress.

volgariver6 commented 5 months ago

No progress.

guguducken commented 5 months ago

Repro in a distributed environment: https://github.com/matrixorigin/matrixone/actions/runs/8645981323

[Screenshot: 企业微信截图_4062aa9f-8155-4ec0-8d4b-d6b940607c41]

volgariver6 commented 5 months ago

No progress.

volgariver6 commented 5 months ago

No progress.

daviszhen commented 4 months ago

pengzhen@pengzhen:~/Documents/temp/matrixone-temp$ ./mo-service -debug-http 127.0.0.1:6060 -launch etc/launch/launch.toml > log.txt
2024/05/08 15:15:45 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
[mysql] 2024/05/08 15:40:08 packets.go:37: read tcp 127.0.0.1:58362->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:40:25 packets.go:37: read tcp 127.0.0.1:55816->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:41:08 packets.go:37: read tcp 127.0.0.1:38246->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:41:10 packets.go:37: read tcp 127.0.0.1:38252->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:41:20 packets.go:37: read tcp 127.0.0.1:50096->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:41:40 packets.go:37: read tcp 127.0.0.1:32986->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:08 packets.go:37: read tcp 127.0.0.1:59418->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:10 packets.go:37: read tcp 127.0.0.1:59422->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:20 packets.go:37: read tcp 127.0.0.1:34220->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:30 packets.go:37: read tcp 127.0.0.1:60932->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:40 packets.go:37: read tcp 127.0.0.1:47034->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:42:50 packets.go:37: read tcp 127.0.0.1:59082->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:43:13 packets.go:37: read tcp 127.0.0.1:45292->127.0.0.1:6001: i/o timeout
[mysql] 2024/05/08 15:43:15 packets.go:37: read tcp 127.0.0.1:36676->127.0.0.1:6001: i/o timeout
panic: internal error: driver info: retry time out [recovered]
    panic: internal error: driver info: retry time out

goroutine 66734 [running]:
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.NewLogServiceDriver.func1({0x3db80a0?, 0xc058ad0010?})
    /home/pengzhen/Documents/temp/matrixone-temp/pkg/vm/engine/tae/logstore/driver/logservicedriver/driver.go:89 +0x1d
github.com/panjf2000/ants/v2.(*goWorker).run.func1.1()
    /home/pengzhen/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:54 +0x6d
panic({0x3db80a0?, 0xc058ad0010?})
    /usr/local/go/src/runtime/panic.go:914 +0x21f
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x77609c0?, {0x0?, 0x0?, 0xc04f728980?})
    /home/pengzhen/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x54
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc046bb1e10, {0x0, 0x0, 0x0})
    /home/pengzhen/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
go.uber.org/zap.(*Logger).Panic(0xc01bb70720?, {0xc000127680?, 0x0?}, {0x0, 0x0, 0x0})
    /home/pengzhen/go/pkg/mod/go.uber.org/zap@v1.24.0/logger.go:258 +0x51
github.com/matrixorigin/matrixone/pkg/logutil.Panic({0xc000127680?, 0x21?}, {0x0?, 0x1?, 0x1?})
    /home/pengzhen/Documents/temp/matrixone-temp/pkg/logutil/api.go:41 +0x85
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.(*driverAppender).append(0xc050a44380, 0xc0082fdfa8?, 0x2540be400)
    /home/pengzhen/Documents/temp/matrixone-temp/pkg/vm/engine/tae/logstore/driver/logservicedriver/appender.go:104 +0x84f
github.com/matrixorigin/matrixone/pkg/vm/engine/tae/logstore/driver/logservicedriver.(*LogServiceDriver).onAppendQueue.func1()
    /home/pengzhen/Documents/temp/matrixone-temp/pkg/vm/engine/tae/logstore/driver/logservicedriver/append.go:67 +0x27
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
    /home/pengzhen/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:67 +0x8d
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 1730
    /home/pengzhen/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.4/worker.go:48 +0x5c

daviszhen commented 4 months ago

tpcc with 10 warehouses and 10 concurrent terminals. It ran for about half an hour.