matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0
1.77k stars 274 forks source link

[Bug]: sysbench test reported error FATAL: Worker threads failed to initialize within 30 seconds! #17001

Open heni02 opened 3 months ago

heni02 commented 3 months ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

6a43a6e7a

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798

企业微信截图_d7a65ef7-a6e7-42e3-b87e-ae6716c720ba

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench1000w nopk-load and nopk-point-select test

Additional information

No response

reusee commented 3 months ago

https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22aaX%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bmatrixorigin_io_namespace%3D%5C%22mo-error-case--nightly-57a9e8776-20240624%5C%22%7D%20%7C%3D%20%60slow%20event%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D%7D&schemaVersion=1&orgId=1

没有记录到慢的 I/O,不是 io 开销过大的问题。

reusee commented 3 months ago

原始错误信息

企业微信截图_a339aa93-a068-4a68-9016-dc1be1b8adfc
reusee commented 3 months ago

在 2024-06-24 04:00 左右的错误信息:https://grafana.ci.matrixorigin.cn/goto/KtEhzSQIR?orgId=1

可以看到有大量的 failed to init stats info for table 以及 connect: cannot assign requested address。后者似乎是因为fd耗尽,无法bind导致,可能是有网络连接未关闭,或者同时发起太多。

而且只有 04:00 这个时间有 connect: cannot assign requested address 的错误信息:https://grafana.ci.matrixorigin.cn/goto/Agb4iSwIR?orgId=1

reusee commented 3 months ago

Is there an existing issue for the same bug?

* [x]  I have checked the existing issues.

Branch Name

main

Commit ID

6a43a6e

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9567035983/job/26400260798 企业微信截图_d7a65ef7-a6e7-42e3-b87e-ae6716c720ba

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22kbI%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240618%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718774634000%22,%22to%22:%221718774664000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

tke regression sysbench1000w nopk-load and nopk-point-select test

Additional information

No response

这里又是另一个问题:https://grafana.ci.matrixorigin.cn/goto/nyU3iIQIR?orgId=1

看上去 cannot get table by ID 是主因,也是 disttae 里的 stats 相关。

reusee commented 3 months ago

两次重现,都和 disttae 包内的 stats 相关,先转给 @volgariver6 做进一步分析

volgariver6 commented 3 months ago

用改过之后的分支跑,还是会有问题,目前没有定位到具体原因,只是观察到在出问题的时候,创建了大量的goroutines。

volgariver6 commented 3 months ago

尚无进展

volgariver6 commented 2 months ago

正在测试

volgariver6 commented 2 months ago

目前测试还是会报错,麻烦 @reusee 继续跟踪一下

XuPeng-SH commented 2 months ago

https://github.com/matrixorigin/mo-auto-test/actions/runs/9890430540 测试两次没有问题了

heni02 commented 2 months ago

confirm,closed commit:88e486e11 https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9976227425

Ariznawlll commented 1 month ago

0816: commit:081becbc9158278b3bd974075a40e3539aafa1db

job urlhttps://github.com/matrixorigin/mo-nightly-regression/actions/runs/10405469110/job/28842654264

image

第一次出现 FATAL的log:https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22F03%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-main-nightly-081becbc9-20240815%5C%22%7D%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221723786939000%22,%22to%22:%221723786969000%22%7D%7D%7D&schemaVersion=1&orgId=1

volgariver6 commented 1 month ago

@Ariznawlll 这个麻烦跑一下二分吧,估计是最近的哪个pr引起

volgariver6 commented 1 month ago

等待二分结果

heni02 commented 1 month ago

二分查到是18100pr,cc @ouyuanning

企业微信截图_95491f00-ac90-4023-b160-ddc6e6b464a3

job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376

企业微信截图_94e9defb-bbe9-4b55-b7b7-fe958119c386
triump2020 commented 1 month ago

二分查到是18100pr,cc @ouyuanning 企业微信截图_95491f00-ac90-4023-b160-ddc6e6b464a3

job:https://github.com/matrixorigin/mo-auto-test/actions/runs/10508265514/job/29112655376 企业微信截图_94e9defb-bbe9-4b55-b7b7-fe958119c386

@heni02 确定是 18100 这个pr 么? 这个pr 只是调用了reader.Close , reader.Close 里面啥都没干,就是把一个迭代器复位了下.

ouyuanning commented 1 month ago

报错跟#17793一样, 目前1.2-dev也发现类似问题。应该不是二分到的这2个PR引入的

heni02 commented 1 month ago

更新下1.2报错信息: 1.2-dev commit:59d156b46 job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/10634487640/job/29503486397

企业微信截图_921e8c75-c715-416e-90b4-903e79e0d9db

mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22qFM%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-59d156b46-20240830%5C%22%7D%20%7C%3D%20%60operation%20error%20S3:%20GetObject,%20exceeded%20maximum%20number%20of%20attempts,%203%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221725083373948%22,%22to%22:%221725083702716%22%7D%7D%7D&schemaVersion=1&orgId=1

reusee commented 3 weeks ago

需要采集进程的socket信息,才能确认泄漏源头

reusee commented 2 weeks ago

无进展

reusee commented 1 week ago

无进展

reusee commented 5 days ago

无进展

heni02 commented 11 hours ago

1.2-dev commit:b0f3a5481复现了,环境已配置相关参数 job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/11142247340/job/30999607062

企业微信截图_65967e6e-195a-4ffe-bd53-b4d547c756c8

yaml: https://github.com/matrixorigin/mo-nightly-regression/blob/main/mo-bench-tke.yaml

threads初始时间:09:36:09 报错时间:09:36:39 mo log: https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22Wjp%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-nightly-b0f3a5481-20241002%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221727919369000%22,%22to%22:%221727919399000%22%7D%7D%7D&schemaVersion=1&orgId=1

企业微信截图_0ee34a95-f430-4bd3-90db-fea1dddebe1f