matrixorigin / matrixone

Hyperconverged cloud-edge native database
https://docs.matrixorigin.cn/en
Apache License 2.0

[Bug]: CN crashed with OOM during stability test in distributed mode #16573

Open aressu1985 opened 1 month ago

aressu1985 commented 1 month ago

Is there an existing issue for the same bug?

Branch Name

main

Commit ID

d38e334

Other Environment Information

- Hardware parameters:
  - 3*CN: 16C 64G
  - 1*DN: 16C 64G
  - 3*LOG: 4C 16G
  - 2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

A CN crashed with OOM (out of memory) during the stability test in distributed mode.


Resources dashboard: https://shanghai.idc.matrixorigin.cn:30001/d/85a562078cdf77779eaa1add43ccec1e/kubernetes-compute-resources-namespace-pods?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=mo-nightly-d38e334-20240531221619&from=1717157528000&to=1717237508000

Expected Behavior

No response

Steps to Reproduce

1. Run an MO cluster with the config given in this issue.
2. Run TPC-H 10G loop test processes in one independent tenant.
3. Run TPC-C long-running test processes with 10 warehouses and 10 terminals in one independent tenant, in prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, in non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, in non-prepare mode.

Additional information

No response

aressu1985 commented 1 month ago

The specific prof (profile) information has not been located yet; to be added.

arjunsk commented 1 month ago

This could be due to mpool. m-schen, please have a look at this.

ouyuanning commented 1 month ago

乐声, could you take a look first? I'm not sure whether this is related to the TPC-C OOM.

reusee commented 1 month ago

Optimizing the allocator.

reusee commented 1 month ago

No progress.

reusee commented 1 month ago

Optimizing mpool.

reusee commented 1 month ago

Possible fix commits/PRs: https://github.com/matrixorigin/matrixone/commit/e520a422ab2d0566593d651365f73c29d62a538a https://github.com/matrixorigin/matrixone/pull/17113

reusee commented 1 month ago

Continuing to optimize.

reusee commented 3 weeks ago

Continuing to optimize.

reusee commented 3 weeks ago

No progress.

reusee commented 2 weeks ago

The OOM should be fixed by now.

reusee commented 1 week ago

Same as above.

reusee commented 1 week ago

No progress.

reusee commented 4 days ago

Working on other issues.

reusee commented 1 day ago

All known optimizations have been merged.