milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: v2.4.0 datanode memory usage is too high #32695

Open yesyue opened 3 weeks ago

yesyue commented 3 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: v2.4.0
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): 2.7
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 544 cores / 4291.6 GB (at least)
- GPU: 0
- Others: datanode

Current Behavior

Following the sizing tool's recommendation, the Data Nodes were allocated 2 cores / 8 GB x 2 pods. In actual operation they hit OOM, and after scaling up, memory usage reached 40 GB.

Expected Behavior

Data Node memory stays within the sizing tool's estimate of 2 cores / 8 GB x 2 pods, with no OOM.

Steps To Reproduce

Allocate Data Nodes per the sizing tool (2 cores / 8 GB x 2 pods) and run the ingest workload; OOM occurs, and after scaling up, memory usage reaches 40 GB.

Milvus Log

No response

Anything else?

No response

github-actions[bot] commented 3 weeks ago

The title and description of this issue contain Chinese. Please use English to describe your issue.

yesyue commented 3 weeks ago

Referring to the sizing tool, we allocated Data Nodes with 2 cores / 8 GB x 2 pods. However, during actual operation the Data Nodes hit OOM, and after scaling up, memory usage reached 40 GB.

yesyue commented 3 weeks ago

datanode log:

datanode.log

yanliang567 commented 3 weeks ago

@yesyue please share more info about how you are using Milvus, e.g. what kinds of requests you send, how many, and how frequently. Also, please share the logs from all the Milvus pods for investigation.

/assign @yesyue /unassign

yesyue commented 3 weeks ago

We write 100 million entities/day to Milvus.
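
For context, a quick back-of-the-envelope on that ingest rate. The vector dimension and flush window below are illustrative assumptions, not figures from this report:

```python
# Rough arithmetic for a 100M entities/day ingest stream.
# DIM and FLUSH_WINDOW_S are assumed for illustration only.
ENTITIES_PER_DAY = 100_000_000
SECONDS_PER_DAY = 86_400
DIM = 768                                  # assumed vector dimension
FLUSH_WINDOW_S = 300                       # assume data sits ~5 min in buffers

rate = ENTITIES_PER_DAY / SECONDS_PER_DAY  # ~1157 entities/s sustained
bytes_per_entity = DIM * 4                 # float32 vectors, ~3 KiB each
mib_per_sec = rate * bytes_per_entity / 2**20
buffered_mib = mib_per_sec * FLUSH_WINDOW_S

print(f"{rate:.0f} entities/s, {mib_per_sec:.1f} MiB/s, "
      f"~{buffered_mib:.0f} MiB buffered per {FLUSH_WINDOW_S}s window")
# -> 1157 entities/s, 3.4 MiB/s, ~1017 MiB buffered per 300s window
```

Under these assumptions the steady-state buffer is on the order of 1 GiB; memory only climbs to tens of GB if flushing falls behind and buffered data keeps accumulating, which is what the discussion below points at.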

tadinhkien99 commented 3 weeks ago

> We write 100 million entities/day to Milvus.

After I inserted 10M entities in total, the Milvus Docker container stopped and crashed. I use an IVF_SQ8 index and installed Milvus with GPU support. I insert in batches of 10,000 (only inserting once 10,000 entities have accumulated).

After the crash I can't connect anymore and can't use anything. Any solution?
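
For reference, a minimal sketch of the buffered batch-insert pattern described above, using pymilvus. The collection name "demo", its schema (an int64 primary key plus a float vector), and the dimension are hypothetical:

```python
# Sketch of batched inserts: accumulate entities locally and only call
# insert() once a full batch of 10,000 has been collected.
import random
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # adjust to your deployment
collection = Collection("demo")          # assumes this collection already exists

BATCH = 10_000
DIM = 128                                # assumed vector dimension
buf_ids, buf_vecs = [], []

for i in range(100_000):                 # stand-in for the real data stream
    buf_ids.append(i)
    buf_vecs.append([random.random() for _ in range(DIM)])
    if len(buf_ids) >= BATCH:            # insert only once a full batch accumulates
        collection.insert([buf_ids, buf_vecs])   # column-based insert
        buf_ids, buf_vecs = [], []

if buf_ids:                              # insert the remaining tail
    collection.insert([buf_ids, buf_vecs])
```

One note on this pattern: calling collection.flush() after every batch is usually counterproductive, since manual flushes produce many small segments; batching inserts and letting Milvus flush automatically is the commonly recommended approach.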

xiaofan-luan commented 3 weeks ago

1. It seems that flushing cannot catch up with the rate at which data is consumed from the MQ.
2. How many partitions do you have? If you have many partitions or collections, flush overhead and memory consumption will be larger than the estimate.
3. There is a bunch of configs to tune, e.g. the concurrent flush number (dataNode.dataSync.maxParallelSyncMgrTasks in 2.4) and the memory budget for growing segments; see the config sketch after this list.
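
A sketch of the corresponding milvus.yaml overrides. dataNode.dataSync.maxParallelSyncMgrTasks is the key named above; the values shown and the insertBufSize key are illustrative, so check the configuration reference for your exact version before applying anything:

```yaml
# Illustrative milvus.yaml overrides for a v2.4 datanode; the values are
# starting points to tune against your workload, not recommendations.
dataNode:
  dataSync:
    maxParallelSyncMgrTasks: 256   # concurrent flush/sync tasks (key cited above)
  segment:
    insertBufSize: 16777216        # bytes buffered per segment before auto-flush
```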
xiaofan-luan commented 3 weeks ago

> We write 100 million entities/day to Milvus.
>
> After I inserted 10M entities in total, the Milvus Docker container stopped and crashed. I use an IVF_SQ8 index and installed Milvus with GPU support. I insert in batches of 10,000 (only inserting once 10,000 entities have accumulated).
>
> After the crash I can't connect anymore and can't use anything. Any solution?

How much GPU memory do you have? Please open another issue with detailed logs so we can help.

yesyue commented 2 weeks ago

querynode (3).log

xiaofan-luan commented 2 weeks ago

> querynode (3).log

1. Could you offer the log for the datanode?
2. It would be great if you could capture a datanode pprof, so you know which part takes up your memory. Most likely the insert buffer is taking the memory, and you can tune the flush parameters; see the commands below.
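
For the pprof suggestion, something along these lines should work against a Kubernetes deployment. Milvus serves Go pprof on its metrics port (9091 by default); the pod name is a placeholder, and you should verify the port is reachable in your setup:

```bash
# Forward the datanode's metrics port (run in a separate terminal):
kubectl port-forward pod/<your-datanode-pod> 9091:9091

# Save a heap profile, then list the top allocation sites:
curl -s http://localhost:9091/debug/pprof/heap -o datanode.heap
go tool pprof -top datanode.heap
```

If an insert-buffer-related entry dominates the -top output, that would confirm the theory above and point at the flush parameters as the thing to tune.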
xiaofan-luan commented 2 weeks ago

I've seen you in many issues and we'd like to offer help. Feel free to contact me at xiaofan.luan@zilliz.com if necessary.