[Bug]: Milvus Standalone keeps restarting/crashing #38171

Open tanvlt opened 1 day ago

tanvlt commented 1 day ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.15
- Deployment mode(standalone or cluster): Standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq 
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Kubernetes
- CPU/Memory: Azure standard_d16_v5 (16/64)
- GPU: 
- Others: 
 + External storage: S3

Current Behavior

Expected Behavior

Steps To Reproduce

No response

Milvus Log

logs.tar.gz

Anything else?

No response

yanliang567 commented 1 day ago

@tanvlt Checking the logs, Milvus is restarting because it lost its heartbeat with the etcd service: `Normal Killing 57m (x2 over 25h) kubelet Container standalone failed liveness probe, will be restarted`. Please double-check that:

  1. etcd is running against SSD volumes for high performance;
  2. the Milvus pod has enough CPU requested to keep its session with etcd alive (this only happens when there are heavy workloads running on Milvus).

/assign @tanvlt
/unassign

tanvlt commented 1 day ago

Hi @yanliang567, thanks for checking.

yanliang567 commented 1 day ago

  1. The average CPU usage is around 500%, which is too high. How many CPU cores did you request and limit for the Milvus pod? If Milvus is running exclusively on the node, please set the request and the limit to the same value, which helps with Milvus stability and performance (see the sketch after this list).
  2. The previous pod log is all INFO; please set the Milvus log level to debug so that we can see more information if the issue reproduces.
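
For reference, a minimal sketch of pinning the pod's CPU/memory request equal to its limit with the official `kubernetes` Python client. The deployment name `milvus-standalone`, the `milvus` namespace, and the 8 CPU / 32Gi sizing are assumptions; in practice the same change is usually made in your Helm values or Deployment manifest.

```python
# Hedged sketch: set requests == limits for the Milvus standalone container.
# Deployment name, namespace, and sizes are assumptions, not from this thread.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "standalone",  # container name, per the kubelet event above
        "resources": {
            "requests": {"cpu": "8", "memory": "32Gi"},
            "limits": {"cpu": "8", "memory": "32Gi"},
        },
    }]}}}
}

client.AppsV1Api().patch_namespaced_deployment(
    name="milvus-standalone", namespace="milvus", body=patch
)
```
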
tanvlt commented 1 day ago

Hi @yanliang567

yanliang567 commented 1 day ago

@tanvlt Would you like a call to talk through your scenarios? Please feel free to mail me at yanliang.qiao@zilliz.com with your available time and contact info.

xiaofan-luan commented 1 day ago

I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus release includes optimizations for it. But in general, 9,000 collections is still too many for one single Milvus cluster. What is the target collection number?

bigsheeper commented 1 day ago

> I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus release includes optimizations for it. But in general, 9,000 collections is still too many for one single Milvus cluster. What is the target collection number?

Yes, I guess this is related to periodic, large-scale metadata transactions triggered by the high number of collections.

@tanvlt Could you please check the meta request rate monitoring and see whether the periods of high transaction rates align with the times when Milvus restarted? This information would help us better understand the issue. The monitoring panel looks like this: [screenshot of the meta request rate panel omitted]

tanvlt commented 8 hours ago

Hi @bigsheeper, I have not enabled that monitoring yet; let me enable it and get back to you soon.

tanvlt commented 7 hours ago

Hi @xiaofan-luan, we don't have a specific target for the number of collections; we have just started our product, and the number will keep growing. By the way, @xiaofan-luan, I checked the FAQ at https://milvus.io/docs/v2.4.x/product_faq.md#Is-there-a-limit-to-the-total-number-of-collections-and-partitions-in-Milvus. It mentions that I can create up to 65,000 collections; is that correct? We are following the "one collection per tenant" approach, with no shard or partition settings: https://milvus.io/docs/v2.4.x/multi_tenancy.md
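
For context, the "one collection per tenant" pattern described above looks roughly like the sketch below, using pymilvus's MilvusClient; the URI, field names, and dimension are illustrative assumptions.

```python
# Illustrative sketch of collection-per-tenant; names and dim are assumptions.
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

def create_tenant_collection(tenant: str, dim: int = 768) -> None:
    schema = MilvusClient.create_schema(auto_id=True)
    schema.add_field("pk", DataType.INT64, is_primary=True)
    schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=dim)
    # Default shard/partition settings, as described above.
    client.create_collection(collection_name=f"tenant_{tenant}", schema=schema)

create_tenant_collection("acme")  # repeated per tenant as tenants sign up
```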

xiaofan-luan commented 6 hours ago

We do see severe performance bottlenecks and stability issues when the collection count exceeds 5k in Milvus 2.4.x. @bigsheeper is actually working on improvements, and some of the recent releases might help. The goal of the new releases is to support 10K collections with 1,000 partitions in each collection; right now it is still challenging. To implement a multi-tenancy app, see https://milvus.io/docs/multi_tenancy.md. Partition key might be what you actually need.
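
A minimal sketch of the partition-key pattern from the linked docs, using pymilvus's MilvusClient; the collection name, field names, dimension, and `num_partitions` value are assumptions.

```python
# Hedged sketch: one shared collection, with tenant_id as the partition key.
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field("pk", DataType.INT64, is_primary=True)
# One field value per tenant, instead of one collection per tenant.
schema.add_field("tenant_id", DataType.VARCHAR, max_length=64,
                 is_partition_key=True)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=768)

index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", index_type="AUTOINDEX",
                       metric_type="L2")

client.create_collection(
    collection_name="tenants_shared",
    schema=schema,
    index_params=index_params,
    num_partitions=64,  # physical partitions the tenant_id values hash into
)
```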

zrg-team commented 6 hours ago

Hi @xiaofan-luan, it seems we started the project before partition-key-based tenancy was implemented. Could you share the partition-key usage documentation? Do you have any idea how to smoothly migrate from the collection-per-tenant approach to the partition-key approach?
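
A rough migration sketch under stated assumptions: the shared `tenants_shared` collection from the sketch above already exists, and each per-tenant collection has matching `pk`/`embedding` fields. It pages through each old collection and re-inserts rows tagged with the tenant's partition-key value; for large collections, pymilvus's query iterator is preferable to offset paging, which Milvus caps.

```python
# Hedged sketch: copy per-tenant collections into the shared one.
# Collection names, field names, and batch size are assumptions.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
BATCH = 1000

for tenant in ["tenant_a", "tenant_b"]:       # your per-tenant collections
    offset = 0
    while True:
        rows = client.query(
            collection_name=tenant,
            filter="",                        # empty filter scans everything
            output_fields=["embedding"],      # plus any scalar fields you keep
            offset=offset,
            limit=BATCH,
        )
        if not rows:
            break
        for row in rows:
            row.pop("pk", None)               # let auto_id mint new keys
            row["tenant_id"] = tenant         # tag with the partition key
        client.insert(collection_name="tenants_shared", data=rows)
        offset += BATCH
```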

zrg-team commented 2 hours ago

For the partition_key_field approach, would we store tenant data in the same collection but with different partition_key_field values?
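
Under that reading (a single shared collection where each tenant gets its own value of the partition-key field), writes and tenant-scoped searches would look roughly like the sketch below, reusing the assumed `tenants_shared` schema from earlier.

```python
# Hedged sketch: every tenant lives in one shared collection, distinguished
# only by its tenant_id value; names and dim are assumptions as before.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Write: each row carries its tenant's partition-key value.
client.insert(
    collection_name="tenants_shared",
    data=[{"tenant_id": "tenant_a", "embedding": [0.1] * 768}],
)

# Read: filtering on the partition key scopes the search to one tenant and
# lets Milvus prune to the matching physical partition.
hits = client.search(
    collection_name="tenants_shared",
    data=[[0.1] * 768],
    filter='tenant_id == "tenant_a"',
    limit=10,
)
```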