[Bug]: Milvus Standalone keeps restarting/crashing #38171

Open tanvlt opened 1 day ago

tanvlt commented 1 day ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.15
- Deployment mode(standalone or cluster): Standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq 
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Kubernetes
- CPU/Memory: Azure standard_d16_v5 (16/64)
- GPU: 
- Others: 
 + External storage: S3

Current Behavior

Expected Behavior

Steps To Reproduce

No response

Milvus Log

logs.tar.gz

Anything else?

No response

yanliang567 commented 1 day ago

@tanvlt Checking the logs, Milvus is restarting because it lost its heartbeat with the etcd service: `Normal Killing 57m (x2 over 25h) kubelet Container standalone failed liveness probe, will be restarted`. Please double-check that:

  1. etcd is running against SSD volumes for high performance;
  2. the Milvus pod has enough CPU requested to keep its session with etcd alive (this only happens when there are heavy workloads running on Milvus).

/assign @tanvlt
/unassign

tanvlt commented 1 day ago

Hi @yanliang567, thanks for checking.

yanliang567 commented 1 day ago

  1. The average CPU usage is around 500%, which is too high. How many CPU cores did you request and limit for the Milvus pod? If Milvus is running exclusively on the node, please set the request and the limit to the same value, which helps with Milvus stability and performance (see the sketch after this list).
  2. The previous pod log is all INFO; please set the Milvus log level to debug so that we can see more information if the issue reproduces.
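
For reference, a minimal sketch of pinning the pod's CPU/memory request equal to its limit with the official `kubernetes` Python client. The deployment name `milvus-standalone`, the `milvus` namespace, and the 8 CPU / 32Gi sizing are assumptions; in practice the same change is usually made in your Helm values or Deployment manifest.

```python
# Hedged sketch: set requests == limits for the Milvus standalone container.
# Deployment name, namespace, and sizes are assumptions, not from this thread.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

patch = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "standalone",  # container name, per the kubelet event above
        "resources": {
            "requests": {"cpu": "8", "memory": "32Gi"},
            "limits": {"cpu": "8", "memory": "32Gi"},
        },
    }]}}}
}

client.AppsV1Api().patch_namespaced_deployment(
    name="milvus-standalone", namespace="milvus", body=patch
)
```
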
tanvlt commented 1 day ago

Hi @yanliang567

yanliang567 commented 1 day ago

@tanvlt Would you like a call to talk through your scenarios? Please feel free to mail me at yanliang.qiao@zilliz.com with your available time and contact info.

xiaofan-luan commented 1 day ago

I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus release includes optimizations for it. But in general, 9,000 collections is still too many for one single Milvus cluster. What is the target collection number?

bigsheeper commented 1 day ago

> I guess this might be a too-many-collections issue. @bigsheeper has been working on it for a while, and the latest Milvus release includes optimizations for it. But in general, 9,000 collections is still too many for one single Milvus cluster. What is the target collection number?

Yes, I guess this is related to periodic, large-scale metadata transactions triggered by the high number of collections.

@tanvlt Could you please check the meta request rate monitoring and see whether the periods of high transaction rates align with the times when Milvus restarted? This information would help us better understand the issue. The monitoring panel looks like this: [screenshot of the meta request rate panel omitted]

tanvlt commented 8 hours ago

Hi @bigsheeper, I have not enabled that monitoring yet; let me enable it and get back to you soon.

tanvlt commented 7 hours ago

Hi @xiaofan-luan, we don't have a specific target for the number of collections; we have just started our product, and the number will keep growing. By the way, @xiaofan-luan, I checked the FAQ at https://milvus.io/docs/v2.4.x/product_faq.md#Is-there-a-limit-to-the-total-number-of-collections-and-partitions-in-Milvus. It mentions that I can create up to 65,000 collections; is that correct? We are following the "one collection per tenant" approach, with no shard or partition settings: https://milvus.io/docs/v2.4.x/multi_tenancy.md
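
For context, the "one collection per tenant" pattern described above looks roughly like the sketch below, using pymilvus's MilvusClient; the URI, field names, and dimension are illustrative assumptions.

```python
# Illustrative sketch of collection-per-tenant; names and dim are assumptions.
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

def create_tenant_collection(tenant: str, dim: int = 768) -> None:
    schema = MilvusClient.create_schema(auto_id=True)
    schema.add_field("pk", DataType.INT64, is_primary=True)
    schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=dim)
    # Default shard/partition settings, as described above.
    client.create_collection(collection_name=f"tenant_{tenant}", schema=schema)

create_tenant_collection("acme")  # repeated per tenant as tenants sign up
```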

xiaofan-luan commented 6 hours ago

We do see severe performance bottlenecks and stability issues when the collection count exceeds 5k in Milvus 2.4.x. @bigsheeper is actually working on improvements, and some of the recent releases might help. The goal of the new releases is to support 10K collections with 1,000 partitions in each collection; right now it is still challenging. To implement a multi-tenancy app, see https://milvus.io/docs/multi_tenancy.md. Partition key might be what you actually need.
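
A minimal sketch of the partition-key pattern from the linked docs, using pymilvus's MilvusClient; the collection name, field names, dimension, and `num_partitions` value are assumptions.

```python
# Hedged sketch: one shared collection, with tenant_id as the partition key.
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema(auto_id=True)
schema.add_field("pk", DataType.INT64, is_primary=True)
# One field value per tenant, instead of one collection per tenant.
schema.add_field("tenant_id", DataType.VARCHAR, max_length=64,
                 is_partition_key=True)
schema.add_field("embedding", DataType.FLOAT_VECTOR, dim=768)

index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", index_type="AUTOINDEX",
                       metric_type="L2")

client.create_collection(
    collection_name="tenants_shared",
    schema=schema,
    index_params=index_params,
    num_partitions=64,  # physical partitions the tenant_id values hash into
)
```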

zrg-team commented 6 hours ago

Hi @xiaofan-luan, it seems we started the project before partition-key-based tenancy was implemented. Could you share the partition-key usage documentation? Do you have any idea how to smoothly migrate from the collection-per-tenant approach to the partition-key approach?
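
A rough migration sketch under stated assumptions: the shared `tenants_shared` collection from the sketch above already exists, and each per-tenant collection has matching `pk`/`embedding` fields. It pages through each old collection and re-inserts rows tagged with the tenant's partition-key value; for large collections, pymilvus's query iterator is preferable to offset paging, which Milvus caps.

```python
# Hedged sketch: copy per-tenant collections into the shared one.
# Collection names, field names, and batch size are assumptions.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
BATCH = 1000

for tenant in ["tenant_a", "tenant_b"]:       # your per-tenant collections
    offset = 0
    while True:
        rows = client.query(
            collection_name=tenant,
            filter="",                        # empty filter scans everything
            output_fields=["embedding"],      # plus any scalar fields you keep
            offset=offset,
            limit=BATCH,
        )
        if not rows:
            break
        for row in rows:
            row.pop("pk", None)               # let auto_id mint new keys
            row["tenant_id"] = tenant         # tag with the partition key
        client.insert(collection_name="tenants_shared", data=rows)
        offset += BATCH
```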

zrg-team commented 2 hours ago

For the partition_key_field approach, would we store tenant data in the same collection but with different partition_key_field values?
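
Under that reading (a single shared collection where each tenant gets its own value of the partition-key field), writes and tenant-scoped searches would look roughly like the sketch below, reusing the assumed `tenants_shared` schema from earlier.

```python
# Hedged sketch: every tenant lives in one shared collection, distinguished
# only by its tenant_id value; names and dim are assumptions as before.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Write: each row carries its tenant's partition-key value.
client.insert(
    collection_name="tenants_shared",
    data=[{"tenant_id": "tenant_a", "embedding": [0.1] * 768}],
)

# Read: filtering on the partition key scopes the search to one tenant and
# lets Milvus prune to the matching physical partition.
hits = client.search(
    collection_name="tenants_shared",
    data=[[0.1] * 768],
    filter='tenant_id == "tenant_a"',
    limit=10,
)
```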