milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: v2.4.0 query node auto_balance no available #32714

Open yesyue opened 3 weeks ago

yesyue commented 3 weeks ago

Is there an existing issue for this?

Environment

- Milvus version:2.4.0
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  kafka   
- SDK version(e.g. pymilvus v2.0.0rc2): 2.7
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU:  0
- Others:

Current Behavior

I am loading a collection of about 40 million entities with an in-memory index. I have enabled 40 query nodes, but the data is loaded onto only 2 of them. Here is my configuration; how should I adjust it to enable automatic memory balancing? d034f7bfdac8a24c666948e9965755e

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 3 weeks ago

So you are saying you have 40M entities, but only 2 query nodes hold any data?

  1. Could you provide the querycoord and querynode logs?
  2. Could you share your segment distribution?
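
For anyone gathering the segment distribution asked for above, here is a minimal sketch of how to summarize it once you have a list of (query node ID, row count) pairs per loaded segment. The function names, the input shape, and the sample numbers are all hypothetical and not taken from this issue; how you obtain the pairs (birdwatcher, Attu, or an SDK call) is left open.

```python
from collections import defaultdict


def rows_per_node(segments):
    """Sum row counts per query node from (node_id, num_rows) pairs."""
    totals = defaultdict(int)
    for node_id, num_rows in segments:
        totals[node_id] += num_rows
    return dict(totals)


def imbalance_ratio(totals, node_count):
    """Rows on the busiest node divided by the rows an even split would give."""
    total = sum(totals.values())
    if total == 0 or node_count == 0:
        return 0.0
    even_share = total / node_count
    return max(totals.values()) / even_share


# Hypothetical sample: 40M rows sitting on 2 of 40 query nodes.
segments = [(1, 20_000_000), (2, 15_000_000), (2, 5_000_000)]
totals = rows_per_node(segments)
ratio = imbalance_ratio(totals, node_count=40)
print(totals, ratio)
```

A ratio close to 1 would suggest an even spread. A large ratio, as in this made-up sample, would match the symptom described in the report.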

xiaofan-luan commented 3 weeks ago

/assign @sunby please help follow up

yesyue commented 3 weeks ago

The Attu client shows the following error message:

show collection failed: load segment failed, OOM if load, maxSegmentSize = 205.4959774017334 MB, memUsage = 121813.59168624878 MB, predictMemUsage = 122019.08766365051 MB, totalMem = 122880 MB thresholdFactor = 0.900000
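
The numbers in this message can be checked by hand. The sketch below mirrors the arithmetic as it appears in the message itself: predictMemUsage equals memUsage plus maxSegmentSize, and the load is rejected when that prediction exceeds totalMem times thresholdFactor. The comparison rule is an assumption read off the message, not code copied from Milvus, and `would_oom` is a hypothetical helper name.

```python
def would_oom(mem_usage_mb, max_segment_mb, total_mem_mb, threshold_factor):
    """Reproduce the check implied by the error message (assumed, not Milvus source)."""
    predicted = mem_usage_mb + max_segment_mb   # predictMemUsage
    limit = total_mem_mb * threshold_factor     # usable memory before rejection
    return predicted, limit, predicted > limit


# Values copied from the error message above.
predicted, limit, oom = would_oom(
    mem_usage_mb=121813.59168624878,
    max_segment_mb=205.4959774017334,
    total_mem_mb=122880,
    threshold_factor=0.9,
)
print(f"predicted={predicted:.2f} MB, limit={limit:.2f} MB, oom={oom}")
```

Under this reading, the node reporting the error is already using more than its 90% limit of a 120 GB budget before the next segment is added, which fits data piling up on very few query nodes.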

yesyue commented 3 weeks ago

segment:
  assignmentExpiration: 2000
  compactableProportion: 0.85
  diskSegmentMaxSize: 2048
  enableLevelZero: true
  expansionRate: 1.25
  maxBinlogFileNumber: 32
  maxIdleTime: 600
  maxLife: 86400
  maxSize: 1024
  minSizeFromIdleToSealed: 16
  sealProportion: 0.12
  smallProportion: 0.5

yanliang567 commented 3 weeks ago

@yesyue Could you please refer to this doc to export the full Milvus logs for investigation? Also, could you please attach an etcd backup? See https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

/assign @yesyue
/unassign

xiaofan-luan commented 3 weeks ago

From the OOM error reported above, it seems the collection cannot be loaded, possibly because the segments are unevenly distributed across query nodes.

You can use birdwatcher to check whether all segments are sealed and indexed. We cannot help without detailed logs and the information from birdwatcher.

Running the show segment command in birdwatcher can help you figure out why.

yesyue commented 2 weeks ago

Only one query node shows high memory usage, and it keeps increasing.

querynode (3).log image

sunby commented 1 week ago

Can you provide the querynode Segment Loaded Num and Queryable Entity Num metrics?

sunby commented 1 week ago

You can use birdwatcher to run the download global-distribution command, then paste the generated distribution file here.