milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: After milvus runs for a day, the overall response becomes slow and the service is unavailable #26307

Closed Richard-lrg closed 1 year ago

Richard-lrg commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.12
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): Java SDK
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory: 32 cores / 128 GB per node, 3 nodes
- GPU: no
- Others:

Current Behavior

After Milvus runs for about a day, overall response becomes slow and the service becomes unavailable. The error message in the response is: DEADLINE_EXCEEDED: deadline exceeded after 19.999893177s

Usage background: a large number of collections, roughly 500, were created on that day, and all of them use the FLAT index.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

milvus log info: https://pan.baidu.com/s/1lNSwDdKLzm-0raNY-fDR-A?pwd=nmnu

Anything else?

No response

xiaofan-luan commented 1 year ago

/assign @aoiasd

xiaofan-luan commented 1 year ago

/assign @sunby

xiaofan-luan commented 1 year ago

We have to fix the collection-number issue; at the very least, the limit should not be only a few hundred collections.

Richard-lrg commented 1 year ago

We have to fix the collection-number issue; at the very least, the limit should not be only a few hundred collections.

What does this mean? Is there any way to solve this problem now?

yanliang567 commented 1 year ago

@Cactus-L could you please share more info about your 500 collections? Any chance you could use the partition key feature as a workaround or a better solution for now?

/unassign

yanliang567 commented 1 year ago

@Cactus-L could you please share more info about your 500 collections? Any chance you could use the partition key feature as a workaround or a better solution for now? Click here for more info about the partition key: https://milvus.io/docs/partition_key.md#Partition-Key

/unassign
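For reference, here is a minimal pymilvus sketch of the partition-key approach suggested above (the collection name, field names, dimension, and num_partitions value are illustrative assumptions, not taken from this thread; the reporter uses the Java SDK, which offers equivalent options):

```python
# Sketch only: one collection with a partition-key field can replace many small
# per-tenant collections (partition key requires Milvus >= 2.2.9).
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    # Rows are routed to partitions by this field's value.
    FieldSchema(name="tenant_id", dtype=DataType.INT64, is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields, description="one collection shared by all tenants")

# num_partitions caps how many physical partitions back the partition key.
collection = Collection("shared_collection", schema, num_partitions=64)

# Searches then filter on the partition-key field instead of picking a collection:
# collection.search(vectors, "embedding", params, limit=10, expr="tenant_id == 42")
```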

Richard-lrg commented 1 year ago

@Cactus-L could you please share more info about your 500 collections? Any chance you could use the partition key feature as a workaround or a better solution for now? Click here for more info about the partition key: https://milvus.io/docs/partition_key.md#Partition-Key

OK, I can try the partition-key approach. Can you tell me the main cause of the current problem? I will try my best to avoid it.

aoiasd commented 1 year ago

OK, I can try the partition-key approach. Can you tell me the main cause of the current problem? I will try my best to avoid it.

Every shard of a collection sends an empty message to the message queue to push its time tick, so a large number of collections makes this produce/consume traffic very frequent. Setting a larger proxy.timeTickInterval can reduce the frequency, but searches and queries will wait longer if you use strong consistency.
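A hedged pymilvus sketch of the consistency trade-off just mentioned: relaxing the consistency level per request keeps a search from waiting on the newest time tick (the collection name, field name, and vector dimension below are assumptions):

```python
# Sketch only: with a larger proxy.timeTickInterval, Strong consistency makes a
# search wait for the latest time tick; Bounded/Eventually avoids that wait.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
collection = Collection("shared_collection")  # assumed existing collection

results = collection.search(
    data=[[0.1] * 128],                          # assumed 128-dim query vector
    anns_field="embedding",                      # assumed vector field name
    param={"metric_type": "L2", "params": {}},
    limit=10,
    consistency_level="Bounded",                 # "Strong" would wait on time ticks
)
```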

Also, because we limit the number of message-queue topics, if the total shard count exceeds your dmlChannelNum, some shards will share the same message-queue topic, read data that is invalid for them (it belongs to other shards), and put extra pressure on the message queue. Setting a larger rootCoord.dmlChannelNum can help.
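Both knobs above (proxy.timeTickInterval and rootCoord.dmlChannelNum) are server-side settings in milvus.yaml or the corresponding Helm/Operator config overrides. On the client side, one way to keep the total shard count below dmlChannelNum when creating many collections is to use a single shard per collection; a hedged pymilvus sketch (collection name, schema, and shard count are illustrative assumptions):

```python
# Sketch only: ~500 collections with the default 2 shards each means ~1000 DML
# shards competing for dmlChannelNum topics; shards_num=1 halves that number.
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
schema = CollectionSchema(fields)

collection = Collection(
    "tenant_collection_001",  # assumed collection name
    schema=schema,
    shards_num=1,             # fewer shards per collection -> fewer time-tick streams
)
```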

Richard-lrg commented 1 year ago

got it, thx

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.