milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: After a large number of deletes and inserts, the queue backlog space is exhausted, causing the collection to be blocked and unable to query. #35813

Open TonyAnn opened 2 weeks ago

TonyAnn commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.16
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

rootcoord log info: time="2024-08-28T03:21:21Z" level=error msg="[Failed to create producer]" error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded" producerID=1 producer_name=my-release-pulsar-1-30 topic="persistent://public/default/by-dev-rootcoord-dml_0" log_and_monitor_snap.zip

Expected Behavior

  1. How can the collection be unblocked for queries after the Pulsar queue backlog space is exhausted?
  2. How can a large number of Milvus deletes and inserts be optimized?

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

yanliang567 commented 2 weeks ago

@TonyAnn we recommend you upgrade to the latest Milvus release, 2.4.10.
/assign @congqixia please help check whether we have a workaround for this case

/unassign

congqixia commented 2 weeks ago
  1. From the error message, there are two possible problems here:
     a. There are some orphan subscriptions left in your Pulsar. You could use pulsarctl to check which topics and subscriptions are causing the backlog.
     b. The DataNode failed to consume and digest the DML data, causing the backlog size to keep growing.
  2. We recommend upgrading your cluster to a higher version (first to the latest 2.3, then to 2.4, to avoid major-version incompatibility), since that avoids many known issues.
  3. If you want to insert a large amount of data, the Import API is highly recommended instead of Insert.

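For step 1a, the inspection could look roughly like the sketch below, using `pulsar-admin` (pulsarctl offers equivalent subcommands). The topic name is taken from the rootcoord error log above; the subscription name placeholder is hypothetical, and exact flags may vary with your Pulsar version, so verify against your deployment before running anything destructive.

```shell
# Inspect backlog stats on the DML topic named in the error log
pulsar-admin topics stats persistent://public/default/by-dev-rootcoord-dml_0

# List subscriptions on the topic; ones with no active consumers
# but a large msgBacklog are candidates for orphan subscriptions
pulsar-admin topics subscriptions persistent://public/default/by-dev-rootcoord-dml_0

# Check the backlog quota configured on the namespace
pulsar-admin namespaces get-backlog-quotas public/default

# Only after confirming a subscription is truly orphaned (no Milvus
# component still needs it), delete it to release the backlog
pulsar-admin topics unsubscribe -s <subscription-name> persistent://public/default/by-dev-rootcoord-dml_0
```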
congqixia commented 2 weeks ago

Also, if the data in the MQ is important, we recommend updating the Pulsar configuration to stop Pulsar GC, to prevent data loss.
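As a rough sketch, the relevant knobs live in `broker.conf` (or the broker section of the Helm chart's values). The key names below exist in recent Pulsar releases, but the exact names and suitable values depend on your Pulsar version and disk capacity, so treat this as an assumption to verify, not a recommended setting:

```
# Raise the default backlog quota so producers are not blocked as quickly
backlogQuotaDefaultLimitGB=10

# producer_request_hold / producer_exception block producers when the
# quota is hit; consumer_backlog_eviction instead drops the oldest
# backlog, which loses data -- avoid it if the MQ data matters
backlogQuotaDefaultRetentionPolicy=producer_request_hold

# Retain acknowledged data long enough that it is not garbage-collected
# before DataNodes have consumed and flushed it
defaultRetentionTimeInMinutes=10080
defaultRetentionSizeInMB=8192
```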