milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded #38030

Open TonyAnn opened 2 days ago

TonyAnn commented 2 days ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.21
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS): CentOS
- CPU/Memory:
- GPU:
- Others:

Current Behavior

The client request timed out

rootcoord throws the following error:

[ERROR] [retry/retry.go:46] ["retry func failed"] ["retry time"=4] [error="server error: ProducerBlockedQuotaExceededException: Cannot create producer on topic with backlog quota exceeded"]

Stack trace:
github.com/milvus-io/milvus/pkg/util/retry.Do
    /workspace/source/pkg/util/retry/retry.go:46
github.com/milvus-io/milvus/pkg/mq/msgstream.(mqMsgStream).AsProducer
    /workspace/source/pkg/mq/msgstream/mq_msgstream.go:144
github.com/milvus-io/milvus/internal/rootcoord.newDmlChannels
    /workspace/source/internal/rootcoord/dml_channels.go:201
github.com/milvus-io/milvus/internal/rootcoord.newTimeTickSync
    /workspace/source/internal/rootcoord/timeticksync.go:121
github.com/milvus-io/milvus/internal/rootcoord.(Core).initInternal
    /workspace/source/internal/rootcoord/root_coord.go:473
github.com/milvus-io/milvus/internal/rootcoord.(Core).Init.func1.1
    /workspace/source/internal/rootcoord/root_coord.go:530
sync.(Once).doSlow
    /usr/local/go/src/sync/once.go:74
sync.(Once).Do
    /usr/local/go/src/sync/once.go:65
github.com/milvus-io/milvus/internal/rootcoord.(Core).Init.func1
    /workspace/source/internal/rootcoord/root_coord.go:529
github.com/milvus-io/milvus/internal/util/sessionutil.(Session).ProcessActiveStandBy
    /workspace/source/internal/util/sessionutil/session_util.go:1103
github.com/milvus-io/milvus/internal/rootcoord.(Core).Register.func2
    /workspace/source/internal/rootcoord/root_coord.go:283

At the same time, the checkpoints of some collections are not being updated in time. The latest checkpoint of each physical channel:

pchannel: by-dev-rootcoord-dml_0,  latest checkpoint ts: 2024-11-25 15:49:49.874 +0800 CST
pchannel: by-dev-rootcoord-dml_10, latest checkpoint ts: 2024-11-13 20:18:21.874 +0800 CST
pchannel: by-dev-rootcoord-dml_14, latest checkpoint ts: 2024-11-21 22:52:58.473 +0800 CST
pchannel: by-dev-rootcoord-dml_1,  latest checkpoint ts: 2024-11-26 15:35:59.111 +0800 CST
pchannel: by-dev-rootcoord-dml_11, latest checkpoint ts: 2024-11-26 15:35:44.211 +0800 CST
pchannel: by-dev-rootcoord-dml_13, latest checkpoint ts: 2024-11-21 23:40:58.474 +0800 CST
pchannel: by-dev-rootcoord-dml_4,  latest checkpoint ts: 2024-11-25 14:51:15.273 +0800 CST
pchannel: by-dev-rootcoord-dml_2,  latest checkpoint ts: 2024-11-22 13:59:01.873 +0800 CST
pchannel: by-dev-rootcoord-dml_7,  latest checkpoint ts: 2024-11-23 15:09:22.274 +0800 CST
pchannel: by-dev-rootcoord-dml_9,  latest checkpoint ts: 2024-11-23 15:10:02.673 +0800 CST
pchannel: by-dev-rootcoord-dml_15, latest checkpoint ts: 2024-11-25 14:57:53.673 +0800 CST
pchannel: by-dev-rootcoord-dml_8,  latest checkpoint ts: 2024-11-19 14:16:38.274 +0800 CST
pchannel: by-dev-rootcoord-dml_12, latest checkpoint ts: 2024-11-21 16:17:38.074 +0800 CST
pchannel: by-dev-rootcoord-dml_6,  latest checkpoint ts: 2024-11-26 15:45:44.117 +0800 CST

vchannel: doesn't exist in collection: 453281916976337927
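
Note: the Pulsar-side backlog behind this error can be inspected directly. The commands below are a hedged sketch rather than output from this cluster: they assume Milvus is using Pulsar's default public/default tenant and namespace and the by-dev-rootcoord-dml_* topic names shown above, and that pulsar-admin (or the equivalent pulsarctl subcommands) is available on a broker pod.

# Show the backlog quota configured on the namespace Milvus writes to.
pulsar-admin namespaces get-backlog-quotas public/default

# Show per-subscription backlog (msgBacklog) on one of the DML channels,
# e.g. dml_8, whose checkpoint is almost a week old in the list above.
pulsar-admin topics stats persistent://public/default/by-dev-rootcoord-dml_8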

Expected Behavior

Please help me locate the root cause, and explain how to recover from the situation where these checkpoints are not being updated.

Steps To Reproduce

No response

Milvus Log

milvus-log.tar.gz

Anything else?

No response

xiaofan-luan commented 2 days ago

@TonyAnn

xiaofan-luan commented 2 days ago

Try using the pulsarctl tool to find out who the subscriber of this topic is; removing the pulsar topic should resolve this problem.
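
For illustration, a minimal sketch of how the subscribers of the blocked topic could be listed, assuming the default public/default tenant and namespace and one of the by-dev-rootcoord-dml_* topics from the report above (pulsar-admin shown; pulsarctl provides equivalent topic subcommands):

# List the subscriptions attached to the topic; each Milvus consumer
# (e.g. a datanode) appears here under its own subscription name.
pulsar-admin topics subscriptions persistent://public/default/by-dev-rootcoord-dml_8

# The stats output shows which subscription is holding the backlog.
pulsar-admin topics stats persistent://public/default/by-dev-rootcoord-dml_8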

yanliang567 commented 2 days ago

/assign @TonyAnn

TonyAnn commented 2 days ago

@xiaofan-luan Regarding "use the pulsarctl tool to find the subscriber of this topic and remove the pulsar topic": I have a question. If I manually clean up the pulsar topic, will it cause data loss? My understanding is that the current error comes from a backlog that built up because the datanode did not consume messages in time. Is that correct?
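
For reference only (not a maintainer answer): Pulsar can also drop the backlog of a single stale subscription instead of removing the whole topic. A sketch under the same tenant/namespace assumptions as above; <stale-subscription-name> is a placeholder for whichever subscription the topic stats show as lagging. Skipping still discards the backlogged messages for that subscription, so it is not automatically loss-free either.

# Discard all backlogged messages for one stale subscription only,
# leaving the topic and the other subscriptions untouched.
pulsar-admin topics skip-all persistent://public/default/by-dev-rootcoord-dml_8 \
    --subscription <stale-subscription-name>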

yanliang567 commented 38 minutes ago

@LoveEachDay any comments?

/assign @LoveEachDay
/unassign