milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: wrong cp lag metrics #35588

Open pingliu opened 3 weeks ago

pingliu commented 3 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.9
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

Screenshot 2024-08-20 16 19 40

Expected Behavior

No response

Steps To Reproduce

Upgrade from 2.3.x to 2.4.9.

Milvus Log

No response

Anything else?

No response

pingliu commented 3 weeks ago

/assign @XuanYang-cn

yanliang567 commented 3 weeks ago

/unassign

XuanYang-cn commented 3 weeks ago

The channel checkpoint meta lifecycle is buggy. Checkpoints are often left in the meta even after their collections are dropped, and the creation and deletion of the metrics are also chaotic.

Here are the rules I need to check:

  1. When a collection is created, the channel watch info and channel cp should be created.
  2. When a collection is dropped, the channel watch info and channel cp should be dropped.
  3. When DC recovers a channel, the channel cp and channel watch info should be recovered for VALID collections.
  4. When DC drops a channel, the channel cp and channel watch info should be removed.
  5. Only DN's updateChannelCheckpoint is able to update the channel checkpoint.
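
As a rough illustration of what checking rules 1-4 could look like, here is a minimal Python sketch. It is not Milvus code: `MetaSnapshot`, `find_violations`, and the channel names are all hypothetical, and the meta is modeled as plain in-memory dictionaries rather than the real etcd layout. Rule 5 concerns who may write a checkpoint and is not covered.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MetaSnapshot:
    """Hypothetical in-memory view of the meta; not Milvus's real schema."""
    # IDs of collections that still exist (the VALID collections)
    live_collections: Set[int] = field(default_factory=set)
    # channel name -> owning collection ID
    watch_infos: Dict[str, int] = field(default_factory=dict)
    # channel name -> owning collection ID (the position itself is omitted)
    checkpoints: Dict[str, int] = field(default_factory=dict)


def find_violations(meta: MetaSnapshot) -> Dict[str, List[str]]:
    """List channels whose watch info / checkpoint state breaks rules 1-4."""
    live = meta.live_collections
    return {
        # Rules 2 and 4: records that outlived their collection
        "orphan_checkpoints": sorted(
            ch for ch, coll in meta.checkpoints.items() if coll not in live
        ),
        "orphan_watch_infos": sorted(
            ch for ch, coll in meta.watch_infos.items() if coll not in live
        ),
        # Rules 1 and 3: live channels missing one of the two records
        "missing_checkpoints": sorted(
            ch for ch, coll in meta.watch_infos.items()
            if coll in live and ch not in meta.checkpoints
        ),
        "missing_watch_infos": sorted(
            ch for ch, coll in meta.checkpoints.items()
            if coll in live and ch not in meta.watch_infos
        ),
    }


if __name__ == "__main__":
    # Collection 1 is alive; collections 2 and 3 were dropped but left records behind.
    snapshot = MetaSnapshot(
        live_collections={1},
        watch_infos={"ch-1": 1, "ch-2": 2},
        checkpoints={"ch-1": 1, "ch-2": 2, "ch-3": 3},
    )
    for kind, channels in find_violations(snapshot).items():
        print(kind, channels)
```

A leftover checkpoint for a dropped collection never advances, which is presumably why it shows up as an ever-growing cp lag in the metrics.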

xiaofan-luan commented 3 weeks ago

do we need to hack channel cp meta to fix the problem for now?

XuanYang-cn commented 3 weeks ago

@xiaofan-luan I think so. I believe this is causing numerous false alarms, which is very annoying. See milvus-io/birdwatcher#303.