milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: ShardCluster failed to add new-added node due to etcd watch failed #23637

Closed MrPresent-Han closed 1 year ago

MrPresent-Han commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.7
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After scaling out a cluster from 13 query nodes to 15, segments should be balanced onto the newly added nodes. Instead, the query node reports 'failed to load segment, node not in cluster'.

Expected Behavior

Segments should be balanced smoothly to the new nodes.

Steps To Reproduce

No response

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22qa-milvus%5C%22,%20pod%3D%5C%22fouramf-c2g64-62-7774-milvus-querycoord-75666d4c4d-gvv4m%5C%22%7D%20%7C%3D%20%5C%22failed%20to%20load%20segment%5C%22%22%7D%5D,%22range%22:%7B%22from%22:%221682067600000%22,%22to%22:%221682073000000%22%7D%7D

Anything else?

It's clear that the new node 21 has been added into the replica on querycoord zdGSOxNEoP,

but the etcd watch failed and the query node keeps rewatching continuously:

2023-04-21 18:59:39 | [2023/04/21 10:59:39.556 +00:00] [WARN] [querynode/shard_node_detector.go:137] ["watch channel closed, retry..."]
2023-04-21 18:59:39 | [2023/04/21 10:59:39.556 +00:00] [WARN] [querynode/shard_node_detector.go:137] ["watch channel closed, retry..."]

Furthermore, restarting the query node resolves the problem.
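The warning above comes from a watch-and-retry loop: when the etcd watch channel is closed by the client, the node detector logs a warning and re-establishes the watch. The sketch below is a simplified, hypothetical model of that pattern (not the actual Milvus code; `event`, `watchLoop`, and `newWatch` are stand-ins for the etcd watch response and `Watch` call) to illustrate how a watch source that closes its channel on every attempt produces the endless "retry..." loop seen in the logs:

```go
package main

import "fmt"

// event stands in for an etcd watch response.
type event struct{ key string }

// watchLoop mimics the rewatch pattern: drain the watch channel, and when
// it is closed, log a warning and re-establish the watch via newWatch.
// It returns the number of rewatch attempts performed.
func watchLoop(newWatch func() <-chan event, maxRetries int) int {
	retries := 0
	for retries <= maxRetries {
		ch := newWatch()
		for ev := range ch {
			fmt.Println("handled event:", ev.key)
		}
		// Channel closed: this is the point where the real detector
		// emits "watch channel closed, retry..." and loops forever
		// if the underlying watch can never be re-established.
		fmt.Println("watch channel closed, retry...")
		retries++
	}
	return retries
}

func main() {
	calls := 0
	// Simulated watch source: the first channel delivers one event and
	// then closes (mimicking the etcd failure in this issue); the second
	// closes immediately, so the loop gives up after maxRetries.
	newWatch := func() <-chan event {
		calls++
		ch := make(chan event, 1)
		if calls == 1 {
			ch <- event{key: "node-21-added"}
		}
		close(ch)
		return ch
	}
	retries := watchLoop(newWatch, 1)
	fmt.Println("total rewatches:", retries)
}
```

In the real failure the rewatch never succeeds until the query node is restarted, which is why the warning repeats with the same timestamp.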

MrPresent-Han commented 1 year ago

/assign @congqixia please take a look at this problem

yanliang567 commented 1 year ago

/unassign

congqixia commented 1 year ago

@MrPresent-Han @wangting0128 patch has been merged. Could you please verify?

congqixia commented 1 year ago

/unassign /assign @wangting0128

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
