milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [ResourceGroup]Fail to search after transferring all replicas from rgA to rgB #22461

Closed by yanliang567 1 year ago

yanliang567 commented 1 year ago


Environment

- Milvus version: 2.2.0-20230228-91d251ab
- Deployment mode(standalone or cluster): cluster with 8 querynodes
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.3 dev3

Current Behavior

Search fails after transferring all replicas from rgA to rgB:

raise MilvusException(response.status.error_code, response.status.reason)
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
attempt #1:code: UnexpectedError, error: fail to Search, QueryNode ID=29, reason=query shard(channel)  yanliang-cluster-1cu-rootcoord-dml_228_439742840803054281v1  does not exist

attempt #2:context canceled

Expected Behavior

Search keeps succeeding.

Steps To Reproduce

1. create a collection, insert data and build index
2. prepare 2 resource groups: rgA with 2 querynodes and rgB with 3 querynodes
3. load the collection with 2 replicas into rgA
4. transfer both replicas from rgA to rgB
5. search (a pymilvus sketch of these steps follows after this list)
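
A minimal pymilvus sketch of the steps above, for reference. Assumptions: the connection endpoint, collection name, schema, and vector dim are placeholders, not taken from this report; the resource-group calls are the pymilvus 2.2.x utility APIs, and the _resource_groups kwarg on load() is assumed to route the replicas into rgA.

import random
from pymilvus import (
    Collection, CollectionSchema, FieldSchema, DataType, connections, utility,
)

connections.connect(host="localhost", port="19530")   # placeholder endpoint

# 1. create a collection, insert data and build an index
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="vec", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("rg_demo", CollectionSchema(fields))
vectors = [[random.random() for _ in range(128)] for _ in range(1000)]
collection.insert([list(range(1000)), vectors])
collection.create_index("vec", {"index_type": "IVF_FLAT", "metric_type": "L2",
                                "params": {"nlist": 128}})

# 2. prepare two resource groups: rgA with 2 query nodes, rgB with 3
utility.create_resource_group("rgA")
utility.create_resource_group("rgB")
utility.transfer_node("__default_resource_group", "rgA", 2)
utility.transfer_node("__default_resource_group", "rgB", 3)

# 3. load the collection with 2 replicas into rgA
collection.load(replica_number=2, _resource_groups=["rgA"])

# 4. transfer both replicas from rgA to rgB
utility.transfer_replica("rgA", "rgB", "rg_demo", 2)

# 5. search -- this is where the reported failure shows up
res = collection.search(
    data=[[random.random() for _ in range(128)]],
    anns_field="vec",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
)
print(res)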

Milvus Log

No response

Anything else?

If you check the replica info immediately after transfer_replica, you will find that one of the replicas occupies all 3 nodes in rgB, which I think is wrong.

[2023-02-28 10:31:06,904 - DEBUG - ci_test]: (api_response) : Replica groups:
- Group: <group_id:439742841207128226>, <group_nodes:(26, 29, 28)>, <shards:[Shard: <channel_name:yanliang-cluster-1cu-rootcoord-dml_227_439742840803054281v0>, <shard_leader:29>, <shard_nodes:[29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29]>, Shard......  (api_request.py:31)
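
For reference, one way to dump this replica placement right after the transfer is Collection.get_replicas(); the attribute names below (groups, group_nodes, shards, channel_name, shard_leader, shard_nodes) are assumed from the pymilvus 2.2.x types that produce the repr shown above.

# Sketch: inspect replica placement immediately after utility.transfer_replica(...)
replicas = collection.get_replicas()
for group in replicas.groups:
    # In the failing run, one replica reportedly lists all 3 rgB nodes here.
    print("replica group:", group.id, "nodes:", group.group_nodes)
    for shard in group.shards:
        print("  shard:", shard.channel_name,
              "leader:", shard.shard_leader,
              "nodes:", shard.shard_nodes)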
yanliang567 commented 1 year ago

It seems to be an intermittent issue, even though there are 3 nodes in group_nodes after transfer_node.

yanliang567 commented 1 year ago

/assign @weiliu1031 /unassign

yanliang567 commented 1 year ago

Logs around 2023-02-28 10:31:06 Beijing time:

yanliang-cluster-1cu-milvus-datanode-5d9b77c4c4-hfpcx             1/1     Running     0               3h28m
yanliang-cluster-1cu-milvus-indexnode-5d66d768c7-qmqxf            1/1     Running     0               3h28m
yanliang-cluster-1cu-milvus-mixcoord-78b9d986b-cjtrb              1/1     Running     0               165m
yanliang-cluster-1cu-milvus-proxy-c9b757b8-hrdnl                  1/1     Running     0               3h28m
yanliang-cluster-1cu-milvus-querynode-579b98b646-2jmp7            1/1     Running     0               3h26m
yanliang-cluster-1cu-milvus-querynode-579b98b646-4kqh5            1/1     Running     0               3h24m
yanliang-cluster-1cu-milvus-querynode-579b98b646-b9vkw            1/1     Running     0               3h28m
yanliang-cluster-1cu-milvus-querynode-579b98b646-c5xms            1/1     Running     0               3h28m
yanliang-cluster-1cu-milvus-querynode-579b98b646-cjmlt            1/1     Running     0               3h22m
yanliang-cluster-1cu-milvus-querynode-579b98b646-gpmrl            1/1     Running     0               3h26m
yanliang-cluster-1cu-milvus-querynode-579b98b646-mswj8            1/1     Running     0               3h22m
yanliang-cluster-1cu-milvus-querynode-579b98b646-trb6h            1/1     Running     0               3h22m
weiliu1031 commented 1 year ago

It seems that mixcoord has restarted, so the operation details for the collection whose search failed can't be found in the logs; we will have to wait for the next occurrence.

weiliu1031 commented 1 year ago

/assign @yanliang567

yanliang567 commented 1 year ago

Not reproduced recently; closing for now.