milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [ResourceGroup] After recovering from QueryCoord Pod Kill, failed to get shard leaders from QueryCoord: collection is not fully loaded #22318

Closed ThreadDao closed 1 year ago

ThreadDao commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230221-b7c0d12d
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):   pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.0.dev38
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy a cluster with 4 querynodes.
  2. Create resource group RG_0 and transfer 1 querynode from the default resource group into RG_0; create resource group RG_1 and transfer 1 querynode from the default resource group into RG_1.
  3. Create collection ResourceGroup_111.
  4. Insert 10,000 entities and flush.
  5. Create index {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 48, "efConstruction": 500}}.
  6. Load collection ResourceGroup_111 with 2 replicas into the two resource groups [RG_0, RG_1].
  7. Search succeeds.
  8. Kill the querycoord pod.
  9. Sleep 6 minutes.
  10. Search returns an error:
    
    [2023-02-21T04:08:57.326Z] [2023-02-21 04:08:57 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
    [2023-02-21T04:08:57.326Z] attempt #1:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #2:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #3:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #4:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #5:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #6:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #7:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.326Z] attempt #8:context deadline exceeded
    [2023-02-21T04:08:57.326Z] )>, <Time:{'RPC start': '2023-02-21 04:08:47.221457', 'RPC error': '2023-02-21 04:08:57.282384'}> (decorators.py:108)
    [2023-02-21T04:08:57.326Z] [2023-02-21 04:08:57 - ERROR - ci_test]: Traceback (most recent call last):
    [2023-02-21T04:08:57.326Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
    [2023-02-21T04:08:57.326Z]     res = func(*args, **_kwargs)
    [2023-02-21T04:08:57.326Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
    [2023-02-21T04:08:57.326Z]     return func(*arg, **kwargs)
    [2023-02-21T04:08:57.326Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 614, in search
    [2023-02-21T04:08:57.326Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,
    [2023-02-21T04:08:57.326Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
    [2023-02-21T04:08:57.326Z]     raise e
    [2023-02-21T04:08:57.326Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
    [2023-02-21T04:08:57.326Z]     return func(*args, **kwargs)
    [2023-02-21T04:08:57.326Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
    [2023-02-21T04:08:57.326Z]     ret = func(self, *args, **kwargs)
    [2023-02-21T04:08:57.327Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
    [2023-02-21T04:08:57.327Z]     raise e
    [2023-02-21T04:08:57.327Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
    [2023-02-21T04:08:57.327Z]     return func(self, *args, **kwargs)
    [2023-02-21T04:08:57.327Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 483, in search
    [2023-02-21T04:08:57.327Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
    [2023-02-21T04:08:57.327Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 452, in _execute_search_requests
    [2023-02-21T04:08:57.327Z]     raise pre_err
    [2023-02-21T04:08:57.327Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 443, in _execute_search_requests
    [2023-02-21T04:08:57.327Z]     raise MilvusException(response.status.error_code, response.status.reason)
    [2023-02-21T04:08:57.327Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
    [2023-02-21T04:08:57.327Z] attempt #1:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #2:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #3:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #4:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #5:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #6:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #7:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
    [2023-02-21T04:08:57.327Z] attempt #8:context deadline exceeded
    [2023-02-21T04:08:57.327Z] )>
    [2023-02-21T04:08:57.327Z] (api_request.py:39)
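
The retry pattern in the error above is easier to read once the attempts are grouped by reason. Below is a minimal sketch using only the Python standard library. `summarize_attempts` and `rpc_elapsed_seconds` are hypothetical helper names, not part of pymilvus, and `SAMPLE` is an abridged copy of the message from the log.

```python
import re
from collections import Counter
from datetime import datetime

# Abridged copy of the error message above (attempts #3 to #6 omitted).
SAMPLE = """fail to search on all shard leaders, err=All attempts results:
attempt #1:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
attempt #2:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
attempt #7:fail to get shard leaders from QueryCoord: collection 439602829507890674 is not fully loaded
attempt #8:context deadline exceeded
"""

# One match per "attempt #N:<reason>" entry; "." does not cross line breaks.
ATTEMPT_RE = re.compile(r"attempt #(\d+):(.+)")


def summarize_attempts(message: str) -> Counter:
    """Count how often each failure reason appears across retry attempts."""
    return Counter(m.group(2).strip() for m in ATTEMPT_RE.finditer(message))


def rpc_elapsed_seconds(start: str, error: str) -> float:
    """Seconds between the 'RPC start' and 'RPC error' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    delta = datetime.strptime(error, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


summary = summarize_attempts(SAMPLE)
elapsed = rpc_elapsed_seconds("2023-02-21 04:08:47.221457",
                              "2023-02-21 04:08:57.282384")
```

Applied to the full log, this would group seven "not fully loaded" attempts and one "context deadline exceeded", all within roughly 10 seconds between RPC start and RPC error.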


After about 3 hours, the collection is still not fully loaded: `get_query_segment_info` returns an empty list and `get_replicas` returns an error:

    c.get_query_segment_info('ResourceGroup_111')
    []
    c.get_replicas('ResourceGroup_111')
    RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get channels, collection not loaded[CollectionNotFound])>, <Time:{'RPC start': '2023-02-21 14:00:47.369003', 'RPC error': '2023-02-21 14:00:47.411205'}>
    Traceback (most recent call last):
      File "", line 1, in
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/client/stub.py", line 1047, in get_replicas
        return handler.get_replicas(collection_name, timeout=timeout, **kwargs)
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/decorators.py", line 109, in handler
        raise e
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/decorators.py", line 105, in handler
        return func(*args, **kwargs)
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/decorators.py", line 136, in handler
        ret = func(self, *args, **kwargs)
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/decorators.py", line 85, in handler
        raise e
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/decorators.py", line 50, in handler
        return func(self, *args, **kwargs)
      File "/Users/nausicca/.virtualenvs/milvus/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 1041, in get_replicas
        raise MilvusException(response.status.error_code, response.status.reason)
    pymilvus.exceptions.MilvusException: <MilvusException: (code=15, message=failed to get replica info, err=failed to get channels, collection not loaded[CollectionNotFound])>
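
A fixed sleep followed by a single check cannot tell slow recovery apart from no recovery at all. One way to make that distinction is a deadline-bound poll, sketched below. `wait_until` is a hypothetical helper, not a pymilvus function. The probe passed to it is assumed to be a caller-supplied callable, for example one that wraps the `c.get_replicas('ResourceGroup_111')` call from the session above and returns `True` on success.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool],
               timeout_s: float,
               interval_s: float = 5.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed.

    Exceptions raised by probe() are treated as "not ready yet", so an
    RPC error such as "collection not loaded" simply triggers a retry.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # not ready yet; fall through to the deadline check
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a fake clock; in normal use the defaults apply.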


By the way, the collection `ResourceGroup_222`, which was loaded into the default resource group, can be searched normally after the chaos recovered, but `get_replicas` also raises an exception:

<MilvusException: (code=15, message=failed to get replica info, err=failed to get shard leader for shard collectionID:439602829507890740 channelName:"by-dev-rootcoord-dml_7_439602829507890740v1" seek_position:<channel_name:"by-dev-rootcoord-dml_7_439602829507890740v1" msgID:"\010\007\020=\030\000 \000" msgGroup:"datanode-8-by-dev-rootcoord-dml_7_439602829507890740v1-true" timestamp:439602877892657154 > flushedSegmentIds:439602829507890750 , the collection not loaded or leader is offline[NodeNotFound(0)])
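
The two `get_replicas` failures share `code=15` but end in different bracketed reason tags. The sketch below pulls both fields out of the exception text, assuming only the message format visible in this report. `ParsedError` and `parse_error` are hypothetical names, and the second sample is abridged (the shard details are replaced by `...`).

```python
import re
from typing import NamedTuple, Optional


class ParsedError(NamedTuple):
    code: int
    reason: Optional[str]  # e.g. "CollectionNotFound", or None if absent


CODE_RE = re.compile(r"code=(\d+)")
# Reason tag directly before the closing parenthesis, e.g. "[NodeNotFound(0)])"
REASON_RE = re.compile(r"\[([A-Za-z]+(?:\(\d+\))?)\]\)")


def parse_error(text: str) -> ParsedError:
    """Extract the numeric code and trailing reason tag from an error string."""
    code = CODE_RE.search(text)
    if code is None:
        raise ValueError("no 'code=' field found")
    reason = REASON_RE.search(text)
    return ParsedError(int(code.group(1)), reason.group(1) if reason else None)


not_loaded = parse_error(
    "<MilvusException: (code=15, message=failed to get replica info, "
    "err=failed to get channels, collection not loaded[CollectionNotFound])>"
)
leader_offline = parse_error(
    "<MilvusException: (code=15, message=failed to get replica info, "
    "err=failed to get shard leader for shard ... , "
    "the collection not loaded or leader is offline[NodeNotFound(0)])"
)
```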


Expected Behavior

Searching the collection loaded into `RG_0` and `RG_1` should succeed.
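
To check this behavior against a live cluster, steps 2 through 7 of Current Behavior can be sketched with pymilvus as below. Several details are assumptions, since the issue does not state them: the field names `id` and `vec`, the vector dimension, the host and port, and the search parameter `ef`. The `_resource_groups` keyword of `load` should be confirmed against the installed pymilvus version. `make_vectors` and `setup_and_search` are hypothetical helper names.

```python
import random

COLLECTION = "ResourceGroup_111"
RESOURCE_GROUPS = ["RG_0", "RG_1"]
DEFAULT_RG = "__default_resource_group"  # name as printed in the resource group info
INDEX_PARAMS = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 48, "efConstruction": 500}}
DIM = 128  # assumption: the issue does not state the vector dimension


def make_vectors(n: int, dim: int = DIM, seed: int = 0):
    """Generate n reproducible random float vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def setup_and_search(host: str = "localhost", port: str = "19530"):
    """Create the resource groups, load with 2 replicas, and run one search."""
    # Imported lazily so this module loads without pymilvus installed.
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections, utility)

    connections.connect(alias="default", host=host, port=port)
    for rg in RESOURCE_GROUPS:
        utility.create_resource_group(rg)
        utility.transfer_node(DEFAULT_RG, rg, 1)  # move 1 querynode into rg

    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),    # assumed name
        FieldSchema("vec", DataType.FLOAT_VECTOR, dim=DIM),    # assumed name
    ])
    coll = Collection(COLLECTION, schema)
    vectors = make_vectors(10_000)
    coll.insert([list(range(len(vectors))), vectors])
    coll.flush()
    coll.create_index("vec", INDEX_PARAMS)
    coll.load(replica_number=2, _resource_groups=RESOURCE_GROUPS)
    return coll.search(data=vectors[:1], anns_field="vec",
                       param={"metric_type": "L2", "params": {"ef": 64}},
                       limit=10)
```

Running the same search again after the querycoord pod has been killed and restarted is the check that failed in this report.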

Steps To Reproduce

```markdown
Jenkins pipeline: chaos-test-resource-group kill querycoord
```

Milvus Log

- chaos type: pod-kill
- image tag: master-20230221-b7c0d12d
- target pod: querycoord
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-resource-group/detail/chaos-test-resource-group/8/pipeline
- log: artifacts-querycoord-pod-kill-8-server-logs.tar.gz, artifacts-querycoord-pod-kill-8-pytest-logs.tar.gz

Anything else?

No response

ThreadDao commented 1 year ago

By the way, the resource group info reports that the collection is loaded:

ResourceGroupInfo:
        <name:RG_0>, 
        <capacity:1>, 
        <num_available_node:1>, 
        <num_loaded_replica:{'ResourceGroup_111': 1}>, 
        <num_outgoing_node:{}>, 
        <num_incoming_node:{}>

ResourceGroupInfo:
        <name:RG_1>, 
        <capacity:1>, 
        <num_available_node:1>, 
        <num_loaded_replica:{'ResourceGroup_111': 1}>, 
        <num_outgoing_node:{}>, 
        <num_incoming_node:{}> 

ResourceGroupInfo:
        <name:__default_resource_group>, 
        <capacity:1000000>, 
        <num_available_node:2>, 
        <num_loaded_replica:{'Checker__CuPg5fdN': 1, 'SearchChecker__SklasCg6': 1, 'Hello_Milvus': 1, 'Checker__y2GOEz7y': 1, 'DeleteChecker__tDpoe0M2': 1, 'InsertChecker__rzJLYNVb': 1, 'QueryChecker__F8VgFuBz': 1, 'IndexChecker__8BWNEE5P': 1, 'FlushChecker__tuTHHnqt': 1, 'ResourceGroup_222': 2, 'Checker__G8vrZ3zA': 1, 'Checker__ABlBpC4n': 1, 'Checker__T8KwV6BY': 1, 'CreateChecker__cS1h9MmY': 1}>, 
        <num_outgoing_node:{}>, 
        <num_incoming_node:{}>

yanliang567 commented 1 year ago

/assign @weiliu1031
/unassign

weiliu1031 commented 1 year ago

Fixed in #22370. Please verify.

weiliu1031 commented 1 year ago

/assign @yanliang567

yanliang567 commented 1 year ago

/assign @ThreadDao
Please help verify the fix.

ThreadDao commented 1 year ago

Fixed. Verified on 2.2.0-20230228-3e560841.
Jenkins job: https://qa-jenkins.milvus.io/job/chaos-test-resource-group/75/