milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: All search failed with error `fail to get shard leaders from QueryCoord: collection xxx is not fully loaded` after reinstallation or upgrade #24287

Status: Closed (closed by zhuwenxing 1 year ago)

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230519-0b72cf2c
- Deployment mode (standalone or cluster): both
- MQ type (rocksmq, pulsar or kafka): pulsar and kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-05-21T10:01:15.875Z] + python3 scripts/action_after_reinstall.py --host 10.101.199.124 --data_size 3000

[2023-05-21T10:01:17.813Z] 2023-05-21 10:01:17.475 | INFO     | MainThread |__main__:<module>:45 - data size: 3000

[2023-05-21T10:01:17.813Z] 2023-05-21 10:01:17.531 | INFO     | MainThread |utils:get_collections:63 - 

[2023-05-21T10:01:17.813Z] List collections...

[2023-05-21T10:01:17.813Z] 2023-05-21 10:01:17.656 | INFO     | MainThread |utils:get_collections:65 - collections_nums: 5

[2023-05-21T10:01:18.068Z] 2023-05-21 10:01:18.037 | INFO     | MainThread |utils:get_collections:74 - task_1_FLAT: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.060 | INFO     | MainThread |utils:get_collections:74 - task_1_HNSW: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.073 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_FLAT: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.080 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_SQ8: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.088 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_PQ: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.088 | INFO     | MainThread |utils:load_and_search:197 - search data starts

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.088 | INFO     | MainThread |utils:get_collections:63 - 

[2023-05-21T10:01:18.322Z] List collections...

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.090 | INFO     | MainThread |utils:get_collections:65 - collections_nums: 5

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.097 | INFO     | MainThread |utils:get_collections:74 - task_1_FLAT: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.103 | INFO     | MainThread |utils:get_collections:74 - task_1_HNSW: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.109 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_FLAT: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.116 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_SQ8: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.122 | INFO     | MainThread |utils:get_collections:74 - task_1_IVF_PQ: 6000

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.125 | INFO     | MainThread |utils:load_and_search:201 - collection name: task_1_FLAT

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.125 | INFO     | MainThread |utils:load_and_search:202 - load collection

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.129 | INFO     | MainThread |utils:load_and_search:206 - load time: 0.0035

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.161 | INFO     | MainThread |utils:load_and_search:220 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-05-21T10:01:18.322Z] 2023-05-21 10:01:18.161 | INFO     | MainThread |utils:load_and_search:223 - 

[2023-05-21T10:01:18.322Z] Search...

[2023-05-21T10:01:40.195Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-05-21 10:01:18.161461', 'RPC error': '2023-05-21 10:01:38.167097'}>

[2023-05-21T10:01:40.195Z] Traceback (most recent call last):

[2023-05-21T10:01:40.195Z]   File "scripts/action_after_reinstall.py", line 46, in <module>

[2023-05-21T10:01:40.195Z]     task_1(data_size, host)

[2023-05-21T10:01:40.195Z]   File "scripts/action_after_reinstall.py", line 14, in task_1

[2023-05-21T10:01:40.195Z]     load_and_search(prefix)

[2023-05-21T10:01:40.195Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 226, in load_and_search

[2023-05-21T10:01:40.195Z]     res = c.search(

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 666, in search

[2023-05-21T10:01:40.195Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-05-21T10:01:40.195Z]     raise e

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-05-21T10:01:40.196Z]     return func(*args, **kwargs)

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-05-21T10:01:40.196Z]     ret = func(self, *args, **kwargs)

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-05-21T10:01:40.196Z]     raise e

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-05-21T10:01:40.196Z]     return func(self, *args, **kwargs)

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 521, in search

[2023-05-21T10:01:40.196Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 490, in _execute_search_requests

[2023-05-21T10:01:40.196Z]     raise pre_err

[2023-05-21T10:01:40.196Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 481, in _execute_search_requests

[2023-05-21T10:01:40.196Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-05-21T10:01:40.196Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 441623827145556478 is not fully loaded: context done during sleep after run#6: context deadline exceeded)>

script returned exit code 1
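
For triage, the long error string above can be reduced to a few numbers. The sketch below is only an illustration: `summarize` and both regular expressions are hypothetical helpers written for this log format, not part of pymilvus, and `SAMPLE` rebuilds the first error line from this report instead of reading the real log file.

```python
import re
from datetime import datetime

# Rebuilt copy of the first "RPC error" line from this report (attempts #0..#6).
SAMPLE = (
    "RPC error: [search], <MilvusException: (code=1, "
    "message=fail to search on all shard leaders, err="
    + "".join(
        f"attempt #{i}: fail to get shard leaders from QueryCoord: "
        "collection 441623827145556478 is not fully loaded: "
        for i in range(7)
    )
    + "context done during sleep after run#6: context deadline exceeded)>, "
    "<Time:{'RPC start': '2023-05-21 10:01:18.161461', "
    "'RPC error': '2023-05-21 10:01:38.167097'}>"
)

# Hypothetical patterns matching the message layout seen in this log.
ATTEMPT_RE = re.compile(
    r"attempt #(\d+): fail to get shard leaders from QueryCoord: "
    r"collection (\d+) is not fully loaded"
)
TIME_RE = re.compile(r"'RPC (start|error)': '([^']+)'")


def summarize(message):
    """Return collection IDs, attempt count, and RPC duration for one error line."""
    attempts = ATTEMPT_RE.findall(message)
    times = {
        key: datetime.strptime(value, "%Y-%m-%d %H:%M:%S.%f")
        for key, value in TIME_RE.findall(message)
    }
    elapsed = None
    if "start" in times and "error" in times:
        elapsed = (times["error"] - times["start"]).total_seconds()
    return {
        "collection_ids": sorted({cid for _, cid in attempts}),
        "attempts": len(attempts),
        "elapsed_seconds": elapsed,
    }


summary = summarize(SAMPLE)
print(summary)
```

For this line the summary shows seven attempts against a single collection ID, all within roughly 20 seconds between RPC start and RPC error.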

[2023-05-21T10:01:16.164Z] + python3 scripts/second_recall_test.py --host 10.101.199.124

[2023-05-21T10:01:18.063Z] 2023-05-21 10:01:17.760 | INFO     | __main__:search_test:53 - recall test for index type HNSW

[2023-05-21T10:01:18.318Z] 2023-05-21 10:01:18.251 | INFO     | __main__:search_test:63 - 

[2023-05-21T10:01:18.318Z] Search...

[2023-05-21T10:01:40.195Z] RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-05-21 10:01:18.252191', 'RPC error': '2023-05-21 10:01:38.461851'}>

[2023-05-21T10:01:40.195Z] Traceback (most recent call last):

[2023-05-21T10:01:40.195Z]   File "scripts/second_recall_test.py", line 103, in <module>

[2023-05-21T10:01:40.195Z]     search_test(host, index_type)

[2023-05-21T10:01:40.195Z]   File "scripts/second_recall_test.py", line 65, in search_test

[2023-05-21T10:01:40.195Z]     res = collection.search(

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 666, in search

[2023-05-21T10:01:40.195Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-05-21T10:01:40.195Z]     raise e

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-05-21T10:01:40.195Z]     return func(*args, **kwargs)

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-05-21T10:01:40.195Z]     ret = func(self, *args, **kwargs)

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-05-21T10:01:40.195Z]     raise e

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-05-21T10:01:40.195Z]     return func(self, *args, **kwargs)

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 521, in search

[2023-05-21T10:01:40.195Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 490, in _execute_search_requests

[2023-05-21T10:01:40.195Z]     raise pre_err

[2023-05-21T10:01:40.195Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 481, in _execute_search_requests

[2023-05-21T10:01:40.195Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-05-21T10:01:40.195Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #1: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #2: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #3: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #4: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #5: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: attempt #6: fail to get shard leaders from QueryCoord: collection 441623827145756500 is not fully loaded: context done during sleep after run#6: context deadline exceeded)>

script returned exit code 1
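
Both scripts call search right after a load that returns almost immediately (`load time: 0.0035` in the first log). A test script could guard the search by polling the load state first. The sketch below is a client-side workaround idea only; it does not address the root cause and is not how the linked fix works. `search_when_loaded`, `is_loaded_fn`, and `search_fn` are hypothetical names, and the callables are injected so that the example runs without a Milvus server.

```python
import time


def search_when_loaded(search_fn, is_loaded_fn, timeout=60.0, interval=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll is_loaded_fn until it returns True, then call search_fn once.

    Raises TimeoutError if the collection is still not loaded after `timeout`
    seconds. `sleep` and `clock` are injectable to keep the helper testable.
    """
    deadline = clock() + timeout
    while not is_loaded_fn():
        if clock() >= deadline:
            raise TimeoutError("collection not fully loaded before timeout")
        sleep(interval)
    return search_fn()


# Stand-in callables: the collection reports "loaded" on the third poll.
states = iter([False, False, True])
result = search_when_loaded(
    search_fn=lambda: "hits",
    is_loaded_fn=lambda: next(states),
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(result)
```

In a real script, `is_loaded_fn` would wrap whatever load-state check the installed SDK offers. pymilvus provides `utility.wait_for_loading_complete`, which may serve the same purpose, although whether it avoids this particular failure is not confirmed in this thread.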

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/810/pipeline

log:

artifacts-pulsar-cluster-reinstall-810-server-logs.tar.gz

artifacts-pulsar-cluster-reinstall-810-pytest-logs.tar.gz

Anything else?

Some other failed jobs:

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/812/pipeline

log:

artifacts-pulsar-cluster-upgrade-812-server-logs.tar.gz

artifacts-pulsar-cluster-upgrade-812-pytest-logs.tar.gz

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

jiaoew1991 commented 1 year ago

This is the same issue as https://github.com/milvus-io/milvus/issues/23936.

jiaoew1991 commented 1 year ago

/unassign
/assign @zhuwenxing

Please verify it with #24300.

zhuwenxing commented 1 year ago

Verified with master-20230524-b06e8815.