milvus-io / milvus


[Bug]: Search failed with error `Search 2 failed, reason query shard(channel) by-dev-rootcoord-dml_27_437271793527100077v1 does not exist` after querycoord pod kill #20480

Closed: zhuwenxing closed this issue 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: wayblink-f-19955-2-2-3e06b23-20221110 (this image is based on 2.2.0, with only some extra logging added)
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:45 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,......, kwargs: {'timeout': 120} (api_request.py:56)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:45 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: {self._timestamp}, success count: {self.succ_count}, err count: {self.err_count})  (api_request.py:31)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:45 - INFO - ci_test]: [test][2022-11-10T06:11:44Z] [0.44093080s] QueryChecker__cQYjhDVz insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: {self._timestamp}, success count: {self.succ_count}, err count: {self.err_count}) (wrapper.py:30)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:45 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - INFO - ci_test]: [test][2022-11-10T06:11:45Z] [6.03917251s] QueryChecker__cQYjhDVz load -> None (wrapper.py:30)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - INFO - ci_test]: assert load: 6.0393760204315186 (test_all_collections_after_chaos.py:93)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.12095737788309362, 0.13085225741957515, 0.0755366551088928, 0.03449753100720238, 0.03679977934211998, 0.10286344713162897, 0.08521118046357733, 0.03733801514071556, 0.13650875797130016, 0.11917972330803649, 0.08824094872629969, 0.027427916097700667, 0.07031949222189257, 0.059202136793983104, 0......., kwargs: {} (api_request.py:56)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=6, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_27_437271793527100077v1  does not exist

[2022-11-10T06:12:04.345Z]  err %!w(<nil>))>, <Time:{'RPC start': '2022-11-10 06:11:51.445723', 'RPC error': '2022-11-10 06:11:51.796740'}> (decorators.py:108)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - ERROR - ci_test]: Traceback (most recent call last):

[2022-11-10T06:12:04.345Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-11-10T06:12:04.345Z]     res = func(*args, **_kwargs)

[2022-11-10T06:12:04.345Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-11-10T06:12:04.345Z]     return func(*arg, **kwargs)

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 719, in search

[2022-11-10T06:12:04.345Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-11-10T06:12:04.345Z]     raise e

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-11-10T06:12:04.345Z]     return func(*args, **kwargs)

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-11-10T06:12:04.345Z]     ret = func(self, *args, **kwargs)

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-11-10T06:12:04.345Z]     raise e

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-11-10T06:12:04.345Z]     return func(self, *args, **kwargs)

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in search

[2022-11-10T06:12:04.345Z]     return self._execute_search_requests(requests, timeout, **_kwargs)

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 442, in _execute_search_requests

[2022-11-10T06:12:04.345Z]     raise pre_err

[2022-11-10T06:12:04.345Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 433, in _execute_search_requests

[2022-11-10T06:12:04.345Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-11-10T06:12:04.345Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=6, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_27_437271793527100077v1  does not exist

[2022-11-10T06:12:04.345Z]  err %!w(<nil>))>

[2022-11-10T06:12:04.345Z]  (api_request.py:39)

[2022-11-10T06:12:04.345Z] [2022-11-10 06:11:51 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=6, reason=Search 2 failed, reason query shard(channel)  by-dev-rootcoord-dml_27_437271793527100077v1  does not exist

[2022-11-10T06:12:04.345Z]  err %!w(<nil>))> (api_request.py:40)

[2022-11-10T06:12:04.345Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2022-11-10T06:12:04.345Z] =========================== short test summary info ============================

[2022-11-10T06:12:04.345Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__cQYjhDVz] - AssertionError

[2022-11-10T06:12:04.345Z] =================== 1 failed, 12 passed in 97.57s (0:01:37) ====================

script returned exit code 1
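
For triage, the fields in the error text above can be pulled out with a small script. This is only a sketch: the regular expressions are written against the message exactly as it appears in this log, the name `parse_shard_error` is made up for illustration, and the reading of the channel name as `<pchannel>_<collectionID>v<shard index>` is an assumption rather than something the log confirms. The number after `Search` is treated as an opaque request ID.

```python
import re
from typing import NamedTuple, Optional

# Matches the error text as it appears in the log above.
ERROR_RE = re.compile(
    r"QueryNode ID=(?P<node>\d+), reason=Search (?P<search>\d+) failed, "
    r"reason query shard\(channel\)\s+(?P<channel>\S+)\s+does not exist"
)
# Assumed channel layout: <pchannel>_<collectionID>v<shard index>.
CHANNEL_RE = re.compile(r"^(?P<pchannel>.+)_(?P<collection>\d+)v(?P<shard>\d+)$")


class ShardError(NamedTuple):
    query_node_id: int
    search_id: int
    channel: str
    pchannel: Optional[str]
    collection_id: Optional[int]
    shard_index: Optional[int]


def parse_shard_error(message: str) -> Optional[ShardError]:
    """Return the fields of a 'query shard(channel) ... does not exist' error, or None."""
    m = ERROR_RE.search(message)
    if m is None:
        return None
    channel = m.group("channel")
    c = CHANNEL_RE.match(channel)
    return ShardError(
        query_node_id=int(m.group("node")),
        search_id=int(m.group("search")),
        channel=channel,
        # These three stay None if the channel does not follow the assumed layout.
        pchannel=c.group("pchannel") if c else None,
        collection_id=int(c.group("collection")) if c else None,
        shard_index=int(c.group("shard")) if c else None,
    )


if __name__ == "__main__":
    msg = (
        "fail to search on all shard leaders, err=fail to Search, QueryNode ID=6, "
        "reason=Search 2 failed, reason query shard(channel)  "
        "by-dev-rootcoord-dml_27_437271793527100077v1  does not exist"
    )
    print(parse_shard_error(msg))
```

Grouping failures by `query_node_id` and `channel` makes it easier to check in the server logs whether that QueryNode ever watched the channel in question.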

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test/detail/chaos-test/2959/pipeline

log: artifacts-querycoord-pod-kill-2959-server-logs.tar.gz artifacts-querycoord-pod-kill-2959-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 year ago

/assign @jiaoew1991

zhuwenxing commented 1 year ago

This also happens when there is no chaos injection or other interruption. [image]

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release/detail/deploy_test_kafka_for_release/370/pipeline

log: artifacts-kafka-cluster-reinstall-370-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-reinstall-370-pytest-logs.tar.gz

[2022-11-12T13:17:31.447Z] <name>: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000

[2022-11-12T13:17:31.447Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_BIN_IVF_FLAT_i......  (api_request.py:31)

[2022-11-12T13:17:31.447Z] [2022-11-12 13:00:07 - INFO - ci_test]: inserted 3000 data into collection deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000 (common_func.py:684)

[2022-11-12T13:17:31.447Z] [2022-11-12 13:00:07 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [      int64   float varchar                                      binary_vector

[2022-11-12T13:17:31.447Z] 0         0     0.0       0  b'\x13\xa9\xf4\xcb&\x04`\x1c\xa5E-h\xb9=\x03\xfc'

[2022-11-12T13:17:31.447Z] 1         1     1.0       1     b'\xb8{%p0`\x1f\x19]\xcf\xf7\xee\xbe\xd3$\xa3'

[2022-11-12T13:17:31.447Z] 2         2     2.0       2  b'?\xed\\\xb9c0\xa6\x8d\x19[\x0e\......, kwargs: {'timeout': 120} (api_request.py:56)

[2022-11-12T13:17:31.447Z] [2022-11-12 13:00:07 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: {self._timestamp}, success count: {self.succ_count}, err count: {self.err_count})  (api_request.py:31)

[2022-11-12T13:17:31.448Z] [2022-11-12 13:00:07 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[b'\x01\xd9\xddA:\x94mt\x0c*\x97\xcc_L\xcd\xa4', b'\xa8\x91\x9d9\xd1@E\x00\x16\x9b\n\xd4]\xe6\xb2\xa6'], 'binary_vector', {'metric_type': 'HAMMING', 'params': {'nprobe': 10}}, 10, 'int64 >= 0', None, None, 120, -1], kwargs: {} (api_request.py:56)

[2022-11-12T13:17:31.448Z] [2022-11-12 13:00:09 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=9, reason=Search 5 failed, reason query shard(channel)  by-dev-rootcoord-dml_98_437323739021150586v0  does not exist

[2022-11-12T13:17:31.448Z]  err %!w(<nil>))>, <Time:{'RPC start': '2022-11-12 13:00:07.910375', 'RPC error': '2022-11-12 13:00:09.333396'}> (decorators.py:108)

[2022-11-12T13:17:31.448Z] [2022-11-12 13:00:09 - ERROR - ci_test]: Traceback (most recent call last):

[2022-11-12T13:17:31.448Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-11-12T13:17:31.448Z]     res = func(*args, **_kwargs)

[2022-11-12T13:17:31.448Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-11-12T13:17:31.448Z]     return func(*arg, **kwargs)

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 658, in search

[2022-11-12T13:17:31.448Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-11-12T13:17:31.448Z]     raise e

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-11-12T13:17:31.448Z]     return func(*args, **kwargs)

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-11-12T13:17:31.448Z]     ret = func(self, *args, **kwargs)

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-11-12T13:17:31.448Z]     raise e

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-11-12T13:17:31.448Z]     return func(self, *args, **kwargs)

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in search

[2022-11-12T13:17:31.448Z]     return self._execute_search_requests(requests, timeout, **_kwargs)

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 442, in _execute_search_requests

[2022-11-12T13:17:31.448Z]     raise pre_err

[2022-11-12T13:17:31.448Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 433, in _execute_search_requests

[2022-11-12T13:17:31.448Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-11-12T13:17:31.448Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=9, reason=Search 5 failed, reason query shard(channel)  by-dev-rootcoord-dml_98_437323739021150586v0  does not exist

[2022-11-12T13:17:31.448Z]  err %!w(<nil>))>

[2022-11-12T13:17:31.448Z]  (api_request.py:39)

[2022-11-12T13:17:31.448Z] [2022-11-12 13:00:09 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=9, reason=Search 5 failed, reason query shard(channel)  by-dev-rootcoord-dml_98_437323739021150586v0  does not exist

[2022-11-12T13:17:31.448Z]  err %!w(<nil>))> (api_request.py:40)

[2022-11-12T13:17:31.448Z] [2022-11-12 13:00:09 - INFO - ci_test]: search_results_check: checking the searching results (func_check.py:208)[get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs

[2022-11-12T13:17:31.448Z] [create_path] folder(/tmp/ci_logs) is not exist.

[2022-11-12T13:17:31.448Z] [create_path] create path now...

[2022-11-12T13:17:31.448Z] 

[2022-11-12T13:17:31.448Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2022-11-12T13:17:31.448Z] =========================== short test summary info ============================

[2022-11-12T13:17:31.448Z] FAILED testcases/test_action_first_deployment.py::TestActionFirstDeployment::test_task_all[BIN_IVF_FLAT-all-is_string_indexed-is_deleted-is_compacted-1] - TypeError: object of type 'Error' has no len()

[2022-11-12T13:17:31.448Z] ============ 1 failed, 25 passed, 24 skipped in 1252.18s (0:20:52) =============

script returned exit code 1
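
Both failing searches return well inside the 120 s timeout passed by the test, which suggests the QueryNode rejected the request outright rather than timing out. The sketch below computes the elapsed time from the `RPC start` / `RPC error` pair that pymilvus prints. The function name `rpc_seconds` is hypothetical, and the pattern assumes the timing text looks exactly like the two lines quoted in this issue.

```python
import re
from datetime import datetime

# Matches the timing dict printed next to the pymilvus RPC error in the logs above.
TIME_RE = re.compile(
    r"'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<error>[^']+)'"
)
FMT = "%Y-%m-%d %H:%M:%S.%f"


def rpc_seconds(line: str) -> float:
    """Return seconds between 'RPC start' and 'RPC error' in a log line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError("no RPC timing found in line")
    start = datetime.strptime(m.group("start"), FMT)
    end = datetime.strptime(m.group("error"), FMT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    # Timing text copied from the two failures reported in this issue.
    chaos_run = (
        "<Time:{'RPC start': '2022-11-10 06:11:51.445723', "
        "'RPC error': '2022-11-10 06:11:51.796740'}>"
    )
    deploy_run = (
        "<Time:{'RPC start': '2022-11-12 13:00:07.910375', "
        "'RPC error': '2022-11-12 13:00:09.333396'}>"
    )
    print(f"chaos run:  {rpc_seconds(chaos_run):.3f} s")   # roughly 0.35 s
    print(f"deploy run: {rpc_seconds(deploy_run):.3f} s")  # roughly 1.42 s
```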
jiaoew1991 commented 1 year ago

/assign @yah01

yah01 commented 1 year ago

/assign @zhuwenxing

Please retry with #20592 for v2.2.

zhuwenxing commented 1 year ago

Not reproduced in 2.2.0-20221126-dd10e571