milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `fail to get shard leaders from QueryCoord: channel xxx is not available in any replica, err=NodeOffline(nodeID=6)` after querynode pod kill chaos test #22874

Closed (zhuwenxing closed this issue 1 year ago)

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.0-20230320-61692278
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: [test][2023-03-20T21:48:11Z] [3.03053532s] Checker__HR2tspeW flush -> None (wrapper.py:30)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: [test][2023-03-20T21:48:14Z] [0.00698065s] Checker__HR2tspeW flush -> None (wrapper.py:30)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: [test][2023-03-20T21:48:14Z] [0.00360574s] Checker__HR2tspeW flush -> None (wrapper.py:30)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: assert flush: 3.0396857261657715, entities: 19020 (test_all_collections_after_chaos.py:58)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: index info: [{'collection': 'Checker__HR2tspeW', 'field': 'float_vector', 'index_name': 'index__mnykDUZd', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - INFO - ci_test]: [test][2023-03-20T21:48:14Z] [0.00448959s] Checker__HR2tspeW load -> None (wrapper.py:30)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:14 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.03182161123272601, 0.08796431210103264, 0.014589720336119251, 0.09313150977568185, 0.028573969443960852, 0.09639338186923978, 0.09082574398990807, 0.018454380323319308, 0.03608302634609229, 0.14636737050714285, 0.06430762239087241, 0.030251217214846738, 0.10285532016496038, 0.1341571927041005, ......, kwargs: {} (api_request.py:56)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:24 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-03-20T21:49:03.928Z] attempt #1:All attempts results:

[2023-03-20T21:49:03.928Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #8:context deadline exceeded

[2023-03-20T21:49:03.928Z] 

[2023-03-20T21:49:03.928Z] attempt #2:context canceled

[2023-03-20T21:49:03.928Z] )>, <Time:{'RPC start': '2023-03-20 21:48:14.685011', 'RPC error': '2023-03-20 21:48:24.687576'}> (decorators.py:108)

[2023-03-20T21:49:03.928Z] [2023-03-20 21:48:24 - ERROR - ci_test]: Traceback (most recent call last):

[2023-03-20T21:49:03.928Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-03-20T21:49:03.928Z]     res = func(*args, **_kwargs)

[2023-03-20T21:49:03.928Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-03-20T21:49:03.928Z]     return func(*arg, **kwargs)

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-03-20T21:49:03.928Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-03-20T21:49:03.928Z]     raise e

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-03-20T21:49:03.928Z]     return func(*args, **kwargs)

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-03-20T21:49:03.928Z]     ret = func(self, *args, **kwargs)

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-03-20T21:49:03.928Z]     raise e

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-03-20T21:49:03.928Z]     return func(self, *args, **kwargs)

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-03-20T21:49:03.928Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-03-20T21:49:03.928Z]     raise pre_err

[2023-03-20T21:49:03.928Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-03-20T21:49:03.928Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-03-20T21:49:03.928Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-03-20T21:49:03.928Z] attempt #1:All attempts results:

[2023-03-20T21:49:03.928Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.928Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.929Z] attempt #8:context deadline exceeded

[2023-03-20T21:49:03.929Z] 

[2023-03-20T21:49:03.929Z] attempt #2:context canceled

[2023-03-20T21:49:03.929Z] )>

[2023-03-20T21:49:03.929Z]  (api_request.py:39)

[2023-03-20T21:49:03.929Z] [2023-03-20 21:48:24 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-03-20T21:49:03.929Z] attempt #1:All attempts results:

[2023-03-20T21:49:03.929Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_14_440231040547491749v0 is not available in any replica, err=NodeOffline(nodeID=12)

[2023-03-20T21:49:03.929Z] at...... (api_request.py:40)

[2023-03-20T21:49:03.929Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-20T21:49:03.929Z] =========================== short test summary info ============================

[2023-03-20T21:49:03.929Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__04lMSNU8] - AssertionError

[2023-03-20T21:49:03.929Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__BKdAj6N7] - AssertionError

[2023-03-20T21:49:03.929Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__4ADLg3hj] - AssertionError

[2023-03-20T21:49:03.929Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__CDGnoBBh] - AssertionError

[2023-03-20T21:49:03.929Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__HR2tspeW] - AssertionError

[2023-03-20T21:49:03.929Z] =================== 5 failed, 7 passed in 135.30s (0:02:15) ====================
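
For reference, the flow the checker runs (and the call that fails) looks roughly like the sketch below. The collection name, vector field, and index parameters are taken from the log above; the connection endpoint and vector dimension are assumptions.

```python
# Minimal sketch of the checker flow from the log above (pymilvus 2.2.x).
# Collection name, field name, and index params come from the log;
# the connection endpoint and vector dimension are assumptions.
import random
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")  # assumed Milvus proxy endpoint

collection = Collection("Checker__HR2tspeW")  # collection created earlier by the checker
collection.flush(timeout=120)

# Index parameters as reported in the log
collection.create_index(
    field_name="float_vector",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 48, "efConstruction": 500},
    },
)

collection.load(replica_number=1, timeout=120)

# This search is the call that fails with
# "fail to get shard leaders from QueryCoord: ... NodeOffline(nodeID=12)"
vectors = [[random.random() for _ in range(128)]]  # dimension assumed to be 128
res = collection.search(
    data=vectors,
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"ef": 64}},
    limit=10,
    timeout=10,
)
print(res)
```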

Expected Behavior

All test cases pass.

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2848/pipeline

log:

artifacts-querynode-pod-kill-2848-server-logs.tar.gz

artifacts-querynode-pod-kill-2848-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 year ago

This issue reproduces consistently when the MQ is Pulsar, but it does not reproduce when the MQ is Kafka.

zhuwenxing commented 1 year ago

chaos type: pod-kill
image tag: 2.2.0-20230317-bbc21fe8
target pod: querynode
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2800/pipeline
log: artifacts-querynode-pod-kill-2800-server-logs.tar.gz
artifacts-querynode-pod-kill-2800-pytest-logs.tar.gz

yanliang567 commented 1 year ago

/assign @jiaoew1991 /unassign

jiaoew1991 commented 1 year ago

/assign @weiliu1031 /unassign

weiliu1031 commented 1 year ago

Root cause:

  1. Query node: when the shard leader observes that a node has gone down, it does not remove the segments on that node directly; it marks them as offline and reports that state to QueryCoord via GetDataDistribution.
  2. QueryCoord: when QueryCoord finds a segment that is in the leader view but not in the distribution, it does not tell the shard leader to sync its distribution, so the stale distribution persists indefinitely.
  3. When a GetShardLeaders request arrives, the leader view is checked and an error is raised saying the query node is not available (a toy sketch of this interaction follows this list).
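
A toy model of the interaction described above, in plain Python; the class and method names are illustrative only, not Milvus internals:

```python
# Toy model of the stale-leader-view problem described in the comment above.
# All names are illustrative; this is not Milvus code.

class ShardLeaderView:
    """State the shard leader reports to QueryCoord via GetDataDistribution."""
    def __init__(self, segments):
        self.segments = dict(segments)  # segment_id -> "online" / "offline"

    def on_node_down(self, segment_ids):
        # Step 1: segments on the dead node are marked offline, not removed.
        for seg in segment_ids:
            self.segments[seg] = "offline"


class QueryCoord:
    def __init__(self, leader_view, distribution):
        self.leader_view = leader_view    # last reported leader view
        self.distribution = distribution  # segments QueryCoord believes are loaded

    def reconcile(self):
        # Step 2: stale segments (in the leader view but not in the distribution)
        # are detected, but no "sync distribution" command is sent to the leader,
        # so the leader view never recovers.
        return [s for s in self.leader_view.segments if s not in self.distribution]

    def get_shard_leaders(self, channel):
        # Step 3: the availability check sees offline segments in the leader
        # view and rejects the request.
        if any(state == "offline" for state in self.leader_view.segments.values()):
            raise RuntimeError(
                f"channel {channel} is not available in any replica, err=NodeOffline")
        return ["shard-leader-node"]


view = ShardLeaderView({"seg-1": "online", "seg-2": "online"})
view.on_node_down(["seg-2"])                      # querynode pod killed by chaos
coord = QueryCoord(view, distribution={"seg-1"})
coord.reconcile()                                 # stale segment noticed, nothing synced
try:
    coord.get_shard_leaders("by-dev-rootcoord-dml_14_xxx_v0")
except RuntimeError as err:
    print(err)  # the error the client keeps seeing on every retry
```
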
zhuwenxing commented 1 year ago

It was still reproduced in 2.2.5-20230328-0c729b3c

[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [0.00418141s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: assert create collection: 0.008777141571044922, init_entities: 3000 (test_all_collections_after_chaos.py:41)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,......, kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 440412476506374146, success count: 3000, err count: 0)  (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [0.33110637s] SearchChecker__IkhBYrkr insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 440412476506374146, success count: 3000, err count: 0) (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: assert insert: 0.33153223991394043 (test_all_collections_after_chaos.py:48)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [2.51892531s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00451105s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00556729s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: assert flush: 2.526207685470581, entities: 6000 (test_all_collections_after_chaos.py:58)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: index info: [{'collection': 'SearchChecker__IkhBYrkr', 'field': 'float_vector', 'index_name': 'index__5HYIWMzk', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00455358s] SearchChecker__IkhBYrkr load -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.04400977043543584, 0.08573560469596708, 0.12294614678739231, 0.02797168480160405, 0.04778780117876424, 0.08090326769733451, 0.13277738449640375, 0.021937854194746264, 0.11970723408568698, 0.0921315321838237, 0.07324508852649558, 0.05507059532700562, 0.14186892821346925, 0.046999774639943, 0.059......, kwargs: {} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:22 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.736Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.736Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #8:context deadline exceeded
[2023-03-28T21:51:42.736Z] 
[2023-03-28T21:51:42.736Z] attempt #2:context canceled
[2023-03-28T21:51:42.736Z] )>, <Time:{'RPC start': '2023-03-28 21:51:12.479884', 'RPC error': '2023-03-28 21:51:22.482701'}> (decorators.py:108)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:22 - ERROR - ci_test]: Traceback (most recent call last):
[2023-03-28T21:51:42.736Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-03-28T21:51:42.736Z]     res = func(*args, **_kwargs)
[2023-03-28T21:51:42.737Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-03-28T21:51:42.737Z]     return func(*arg, **kwargs)
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
[2023-03-28T21:51:42.737Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-28T21:51:42.737Z]     raise e
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-28T21:51:42.737Z]     return func(*args, **kwargs)
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-28T21:51:42.737Z]     ret = func(self, *args, **kwargs)
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-03-28T21:51:42.737Z]     raise e
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-28T21:51:42.737Z]     return func(self, *args, **kwargs)
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
[2023-03-28T21:51:42.737Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
[2023-03-28T21:51:42.737Z]     raise pre_err
[2023-03-28T21:51:42.737Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
[2023-03-28T21:51:42.737Z]     raise MilvusException(response.status.error_code, response.status.reason)
[2023-03-28T21:51:42.737Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #8:context deadline exceeded
[2023-03-28T21:51:42.737Z] 
[2023-03-28T21:51:42.737Z] attempt #2:context canceled
[2023-03-28T21:51:42.737Z] )>
[2023-03-28T21:51:42.737Z]  (api_request.py:39)
[2023-03-28T21:51:42.737Z] [2023-03-28 21:51:22 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440...... (api_request.py:40)
[2023-03-28T21:51:42.737Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2023-03-28T21:51:42.737Z] =========================== short test summary info ============================
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__SH1BPjre] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__3rfDrJLQ] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__CNgyGo4N] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__urkffQ60] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__5afBSBEs] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__BgfZJUyy] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__IkhBYrkr] - AssertionError
[2023-03-28T21:51:42.737Z] =================== 7 failed, 5 passed in 130.65s (0:02:10) ====================
script returned exit code 1

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/3056/pipeline

log: artifacts-querynode-pod-kill-3056-server-logs.tar.gz

artifacts-querynode-pod-kill-3056-pytest-logs.tar.gz

weiliu1031 commented 1 year ago

Root cause (same as described above):

  1. Query node: when the shard leader observes that a node has gone down, it does not remove the segments on that node directly; it marks them as offline and reports that state to QueryCoord via GetDataDistribution.
  2. QueryCoord: when QueryCoord finds a segment that is in the leader view but not in the distribution, it does not tell the shard leader to sync its distribution, so the stale distribution persists.
  3. When a GetShardLeaders request arrives, the leader view is checked and an error is raised saying the query node is not available.

More details: balancing generates a large number of high-priority tasks, so normal load tasks are blocked and the expected leader-view update is not triggered for a period of time. A query issued during that window fails with the channel-unavailable error (a toy illustration of this scheduling effect follows).
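
A toy illustration of that scheduling effect in plain Python; the priorities and task names are made up and do not reflect the real QueryCoord scheduler:

```python
import heapq

# Toy priority queue: lower number = higher priority. Balance tasks keep
# arriving at high priority, so the load task that would refresh the leader
# view stays queued, and GetShardLeaders keeps failing in the meantime.
BALANCE_PRIORITY, LOAD_PRIORITY = 0, 1

queue = []
for i in range(3):
    heapq.heappush(queue, (BALANCE_PRIORITY, f"balance-{i}"))
heapq.heappush(queue, (LOAD_PRIORITY, "load: update leader view"))

for tick in range(5):
    # New balance work arrives before the scheduler reaches the load task.
    heapq.heappush(queue, (BALANCE_PRIORITY, f"balance-extra-{tick}"))
    _, task = heapq.heappop(queue)
    print(f"tick {tick}: running {task}")             # always a balance task

print("still pending:", [name for _, name in queue])  # the load task never ran
```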

weiliu1031 commented 1 year ago

please verify this. @zhuwenxing

weiliu1031 commented 1 year ago

/assign @zhuwenxing

zhuwenxing commented 1 year ago

It is not reproduced on the 2.2.0 branch.