Closed: zhuwenxing closed this issue 1 year ago.
This issue is stably reproducible when the mq is Pulsar, but it is not reproduced when the mq is Kafka.
chaos type: pod-kill
image tag: 2.2.0-20230317-bbc21fe8
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2800/pipeline
log:
artifacts-querynode-pod-kill-2800-server-logs.tar.gz
artifacts-querynode-pod-kill-2800-pytest-logs.tar.gz
/assign @jiaoew1991 /unassign
/assign @weiliu1031 /unassign
It was still reproduced in 2.2.5-20230328-0c729b3c
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [0.00418141s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: assert create collection: 0.008777141571044922, init_entities: 3000 (test_all_collections_after_chaos.py:41)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_request) : [Collection.insert] args: [[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,......, kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 440412476506374146, success count: 3000, err count: 0) (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [0.33110637s] SearchChecker__IkhBYrkr insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 440412476506374146, success count: 3000, err count: 0) (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - INFO - ci_test]: assert insert: 0.33153223991394043 (test_all_collections_after_chaos.py:48)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:09 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:09Z] [2.51892531s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00451105s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00556729s] SearchChecker__IkhBYrkr flush -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: assert flush: 2.526207685470581, entities: 6000 (test_all_collections_after_chaos.py:58)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: index info: [{'collection': 'SearchChecker__IkhBYrkr', 'field': 'float_vector', 'index_name': 'index__5HYIWMzk', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request) : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - INFO - ci_test]: [test][2023-03-28T21:51:12Z] [0.00455358s] SearchChecker__IkhBYrkr load -> None (wrapper.py:30)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:12 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.04400977043543584, 0.08573560469596708, 0.12294614678739231, 0.02797168480160405, 0.04778780117876424, 0.08090326769733451, 0.13277738449640375, 0.021937854194746264, 0.11970723408568698, 0.0921315321838237, 0.07324508852649558, 0.05507059532700562, 0.14186892821346925, 0.046999774639943, 0.059......, kwargs: {} (api_request.py:56)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:22 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.736Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.736Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.736Z] attempt #8:context deadline exceeded
[2023-03-28T21:51:42.736Z]
[2023-03-28T21:51:42.736Z] attempt #2:context canceled
[2023-03-28T21:51:42.736Z] )>, <Time:{'RPC start': '2023-03-28 21:51:12.479884', 'RPC error': '2023-03-28 21:51:22.482701'}> (decorators.py:108)
[2023-03-28T21:51:42.736Z] [2023-03-28 21:51:22 - ERROR - ci_test]: Traceback (most recent call last):
[2023-03-28T21:51:42.736Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-03-28T21:51:42.736Z] res = func(*args, **_kwargs)
[2023-03-28T21:51:42.737Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-03-28T21:51:42.737Z] return func(*arg, **kwargs)
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
[2023-03-28T21:51:42.737Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-28T21:51:42.737Z] raise e
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-28T21:51:42.737Z] return func(*args, **kwargs)
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-28T21:51:42.737Z] ret = func(self, *args, **kwargs)
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-03-28T21:51:42.737Z] raise e
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-28T21:51:42.737Z] return func(self, *args, **kwargs)
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
[2023-03-28T21:51:42.737Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
[2023-03-28T21:51:42.737Z] raise pre_err
[2023-03-28T21:51:42.737Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests
[2023-03-28T21:51:42.737Z] raise MilvusException(response.status.error_code, response.status.reason)
[2023-03-28T21:51:42.737Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #2:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #3:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #4:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #5:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #6:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #7:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440412283720634918)
[2023-03-28T21:51:42.737Z] attempt #8:context deadline exceeded
[2023-03-28T21:51:42.737Z]
[2023-03-28T21:51:42.737Z] attempt #2:context canceled
[2023-03-28T21:51:42.737Z] )>
[2023-03-28T21:51:42.737Z] (api_request.py:39)
[2023-03-28T21:51:42.737Z] [2023-03-28 21:51:22 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:All attempts results:
[2023-03-28T21:51:42.737Z] attempt #1:fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_23_440412283720634906v1 is not available in any replica, err=LackSegment(segmentID=440...... (api_request.py:40)
[2023-03-28T21:51:42.737Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2023-03-28T21:51:42.737Z] =========================== short test summary info ============================
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__SH1BPjre] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__3rfDrJLQ] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__CNgyGo4N] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__urkffQ60] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__5afBSBEs] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__BgfZJUyy] - AssertionError
[2023-03-28T21:51:42.737Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__IkhBYrkr] - AssertionError
[2023-03-28T21:51:42.737Z] =================== 7 failed, 5 passed in 130.65s (0:02:10) ====================
script returned exit code 1
root cause:
- query node: when the shard leader observes a watched node going down, instead of removing that node's segments from its view directly, it marks the segments as offline; this state is then reported to QueryCoord via getDataDistribution.
- query coord: when QueryCoord finds a segment that is present in the leader view but not in the distribution, it does not tell the shard leader to sync its distribution, so the stale distribution persists indefinitely.
- when a get-shard-leaders request arrives, QueryCoord checks the leader view and raises an error saying the query node is not available.
more details: balancing generates a large number of high-priority tasks, so normal load tasks are blocked and the expected leader-view update is not triggered for a period. Any query issued during this window fails with a "channel unavailable" error.
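The failure mode described above can be sketched as a minimal Python model. All class and function names here are illustrative, not the actual Milvus internals; the point is only to show why the error repeats on every attempt once the leader view and the distribution disagree:

```python
# Minimal model of the stale-leader-view bug described above.
# Names are hypothetical; they do not mirror Milvus source code.

class ShardLeaderView:
    """What the shard leader believes it serves on a channel."""
    def __init__(self, segments):
        # segment_id -> state ("online" / "offline")
        self.segments = dict(segments)

    def mark_offline(self, segment_id):
        # On watch-node down, the leader only marks the segment
        # offline instead of dropping it from the view.
        self.segments[segment_id] = "offline"


def get_shard_leaders(leader_view, distribution):
    """QueryCoord-side check: an offline segment in the leader view
    that is absent from the distribution makes the channel unavailable."""
    for seg_id, state in leader_view.segments.items():
        if state == "offline" and seg_id not in distribution:
            # Nothing ever tells the leader to sync, so this error
            # is raised again on every retry.
            raise RuntimeError(f"LackSegment(segmentID={seg_id})")
    return ["leader-1"]


# Reproduce the retry loop seen in the logs: the distribution lost
# the segment after the querynode pod kill, but the leader view kept it.
view = ShardLeaderView({440412283720634918: "online"})
view.mark_offline(440412283720634918)
distribution = set()  # segment never re-added, load tasks are blocked

for attempt in range(1, 4):
    try:
        get_shard_leaders(view, distribution)
    except RuntimeError as e:
        print(f"attempt #{attempt}:{e}")
```

Because neither side initiates a sync, every client retry hits the same `LackSegment` error until the retry budget is exhausted, which matches the repeated `attempt #N` lines in the log above.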
please verify this. @zhuwenxing
/assign @zhuwenxing
It is not reproduced on the 2.2.0 branch.
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
all test cases passed
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2848/pipeline
log:
artifacts-querynode-pod-kill-2848-server-logs.tar.gz
artifacts-querynode-pod-kill-2848-pytest-logs.tar.gz
Anything else?
No response