milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `LackSegment(segmentID=441068488161569564): attempt #1: fail to get shard leaders from QueryCoord: channel xxx is not available in any replica` after querynode pod kill chaos test #23754

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230426-ed8836cd
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

[2023-04-26T21:11:29.971Z] [2023-04-26 21:11:07 - INFO - ci_test]: [test][2023-04-26T21:11:07Z] [0.00371271s] Hello_Milvus load -> None (wrapper.py:30)

[2023-04-26T21:11:29.971Z] [2023-04-26 21:11:07 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.10290428248812877, 0.046151130751891235, 0.1240339121047413, 0.09801574109712922, 0.14083695746990374, 0.13548004776367595, 0.11941563618992829, 0.02435913895481577, 0.10848303656545288, 0.11847882688469828, 0.06903301837996625, 0.07372333228236763, 0.03558472022715604, 0.13711253162354337, 0.0......, kwargs: {} (api_request.py:56)

[2023-04-26T21:11:29.971Z] [2023-04-26 21:11:27 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-26 21:11:07.558197', 'RPC error': '2023-04-26 21:11:27.562522'}> (decorators.py:108)

[2023-04-26T21:11:29.971Z] [2023-04-26 21:11:27 - ERROR - ci_test]: Traceback (most recent call last):

[2023-04-26T21:11:29.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-04-26T21:11:29.971Z]     res = func(*args, **_kwargs)

[2023-04-26T21:11:29.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-04-26T21:11:29.971Z]     return func(*arg, **kwargs)

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-04-26T21:11:29.971Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-04-26T21:11:29.971Z]     raise e

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-04-26T21:11:29.971Z]     return func(*args, **kwargs)

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-04-26T21:11:29.971Z]     ret = func(self, *args, **kwargs)

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-04-26T21:11:29.971Z]     raise e

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-04-26T21:11:29.971Z]     return func(self, *args, **kwargs)

[2023-04-26T21:11:29.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-04-26T21:11:29.972Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-04-26T21:11:29.972Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-04-26T21:11:29.972Z]     raise pre_err

[2023-04-26T21:11:29.972Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-04-26T21:11:29.972Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-04-26T21:11:29.972Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): context done during sleep after run#6: context deadline exceeded)>

[2023-04-26T21:11:29.972Z]  (api_request.py:39)

[2023-04-26T21:11:29.972Z] [2023-04-26 21:11:27 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_2_441068488161569552v0 is not available in any replica, err=LackSegment(segmentID=441068488161569564): attempt #1: fail to get shard leaders...... (api_request.py:40)

[2023-04-26T21:11:29.972Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-04-26T21:11:29.972Z] =========================== short test summary info ============================

[2023-04-26T21:11:29.972Z] FAILED testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default - AssertionError

[2023-04-26T21:11:29.972Z] ============================== 1 failed in 23.84s ==============================
[2023-04-26T21:14:52.857Z] [2023-04-26 21:13:28 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-04-26T21:14:52.857Z] [2023-04-26 21:13:28 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-04-26T21:14:52.857Z] [2023-04-26 21:13:28 - INFO - ci_test]: [test][2023-04-26T21:13:28Z] [0.00421599s] QueryChecker__Rr2z5zw7 load -> None (wrapper.py:30)

[2023-04-26T21:14:52.857Z] [2023-04-26 21:13:28 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.0892254549815248, 0.11904628322607419, 0.14514873691962338, 0.008512621810735449, 0.04888922894785318, 0.06777594059967976, 0.10960347448333996, 0.15561781224250007, 0.10189917425479858, 0.0045636031773030016, 0.14028805800683453, 0.08016779130237615, 0.032170005055591186, 0.03606567818555394, ......, kwargs: {} (api_request.py:56)

[2023-04-26T21:14:52.858Z] [2023-04-26 21:13:48 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-04-26 21:13:28.533876', 'RPC error': '2023-04-26 21:13:48.538158'}> (decorators.py:108)

[2023-04-26T21:14:52.858Z] [2023-04-26 21:13:48 - ERROR - ci_test]: Traceback (most recent call last):

[2023-04-26T21:14:52.858Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-04-26T21:14:52.858Z]     res = func(*args, **_kwargs)

[2023-04-26T21:14:52.858Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-04-26T21:14:52.858Z]     return func(*arg, **kwargs)

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-04-26T21:14:52.858Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-04-26T21:14:52.858Z]     raise e

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-04-26T21:14:52.858Z]     return func(*args, **kwargs)

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-04-26T21:14:52.858Z]     ret = func(self, *args, **kwargs)

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-04-26T21:14:52.858Z]     raise e

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-04-26T21:14:52.858Z]     return func(self, *args, **kwargs)

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-04-26T21:14:52.858Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-04-26T21:14:52.858Z]     raise pre_err

[2023-04-26T21:14:52.858Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-04-26T21:14:52.858Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-04-26T21:14:52.858Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): context done during sleep after run#6: context deadline exceeded)>

[2023-04-26T21:14:52.858Z]  (api_request.py:39)

[2023-04-26T21:14:52.858Z] [2023-04-26 21:13:48 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_25_441068488161570802v1 is not available in any replica, err=LackSegment(segmentID=441068488161570823): attempt #1: fail to get shard leader...... (api_request.py:40)

[2023-04-26T21:14:52.858Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-04-26T21:14:52.858Z] =========================== short test summary info ============================

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__1Fi7O87s] - AssertionError

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__bzgD0nP8] - AssertionError

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__JHur2GIK] - AssertionError

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__30VGtZGc] - AssertionError

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__t22veAXb] - AssertionError

[2023-04-26T21:14:52.858Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__Rr2z5zw7] - AssertionError

[2023-04-26T21:14:52.858Z] =================== 6 failed, 6 passed in 216.69s (0:03:36) ====================
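For triage, each failing search above reports the same channel and segment ID on every retry attempt. A minimal sketch for condensing such a message into a short summary follows. The regular expressions are an assumption based only on the message text seen in this log, and `summarize_search_error` is a hypothetical helper, not part of pymilvus.

```python
import re

# Assumption: patterns mirror the error text in the log above; other Milvus
# versions may word the message differently.
ATTEMPT_RE = re.compile(
    r"attempt #(?P<attempt>\d+): fail to get shard leaders from QueryCoord: "
    r"channel (?P<channel>\S+) is not available in any replica"
)
SEGMENT_RE = re.compile(r"LackSegment\(segmentID=(?P<segment>\d+)\)")


def summarize_search_error(message: str) -> dict:
    """Condense a repeated 'fail to get shard leaders' message (hypothetical helper)."""
    attempts = list(ATTEMPT_RE.finditer(message))
    return {
        # Number of retry attempts reported in the message.
        "attempts": len(attempts),
        # Distinct DML channels reported as unavailable.
        "channels": sorted({m.group("channel") for m in attempts}),
        # Distinct segment IDs reported by LackSegment.
        "segments": sorted({int(m.group("segment")) for m in SEGMENT_RE.finditer(message)}),
        # True when the retries ended because the request deadline expired.
        "deadline_exceeded": "context deadline exceeded" in message,
    }
```

Passing the `message` text of the exception raised by `Collection.search` to this helper would give one line per failed checker instead of the full repeated text.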

Expected Behavior

All test cases pass.

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/3766/pipeline

log:

artifacts-querynode-pod-kill-3766-server-logs.tar.gz

artifacts-querynode-pod-kill-3766-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @sunby
/unassign

zhuwenxing commented 1 year ago

It was not reproduced in master-20230510-0c99399f
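When re-checking on later builds, it may help to tell a slow recovery apart from a persistent failure. In the log above, each failing search gave up about 20 seconds after `RPC start`. A minimal sketch of a client-side wrapper that keeps retrying for a longer, caller-chosen window is shown below. `retry_until` and its parameters are hypothetical and not part of pymilvus or the test framework; the wrapped callable is assumed to raise an exception on failure.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def retry_until(
    fn: Callable[[], T],
    deadline_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call fn until it succeeds or the deadline would be exceeded (hypothetical helper).

    Returns the result of fn and the number of calls made.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_err = err
        # Stop if another wait would go past the deadline.
        if clock() - start + interval_s > deadline_s:
            raise TimeoutError(
                f"still failing after {attempts} attempt(s)"
            ) from last_err
        sleep(interval_s)
```

For example, wrapping the search call as `retry_until(lambda: collection.search(...), deadline_s=120)` (with the actual search arguments filled in) would report how many attempts were needed before the search succeeded after the querynode restart.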