milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_xxxv1 is not available in any replica, err=NodeOffline(nodeID=14)` after querynode pod kill chaos test #25303

Closed: zhuwenxing closed this issue 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230703-b68fa204
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-07-03T21:28:26.808Z] [2023-07-03 21:25:51 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-07-03T21:28:26.808Z] [2023-07-03 21:25:51 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-07-03T21:28:26.808Z] [2023-07-03 21:25:51 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.059717549997167094, 0.14877690933562582, 0.1489620534715377, 0.0646615772591875, 0.08848653409007677, 0.10177284678587688, 0.009483000087463788, 0.11871770799475921, 0.09544176682722542, 0.1317445061329348, 0.022889924940971623, 0.003898804375213599, 0.1302975267994167, 0.05157072989060568, 0.0......, kwargs: {} (api_request.py:56)

[2023-07-03T21:28:26.808Z] [2023-07-03 21:26:01 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)>, <Time:{'RPC start': '2023-07-03 21:25:51.260717', 'RPC error': '2023-07-03 21:26:01.262790'}> (decorators.py:108)

[2023-07-03T21:28:26.808Z] [2023-07-03 21:26:01 - ERROR - ci_test]: Traceback (most recent call last):

[2023-07-03T21:28:26.808Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-07-03T21:28:26.808Z]     res = func(*args, **_kwargs)

[2023-07-03T21:28:26.808Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-07-03T21:28:26.808Z]     return func(*arg, **kwargs)

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 682, in search

[2023-07-03T21:28:26.808Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-07-03T21:28:26.808Z]     raise e

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-07-03T21:28:26.808Z]     return func(*args, **kwargs)

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-07-03T21:28:26.808Z]     ret = func(self, *args, **kwargs)

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-07-03T21:28:26.808Z]     raise e

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-07-03T21:28:26.808Z]     return func(self, *args, **kwargs)

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 599, in search

[2023-07-03T21:28:26.808Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, **kwargs)

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 579, in _execute_search_requests

[2023-07-03T21:28:26.808Z]     raise pre_err

[2023-07-03T21:28:26.808Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 570, in _execute_search_requests

[2023-07-03T21:28:26.808Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-07-03T21:28:26.808Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): context done during sleep after run#6: context deadline exceeded: fail to search on all shard leaders)>

[2023-07-03T21:28:26.808Z]  (api_request.py:39)

[2023-07-03T21:28:26.808Z] [2023-07-03 21:26:01 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, err=NodeOffline(nodeID=14): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_13_442608860...... (api_request.py:40)
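For triage, the long retry message above can be condensed into a few fields. The following is a minimal sketch, not part of pymilvus or the test suite: `summarize` and `elapsed_seconds` are hypothetical helper names, and the regular expression assumes the exact message layout seen in this log. The sample message is rebuilt from the single repeated attempt text rather than pasted in full.

```python
import re
from datetime import datetime

# Pattern for one retry attempt, assuming the message layout seen in this log.
ATTEMPT_RE = re.compile(
    r"attempt #(\d+): fail to get shard leaders from QueryCoord: "
    r"channel (\S+) is not available in any replica, "
    r"err=NodeOffline\(nodeID=(\d+)\)"
)


def summarize(message):
    """Hypothetical helper: condense the retry chain into a small dict."""
    attempts = ATTEMPT_RE.findall(message)
    return {
        "attempts": len(attempts),
        "channels": sorted({channel for _, channel, _ in attempts}),
        "offline_nodes": sorted({int(node) for _, _, node in attempts}),
        "deadline_exceeded": "context deadline exceeded" in message,
    }


def elapsed_seconds(start, end):
    """Hypothetical helper: seconds between two pymilvus-style timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Rebuild the message from the log: attempts #0 through #6 plus the final cause.
one_attempt = (
    "fail to get shard leaders from QueryCoord: channel "
    "by-dev-rootcoord-dml_13_442608860478446661v1 is not available in any replica, "
    "err=NodeOffline(nodeID=14)"
)
message = ": ".join(f"attempt #{i}: {one_attempt}" for i in range(7)) + (
    ": context done during sleep after run#6: context deadline exceeded: "
    "fail to search on all shard leaders"
)

summary = summarize(message)
# 'RPC start' and 'RPC error' values copied from the log above.
elapsed = elapsed_seconds("2023-07-03 21:25:51.260717", "2023-07-03 21:26:01.262790")
print(summary)
print(f"elapsed: {elapsed:.6f} s")
```

Read this way, the log shows seven attempts against the same channel, all reporting node 14 as offline, with the call ending on a context deadline roughly 10 seconds after the RPC started.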

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/5481/pipeline

log:

artifacts-querynode-pod-kill-5481-server-logs (1).tar.gz

artifacts-querynode-pod-kill-5481-pytest-logs (1).tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

jiaoew1991 commented 1 year ago

/assign @smellthemoon
/unassign

smellthemoon commented 1 year ago

This may be related to #25393.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.