milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Querynode restarted and search failed with error `fail to get shard leaders from QueryCoord: channel xxx is not available in any replica, err=LackSegment(segmentID=xxx)` after many chaos test #23085

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:master-20230328-dc6d4b91
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar and kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:27 - INFO - ci_test]: assert insert: 0.34269094467163086 (test_all_collections_after_chaos.py:48)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:27 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: [test][2023-03-28T18:53:27Z] [3.01850421s] DeleteChecker__Nwzfdv3E flush -> None (wrapper.py:30)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: [test][2023-03-28T18:53:30Z] [0.00299978s] DeleteChecker__Nwzfdv3E flush -> None (wrapper.py:30)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: [test][2023-03-28T18:53:30Z] [0.00273693s] DeleteChecker__Nwzfdv3E flush -> None (wrapper.py:30)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: assert flush: 3.0231077671051025, entities: 6000 (test_all_collections_after_chaos.py:58)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: index info: [{'collection': 'DeleteChecker__Nwzfdv3E', 'field': 'float_vector', 'index_name': 'index__ycrxGX7I', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - INFO - ci_test]: [test][2023-03-28T18:53:30Z] [0.00340504s] DeleteChecker__Nwzfdv3E load -> None (wrapper.py:30)

[2023-03-28T18:54:09.692Z] [2023-03-28 18:53:30 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.10784767596574515, 0.05716483509585237, 0.0212005033119737, 0.1201219592266927, 0.033323610126605674, 0.15803558535206888, 0.022157368636657893, 0.06177099366547712, 0.15505704462386524, 0.016775877910088017, 0.102997712578513, 0.1267593801562243, 0.005367321151836398, 0.04593776455799438, 0.14......, kwargs: {} (api_request.py:56)

[2023-03-28T18:54:09.693Z] [2023-03-28 18:53:50 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-03-28 18:53:30.674369', 'RPC error': '2023-03-28 18:53:50.677532'}> (decorators.py:108)

[2023-03-28T18:54:09.693Z] [2023-03-28 18:53:50 - ERROR - ci_test]: Traceback (most recent call last):

[2023-03-28T18:54:09.693Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-03-28T18:54:09.693Z]     res = func(*args, **_kwargs)

[2023-03-28T18:54:09.693Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-03-28T18:54:09.693Z]     return func(*arg, **kwargs)

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search

[2023-03-28T18:54:09.693Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-03-28T18:54:09.693Z]     raise e

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-03-28T18:54:09.693Z]     return func(*args, **kwargs)

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-03-28T18:54:09.693Z]     ret = func(self, *args, **kwargs)

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-03-28T18:54:09.693Z]     raise e

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-03-28T18:54:09.693Z]     return func(self, *args, **kwargs)

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search

[2023-03-28T18:54:09.693Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests

[2023-03-28T18:54:09.693Z]     raise pre_err

[2023-03-28T18:54:09.693Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 478, in _execute_search_requests

[2023-03-28T18:54:09.693Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-03-28T18:54:09.693Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_26_440409479125741517v0 is not available in any replica, err=NodeHeartbeatOutdated(nodeID=18, lastHeartbeat=2023-03-28 18:53:33.533994806 +0000 UTC): context done during sleep after run#6: context deadline exceeded)>

[2023-03-28T18:54:09.693Z]  (api_request.py:39)

[2023-03-28T18:54:09.693Z] [2023-03-28 18:53:50 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_27_440409479125741517v1 is not available in any replica, err=LackSegment(segmentID=440409479125741558): attempt #1: fail to get shard leader...... (api_request.py:40)

[2023-03-28T18:54:09.693Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-03-28T18:54:09.693Z] =========================== short test summary info ============================

[2023-03-28T18:54:09.693Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__XaMNsuqE] - AssertionError

[2023-03-28T18:54:09.693Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__ylTv5imQ] - AssertionError

[2023-03-28T18:54:09.693Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__Jw6RS67E] - AssertionError

[2023-03-28T18:54:09.693Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__ZdE97oOC] - AssertionError

[2023-03-28T18:54:09.693Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[DeleteChecker__Nwzfdv3E] - AssertionError

[2023-03-28T18:54:09.693Z] =================== 5 failed, 8 passed in 126.13s (0:02:06) ====================
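The error shape above (attempt #0 through attempt #6, then `context done during sleep after run#6: context deadline exceeded`) is what a deadline-bounded retry loop produces when every attempt fails. The sketch below is illustrative Python only, not Milvus's actual Go implementation; `get_shard_leaders` is a hypothetical stand-in for the QueryCoord RPC that always fails, mimicking the LackSegment/NodeHeartbeatOutdated errors in the log:

```python
import time

def get_shard_leaders():
    # Hypothetical stand-in for the QueryCoord RPC; always unavailable here,
    # like the channels in the log above.
    raise RuntimeError("channel is not available in any replica")

def search_with_retries(deadline_s=1.0, backoff_s=0.2):
    """Retry until the deadline expires, accumulating per-attempt errors,
    similar in spirit to the proxy's attempt #0..#6 loop in the log."""
    start = time.monotonic()
    errors = []
    attempt = 0
    while True:
        try:
            return get_shard_leaders()
        except RuntimeError as e:
            errors.append(f"attempt #{attempt}: {e}")
        # Give up if sleeping again would overshoot the deadline.
        if time.monotonic() - start + backoff_s > deadline_s:
            errors.append(f"context done during sleep after run#{attempt}: "
                          "context deadline exceeded")
            raise TimeoutError("; ".join(errors))
        attempt += 1
        time.sleep(backoff_s)
```

Note that the client only ever sees the aggregated message: the 20-second gap between `RPC start` and `RPC error` in the log is the retry loop exhausting its deadline, not a single slow call.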

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

chaos type: pod-kill
image tag: master-20230328-dc6d4b91
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/3015/pipeline
log: artifacts-querynode-pod-kill-3015-server-logs.tar.gz

artifacts-querynode-pod-kill-3015-pytest-logs.tar.gz

Anything else?

No response

MrPresent-Han commented 1 year ago

I will have a look at this

zhuwenxing commented 1 year ago

The querynode restarted during the test, which caused this error.

zhuwenxing commented 1 year ago

[screenshot: querynode crash log, transcribed below]

[2023/03/28 18:53:33.602 +00:00] [INFO] [querynodev2/services.go:370] ["received load segments request"] [traceID=71f28bb4ff443123bcc2907ee43bb34e] [collectionID=440409479125740917] [partitionID=440409479125740918] [shard=by-dev-rootcoord-dml_21_440409479125740917v1] [segmentID=440409479125541945] [version=0] [needTransfer=false]
[libprotobuf ERROR /home/conan/w/prod/BuildSingleReference/.conan/data/protobuf/3.21.4/_/_/build/7627fae1426bcc12a67dba7c7207b1bccf05e5fd/src/src/google/protobuf/text_format.cc:337] Error parsing text-format milvus.proto.schema.CollectionSchema: 1:1: Expected identifier, got: <
unmarshal schema string failed
Assert "schema->get_primary_field_id().has_value()" at /go/src/github.com/milvus-io/milvus/internal/core/src/common/Schema.cpp:94
 => primary key should be specified
terminate called after throwing an instance of 'milvus::SegcoreError'
  what():  Assert "schema->get_primary_field_id().has_value()" at /go/src/github.com/milvus-io/milvus/internal/core/src/common/Schema.cpp:94
 => primary key should be specified
SIGABRT: abort
PC=0x7fa66727100b m=42 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 33938 [syscall]:
non-Go function
    pc=0x7fa66727100b
non-Go function
    pc=0x7fa667250858
non-Go function
    pc=0x7fa6670e8910
non-Go function
    pc=0x7fa6670f438b
non-Go function
    pc=0x7fa6670f43f6
non-Go function
    pc=0x7fa6670f46a8
non-Go function
    pc=0x7fa667d0d6f0
non-Go function
    pc=0x7fa667b19555
runtime.cgocall(0x3007be0, 0xc002d05080)
    /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc002d05058 sp=0xc002d05020 pc=0x1278b7c
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_NewCollection(0x7fa3f7e08000)
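The crash chain in the log is: the text-format `CollectionSchema` fails to parse (`Expected identifier, got: <`), the resulting schema has no primary field, and the assert in `Schema.cpp:94` aborts the whole querynode via SIGABRT. The invariant being asserted can be sketched in Python; the dict layout below is illustrative only, not the actual proto schema, and `validate_schema` is a hypothetical helper:

```python
def validate_schema(schema: dict) -> str:
    """Minimal sketch of the invariant asserted at Schema.cpp:94:
    after unmarshalling, exactly one field must be marked as the
    primary key. Returns the primary field's name."""
    primary = [f for f in schema.get("fields", [])
               if f.get("is_primary_key")]
    if not primary:
        # In the real code this is a hard assert that aborts the process,
        # which is why the querynode restarted instead of returning an error.
        raise ValueError("primary key should be specified")
    return primary[0]["name"]
```

The key point is that a schema that failed to unmarshal trivially violates this invariant, so the root cause is the corrupted/mis-encoded schema string passed to `NewCollection`, not the primary-key check itself.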
MrPresent-Han commented 1 year ago

/assign MrPresent-Han

yanliang567 commented 1 year ago

@MrPresent-Han any updates?

MrPresent-Han commented 1 year ago

The PR above should have fixed this problem. Did it reproduce again, or can we close this issue? @zhuwenxing

zhuwenxing commented 1 year ago

With image master-20230515-836b862d, the search still failed, but the querynode no longer restarted.

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:30 - INFO - ci_test]: [test][2023-05-15T18:58:29Z] [0.34725874s] Checker__EApjUwcV insert -> (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 441496924357394433, success count: 3000, err count: 0) (wrapper.py:30)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:30 - INFO - ci_test]: assert insert: 0.3476133346557617 (test_all_collections_after_chaos.py:48)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:30 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:56)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: [test][2023-05-15T18:58:30Z] [3.01979074s] Checker__EApjUwcV flush -> None (wrapper.py:30)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: [test][2023-05-15T18:58:33Z] [0.00306592s] Checker__EApjUwcV flush -> None (wrapper.py:30)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: [test][2023-05-15T18:58:33Z] [0.00299321s] Checker__EApjUwcV flush -> None (wrapper.py:30)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: assert flush: 3.0244762897491455, entities: 22810 (test_all_collections_after_chaos.py:58)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: index info: [{'collection': 'Checker__EApjUwcV', 'field': 'float_vector', 'index_name': 'index__k8n3ap0Y', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:74)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - INFO - ci_test]: [test][2023-05-15T18:58:33Z] [0.00425942s] Checker__EApjUwcV load -> None (wrapper.py:30)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:33 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.1149621587114552, 0.06685462814498708, 0.07661291899526199, 0.017586679910027724, 0.014026415785232081, 0.02149062052149042, 0.1432756637287344, 0.03490939606006326, 0.031529282101691175, 0.05260121061001093, 0.1399928422767197, 0.014727434069132996, 0.03980788779610558, 0.15787570527853703, 0......., kwargs: {} (api_request.py:56)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:53 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): context done during sleep after run#6: context deadline exceeded)>, <Time:{'RPC start': '2023-05-15 18:58:33.258049', 'RPC error': '2023-05-15 18:58:53.261588'}> (decorators.py:108)

[2023-05-15T19:01:56.070Z] [2023-05-15 18:58:53 - ERROR - ci_test]: Traceback (most recent call last):

[2023-05-15T19:01:56.070Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-05-15T19:01:56.070Z]     res = func(*args, **_kwargs)

[2023-05-15T19:01:56.070Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-05-15T19:01:56.071Z]     return func(*arg, **kwargs)

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 666, in search

[2023-05-15T19:01:56.071Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-05-15T19:01:56.071Z]     raise e

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-05-15T19:01:56.071Z]     return func(*args, **kwargs)

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-05-15T19:01:56.071Z]     ret = func(self, *args, **kwargs)

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-05-15T19:01:56.071Z]     raise e

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-05-15T19:01:56.071Z]     return func(self, *args, **kwargs)

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 521, in search

[2023-05-15T19:01:56.071Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 490, in _execute_search_requests

[2023-05-15T19:01:56.071Z]     raise pre_err

[2023-05-15T19:01:56.071Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 481, in _execute_search_requests

[2023-05-15T19:01:56.071Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-05-15T19:01:56.071Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #1: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #2: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #3: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #4: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #5: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=2): attempt #6: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): context done during sleep after run#6: context deadline exceeded)>

[2023-05-15T19:01:56.071Z]  (api_request.py:39)

[2023-05-15T19:01:56.071Z] [2023-05-15 18:58:53 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=attempt #0: fail to get shard leaders from QueryCoord: channel by-dev-rootcoord-dml_4_441496720567250375v0 is not available in any replica, err=NodeOffline(nodeID=4): attempt #1: fail to get shard leaders from QueryCoord: ch...... (api_request.py:40)

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/4126/pipeline

log: artifacts-querynode-pod-kill-4126-server-logs.tar.gz

artifacts-querynode-pod-kill-4126-pytest-logs.tar.gz

MrPresent-Han commented 1 year ago

This chaos test still has one more failing case:

t0: search -> GetShardLeaders -> get a client for node A
t1: node A goes down
t2: node B comes up (A's and B's IPs are identical)
exception: target node id not match
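A rough sketch of that race: if the shard-client cache is keyed by address alone, the client created for node A keeps being returned after node B comes up on the same IP, and a server-side node-ID guard then rejects the call. Illustrative Python only; `check_target` and `get_client` are hypothetical names, not Milvus APIs:

```python
def check_target(expected_node_id: int, actual_node_id: int) -> None:
    """Sketch of the server-side guard that produces
    'target node id not match'."""
    if expected_node_id != actual_node_id:
        raise RuntimeError(
            f"target node id not match, expect {expected_node_id}, "
            f"got {actual_node_id}")

# A cache keyed by (address, node_id) distinguishes old node A from a new
# node B that reuses A's IP; keying by address alone would conflate them.
_clients: dict = {}

def get_client(address: str, node_id: int):
    key = (address, node_id)
    if key not in _clients:
        _clients[key] = object()  # stand-in for a real gRPC client
    return _clients[key]
```

Under this framing, the fix direction is to include the node ID in the cache key (or invalidate cached clients on membership change) so a stale client is never handed out for a restarted node at the same address.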

zhuwenxing commented 1 year ago

This is not a stable issue: with image tag master-20230517-7da5a31b, it could not be reproduced in three runs.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.