milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `attempt #1:err: rpc error: code = Unavailable i/o timeout` after querynode pod kill chaos test #22419

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.0-20230224-8f1bcc37
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): kafka
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior


[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - INFO - ci_test]: assert flush: 3.0204169750213623, entities: 9000 (test_data_persistence.py:46)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - INFO - ci_test]: index info: [{'collection': 'Hello_Milvus', 'field': 'float_vector', 'index_name': 'test_jdarZcq1', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Hello_Milvus', 'field': 'varchar', 'index_name': 'test_CXK63Mhq', 'index_param': {'index_type': 'Trie'}}] (test_data_persistence.py:65)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - INFO - ci_test]: [test][2023-02-26T23:13:11Z] [0.00393131s] Hello_Milvus load -> None (wrapper.py:30)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:11 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.12701550971699796, 0.12344526610707085, 0.1098368089403498, 0.03658770368870876, 0.08132087537392825, 0.01345613669507725, 0.00870447018715795, 0.06999797831288418, 0.10946947737999886, 0.04306615677614872, 0.0926669860146939, 0.027888171358313826, 0.08561984190722548, 0.0010762648594856434, 0......., kwargs: {} (api_request.py:56)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:16 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-02-26T23:13:17.152Z] attempt #1:err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.102.7.193:21123: i/o timeout"

[2023-02-26T23:13:17.152Z] , /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace

[2023-02-26T23:13:17.152Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:285 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2023-02-26T23:13:17.152Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search

[2023-02-26T23:13:17.152Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard

[2023-02-26T23:13:17.152Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1

[2023-02-26T23:13:17.152Z] /usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit

[2023-02-26T23:13:17.152Z] 

[2023-02-26T23:13:17.152Z] attempt #2:context canceled

[2023-02-26T23:13:17.152Z] )>, <Time:{'RPC start': '2023-02-26 23:13:11.660165', 'RPC error': '2023-02-26 23:13:16.662656'}> (decorators.py:108)

[2023-02-26T23:13:17.152Z] [2023-02-26 23:13:16 - ERROR - ci_test]: Traceback (most recent call last):

[2023-02-26T23:13:17.152Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-02-26T23:13:17.152Z]     res = func(*args, **_kwargs)

[2023-02-26T23:13:17.152Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-02-26T23:13:17.152Z]     return func(*arg, **kwargs)

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 614, in search

[2023-02-26T23:13:17.152Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-02-26T23:13:17.152Z]     raise e

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-02-26T23:13:17.152Z]     return func(*args, **kwargs)

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-02-26T23:13:17.152Z]     ret = func(self, *args, **kwargs)

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-02-26T23:13:17.152Z]     raise e

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-02-26T23:13:17.152Z]     return func(self, *args, **kwargs)

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 483, in search

[2023-02-26T23:13:17.152Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2023-02-26T23:13:17.152Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 452, in _execute_search_requests

[2023-02-26T23:13:17.152Z]     raise pre_err

[2023-02-26T23:13:17.153Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 443, in _execute_search_requests

[2023-02-26T23:13:17.153Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-02-26T23:13:17.153Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-02-26T23:13:17.153Z] attempt #1:err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.102.7.193:21123: i/o timeout"

[2023-02-26T23:13:17.153Z] , /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace

[2023-02-26T23:13:17.153Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:285 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2023-02-26T23:13:17.153Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search

[2023-02-26T23:13:17.153Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard

[2023-02-26T23:13:17.153Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1

[2023-02-26T23:13:17.153Z] /usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit

[2023-02-26T23:13:17.153Z] 

[2023-02-26T23:13:17.153Z] attempt #2:context canceled

[2023-02-26T23:13:17.153Z] )>

[2023-02-26T23:13:17.153Z]  (api_request.py:39)

[2023-02-26T23:13:17.153Z] [2023-02-26 23:13:16 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=All attempts results:

[2023-02-26T23:13:17.153Z] attempt #1:err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.102.7.193:21123: i/o timeout"

[2023-02-26T23:13:17.153Z] , /go/src/github.com/milvus-io/milvus/internal/uti...... (api_request.py:40)

[2023-02-26T23:13:17.153Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-02-26T23:13:17.153Z] =========================== short test summary info ============================

[2023-02-26T23:13:17.153Z] FAILED testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default - AssertionError

[2023-02-26T23:13:17.153Z] ============================== 1 failed in 9.70s ===============================
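The error message above lists one result per attempt, and the first attempt names the address the proxy could not dial. A minimal sketch of pulling that information out of the message text follows. The function name `summarize_attempts` and both regular expressions are illustrative assumptions, not part of Milvus or pymilvus; the sample text is copied from the log above.

```python
import re

# Error text copied from the log above (only the lines that matter).
ERROR_TEXT = '''fail to search on all shard leaders, err=All attempts results:
attempt #1:err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.102.7.193:21123: i/o timeout"
attempt #2:context canceled'''

# Hypothetical patterns, written for this message layout only.
ATTEMPT_RE = re.compile(r"^attempt #(\d+):(.*)$", re.MULTILINE)
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def summarize_attempts(text):
    """Return a list of (attempt_number, host, port, is_timeout) tuples."""
    summary = []
    for number, detail in ATTEMPT_RE.findall(text):
        dial = DIAL_RE.search(detail)
        if dial:
            host, port = dial.group(1), int(dial.group(2))
        else:
            # No dial address on this attempt line (e.g. "context canceled").
            host, port = None, None
        summary.append((int(number), host, port, "i/o timeout" in detail))
    return summary


for row in summarize_attempts(ERROR_TEXT):
    print(row)
```

Here the first attempt timed out dialing `10.102.7.193:21123`, and the second attempt was canceled before it reported any address.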

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/2280/pipeline

log: artifacts-querynode-pod-kill-2280-server-logs.tar.gz

artifacts-querynode-pod-kill-2280-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

jiaoew1991 commented 1 year ago

/assign @sunby
/unassign

jiaoew1991 commented 1 year ago

This looks the same as #22435. Please verify it with https://github.com/milvus-io/milvus/pull/22470

/assign @zhuwenxing
/unassign @sunby

zhuwenxing commented 1 year ago

Verified and passed with 2.2.0-20230310-b2ece6a5.
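When re-running a check like this against a cluster that has just had a querynode pod killed, it can help to tell a transient failure apart from a persistent one by retrying the search a few times. The sketch below is one way to do that and is not how the Milvus test suite works. `search_with_retry` and its parameters are hypothetical, and `search_fn` stands in for any zero-argument callable that wraps the real search call.

```python
import time


def search_with_retry(search_fn, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call search_fn() until it succeeds or the attempts run out.

    search_fn is a hypothetical zero-argument callable supplied by the
    caller, e.g. a lambda wrapping the search request; it is not a
    pymilvus API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return search_fn()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)  # wait before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

If every attempt fails, the last exception is re-raised, so a persistent failure such as the one in this issue would still fail the test instead of being hidden.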