milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.27k stars 2.91k forks source link

[Bug]: Query failed with error `Query 2 failed, reason err err: rpc error: code = Canceled desc = context canceled` after datanode pod kill and failure chao test #22416

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.0-20230224-8f1bcc37
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:45 - INFO - ci_test]: assert load: 6.703508377075195 (test_all_collections_after_chaos.py:99)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:45 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.04801366368451488, 0.02801101223208525, 0.050857166989082506, 0.03974870715490696, 0.07011696681413884, 0.12866428977880587, 0.1037269169154493, 0.040541936557997246, 0.11530567936278643, 0.12347338878279429, 0.1058253659936845, 0.11649019942575772, 0.08549109339191814, 0.08185064506834168, 0.1......, kwargs: {} (api_request.py:56)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:46 - DEBUG - ci_test]: (api_response) : ["['(distance: 0.2931494116783142, id: 1759)', '(distance: 0.3026074171066284, id: 2751)', '(distance: 0.3063923716545105, id: 2598)', '(distance: 0.31287455558776855, id: 936)', '(distance: 0.3143373727798462, id: 2475)']", "['(distance: 0.3282947540283203, id: 1539)', '(distance: 0.331110209226608......  (api_request.py:31)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:46 - INFO - ci_test]: [test][2023-02-27T02:28:45Z] [0.78098398s] Checker__VFB0jblM search -> <pymilvus.orm.search.SearchResult object at 0x7f8ccf00b1f0> (wrapper.py:30)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:46 - INFO - ci_test]: assert search: 0.7811799049377441 (test_all_collections_after_chaos.py:110)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:46 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['int64 in [1, 2, 3, 4]', None, None, 120], kwargs: {} (api_request.py:56)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:47 - ERROR - pymilvus.decorators]: RPC error: [query], <MilvusException: (code=1, message=fail to query on all shard leaders, err=All attempts results:

[2023-02-27T02:29:26.715Z] attempt #1:code: UnexpectedError, error: fail to Query, QueryNode ID = 4, reason=Query 2 failed, reason  err err: rpc error: code = Canceled desc = context canceled

[2023-02-27T02:29:26.715Z] , /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace

[2023-02-27T02:29:26.715Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:285 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2023-02-27T02:29:26.715Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:268 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Query

[2023-02-27T02:29:26.715Z] /go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:875 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query.func2

[2023-02-27T02:29:26.715Z] /usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit

[2023-02-27T02:29:26.715Z] 

[2023-02-27T02:29:26.715Z] attempt #2:context canceled

[2023-02-27T02:29:26.715Z] )>, <Time:{'RPC start': '2023-02-27 02:28:46.473317', 'RPC error': '2023-02-27 02:28:47.202740'}> (decorators.py:108)

[2023-02-27T02:29:26.715Z] [2023-02-27 02:28:47 - ERROR - ci_test]: Traceback (most recent call last):

[2023-02-27T02:29:26.715Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2023-02-27T02:29:26.715Z]     res = func(*args, **_kwargs)

[2023-02-27T02:29:26.715Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2023-02-27T02:29:26.715Z]     return func(*arg, **kwargs)

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 698, in query

[2023-02-27T02:29:26.715Z]     res = conn.query(self._name, expr, output_fields, partition_names,

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-02-27T02:29:26.715Z]     raise e

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-02-27T02:29:26.715Z]     return func(*args, **kwargs)

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-02-27T02:29:26.715Z]     ret = func(self, *args, **kwargs)

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2023-02-27T02:29:26.715Z]     raise e

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-02-27T02:29:26.715Z]     return func(self, *args, **kwargs)

[2023-02-27T02:29:26.715Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 914, in query

[2023-02-27T02:29:26.715Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2023-02-27T02:29:26.715Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to query on all shard leaders, err=All attempts results:

[2023-02-27T02:29:26.715Z] attempt #1:code: UnexpectedError, error: fail to Query, QueryNode ID = 4, reason=Query 2 failed, reason  err err: rpc error: code = Canceled desc = context canceled

[2023-02-27T02:29:26.715Z] , /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace

[2023-02-27T02:29:26.715Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:285 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2023-02-27T02:29:26.716Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:268 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Query

[2023-02-27T02:29:26.716Z] /go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:875 github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query.func2

[2023-02-27T02:29:26.716Z] /usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit

[2023-02-27T02:29:26.716Z] 

[2023-02-27T02:29:26.716Z] attempt #2:context canceled

[2023-02-27T02:29:26.716Z] )>

[2023-02-27T02:29:26.716Z]  (api_request.py:39)

[2023-02-27T02:29:26.716Z] [2023-02-27 02:28:47 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to query on all shard leaders, err=All attempts results:

[2023-02-27T02:29:26.716Z] attempt #1:code: UnexpectedError, error: fail to Query, QueryNode ID = 4, reason=Query 2 failed, reason  err err: rpc error: code = Canceled desc = context canceled

[2023-02-27T02:29:26.716Z] , /go/src/github.com/milvus-io/milvus/...... (api_request.py:40)

[2023-02-27T02:29:26.716Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2023-02-27T02:29:26.716Z] =========================== short test summary info ============================

[2023-02-27T02:29:26.716Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__VFB0jblM] - AssertionError

[2023-02-27T02:29:26.716Z] =================== 1 failed, 11 passed in 141.71s (0:02:21) ===================

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

chaos type: pod-failure image tag: 2.2.0-20230224-8f1bcc37 target pod: datanode failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2250/pipeline

log: artifacts-datanode-pod-failure-2250-server-logs.tar.gz

artifacts-datanode-pod-failure-2250-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991 /unassign

jiaoew1991 commented 1 year ago

/assign @XuanYang-cn /unassign

XuanYang-cn commented 1 year ago
22697 [2023/02/27 02:28:47.200 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="failed to get service timestamp, taskID = 439737365683503105, collectionID = 439737173703855648, err=cannot found tSafer, vChannel = by-dev-rootcoord-delta_16_439737173703855648v0"]
XuanYang-cn commented 1 year ago

Same issue as #22397, should be fixed by #22414, @zhuwenxing Please help verify /unassign /assign @zhuwenxing

zhuwenxing commented 1 year ago

Not reproduced in 2.2.0-20230301-820ec0fc