Closed zhuwenxing closed 1 year ago
/assign @jiaoew1991 /unassign
There still exists another deadlock(s) in QueryNode /assign @congqixia
https://github.com/milvus-io/milvus/issues/22469 There is a same issue created last week, not sure whether it is caused by the same root cause
/unassign /assign @zhuwenxing
plz verify it with #22622
not fix, tickler not start default only limit that watch event must complete in 30s or continuous retry because timeout, but this case not delay for watch event retry, watch event success.
It still reproduced in 2.2.0-20230308-69f4afe4 failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/524/pipeline/331/
[2023-03-08T10:13:04.159Z] + python3 scripts/second_recall_test.py --host 10.101.90.144
[2023-03-08T10:13:05.081Z] 2023-03-08 10:13:05.000 | INFO | __main__:search_test:53 - recall test for index type HNSW
[2023-03-08T10:13:05.640Z] 2023-03-08 10:13:05.560 | INFO | __main__:search_test:63 -
[2023-03-08T10:13:05.640Z] Search...
[2023-03-08T10:14:56.979Z] 2023-03-08 10:14:43.866 | INFO | __main__:search_test:69 - search cost 98.3061 seconds
[2023-03-08T10:14:57.901Z] 2023-03-08 10:14:57.898 | INFO | __main__:search_test:86 - recall=0.974
[2023-03-08T10:14:57.901Z] 2023-03-08 10:14:57.898 | INFO | __main__:search_test:63 -
[2023-03-08T10:14:57.901Z] Search...
[2023-03-08T10:18:19.323Z] RPC error: [search], <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
[2023-03-08T10:18:19.323Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-08T10:18:19.323Z] details = "Deadline Exceeded"
[2023-03-08T10:18:19.323Z] debug_error_string = "{"created":"@1678270698.101074385","description":"Error received from peer ipv4:10.101.90.144:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-08T10:18:19.323Z] >)>, <Time:{'RPC start': '2023-03-08 10:14:57.898721', 'RPC error': '2023-03-08 10:18:18.102119'}>
[2023-03-08T10:18:19.323Z] Traceback (most recent call last):
[2023-03-08T10:18:19.323Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-08T10:18:19.323Z] return func(self, *args, **kwargs)
[2023-03-08T10:18:19.323Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
[2023-03-08T10:18:19.323Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2023-03-08T10:18:19.323Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
[2023-03-08T10:18:19.323Z] raise pre_err
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 475, in _execute_search_requests
[2023-03-08T10:18:19.324Z] response = self._stub.Search(request, timeout=timeout)
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
[2023-03-08T10:18:19.324Z] return _end_unary_response_blocking(state, call, False, None)
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
[2023-03-08T10:18:19.324Z] raise _InactiveRpcError(state)
[2023-03-08T10:18:19.324Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2023-03-08T10:18:19.324Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-08T10:18:19.324Z] details = "Deadline Exceeded"
[2023-03-08T10:18:19.324Z] debug_error_string = "{"created":"@1678270698.101074385","description":"Error received from peer ipv4:10.101.90.144:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-08T10:18:19.324Z] >
[2023-03-08T10:18:19.324Z]
[2023-03-08T10:18:19.324Z] The above exception was the direct cause of the following exception:
[2023-03-08T10:18:19.324Z]
[2023-03-08T10:18:19.324Z] Traceback (most recent call last):
[2023-03-08T10:18:19.324Z] File "scripts/second_recall_test.py", line 103, in <module>
[2023-03-08T10:18:19.324Z] search_test(host, index_type)
[2023-03-08T10:18:19.324Z] File "scripts/second_recall_test.py", line 65, in search_test
[2023-03-08T10:18:19.324Z] res = collection.search(
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
[2023-03-08T10:18:19.324Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-08T10:18:19.324Z] raise e
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-08T10:18:19.324Z] return func(*args, **kwargs)
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-08T10:18:19.324Z] ret = func(self, *args, **kwargs)
[2023-03-08T10:18:19.324Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 58, in handler
[2023-03-08T10:18:19.324Z] raise MilvusException(message=str(e)) from e
[2023-03-08T10:18:19.324Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
[2023-03-08T10:18:19.324Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-08T10:18:19.324Z] details = "Deadline Exceeded"
[2023-03-08T10:18:19.324Z] debug_error_string = "{"created":"@1678270698.101074385","description":"Error received from peer ipv4:10.101.90.144:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-08T10:18:19.324Z] >)>
script returned exit code 1
The first search success but the latency was high, then the second search failed.
log:
artifacts-rocksmq-standalone-reinstall-524-server-logs.tar.gz artifacts-rocksmq-standalone-reinstall-524-pytest-logs.tar.gz
not fix, tickler not start default only limit that watch event must complete in 30s or continuous retry because timeout, but this case not delay for watch event retry, watch event success.
any coming patch?
/unassign
image tag: 2.2.0-20230309-130ab6da failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/552/pipeline log: artifacts-pulsar-cluster-reinstall-552-server-logs.tar.gz artifacts-pulsar-cluster-reinstall-552-pytest-logs.tar.gz
same issue in this enviroment
It also reproduced in the upgrade.
[2023-03-13T03:13:20.802Z] + python3 scripts/second_recall_test.py --host 10.101.128.189
[2023-03-13T03:13:21.744Z] 2023-03-13 03:13:21.407 | INFO | __main__:search_test:53 - recall test for index type HNSW
[2023-03-13T03:13:21.997Z] 2023-03-13 03:13:21.869 | INFO | __main__:search_test:63 -
[2023-03-13T03:13:21.997Z] Search...
[2023-03-13T03:16:43.368Z] RPC error: [search], <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
[2023-03-13T03:16:43.368Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-13T03:16:43.368Z] details = "Deadline Exceeded"
[2023-03-13T03:16:43.368Z] debug_error_string = "{"created":"@1678677402.050056252","description":"Error received from peer ipv4:10.101.128.189:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-13T03:16:43.368Z] >)>, <Time:{'RPC start': '2023-03-13 03:13:21.869299', 'RPC error': '2023-03-13 03:16:42.051956'}>
[2023-03-13T03:16:43.368Z] Traceback (most recent call last):
[2023-03-13T03:16:43.368Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-03-13T03:16:43.368Z] return func(self, *args, **kwargs)
[2023-03-13T03:16:43.368Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 518, in search
[2023-03-13T03:16:43.368Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2023-03-13T03:16:43.368Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 487, in _execute_search_requests
[2023-03-13T03:16:43.368Z] raise pre_err
[2023-03-13T03:16:43.368Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 475, in _execute_search_requests
[2023-03-13T03:16:43.368Z] response = self._stub.Search(request, timeout=timeout)
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
[2023-03-13T03:16:43.369Z] return _end_unary_response_blocking(state, call, False, None)
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
[2023-03-13T03:16:43.369Z] raise _InactiveRpcError(state)
[2023-03-13T03:16:43.369Z] grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
[2023-03-13T03:16:43.369Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-13T03:16:43.369Z] details = "Deadline Exceeded"
[2023-03-13T03:16:43.369Z] debug_error_string = "{"created":"@1678677402.050056252","description":"Error received from peer ipv4:10.101.128.189:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-13T03:16:43.369Z] >
[2023-03-13T03:16:43.369Z]
[2023-03-13T03:16:43.369Z] The above exception was the direct cause of the following exception:
[2023-03-13T03:16:43.369Z]
[2023-03-13T03:16:43.369Z] Traceback (most recent call last):
[2023-03-13T03:16:43.369Z] File "scripts/second_recall_test.py", line 103, in <module>
[2023-03-13T03:16:43.369Z] search_test(host, index_type)
[2023-03-13T03:16:43.369Z] File "scripts/second_recall_test.py", line 65, in search_test
[2023-03-13T03:16:43.369Z] res = collection.search(
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 660, in search
[2023-03-13T03:16:43.369Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-03-13T03:16:43.369Z] raise e
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-03-13T03:16:43.369Z] return func(*args, **kwargs)
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-03-13T03:16:43.369Z] ret = func(self, *args, **kwargs)
[2023-03-13T03:16:43.369Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 58, in handler
[2023-03-13T03:16:43.369Z] raise MilvusException(message=str(e)) from e
[2023-03-13T03:16:43.369Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
[2023-03-13T03:16:43.369Z] status = StatusCode.DEADLINE_EXCEEDED
[2023-03-13T03:16:43.369Z] details = "Deadline Exceeded"
[2023-03-13T03:16:43.369Z] debug_error_string = "{"created":"@1678677402.050056252","description":"Error received from peer ipv4:10.101.128.189:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
[2023-03-13T03:16:43.369Z] >)>
script returned exit code 1
milvus mode: cluster deploy task: upgrade old image tag: v2.2.2 new image tag: 2.2.0-20230310-b2ece6a5 failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/579/pipeline/331/ log: artifacts-pulsar-cluster-upgrade-579-server-logs.tar.gz artifacts-pulsar-cluster-upgrade-579-pytest-logs.tar.gz
time tick lag is huge during the test
@zhuwenxing does this issue reproduce in latest 2.2.0 branch after these deadlock issue fixed?
Not reproduced in 2.2.0-20230426-05d317fe
Is there an existing issue for this?
Environment
Current Behavior
collection name: sift_128_euclidean_HNSW
[2023/03/07 07:29:11.945 +00:00] [WARN] [grpcclient/client.go:297] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/task_search.go:516] ["QueryNode search return error"] [traceID=fc3d844b8a9e9c2] [msgID=439923135112544257] [nodeID=15] [channels="[by-dev-rootcoord-dml_112_439922264092151365v0]"] [error="err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/task_policies.go:132] ["failed to do query with node"] [traceID=fc3d844b8a9e9c2] [nodeID=15] [dmlChannels="[by-dev-rootcoord-dml_112_439922264092151365v0]"] [error="code: NotReadyServe, error: err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [grpcclient/client.go:297] ["ClientBase Call grpc first call get error"] [role=querynode] [error="err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/task_search.go:516] ["QueryNode search return error"] [traceID=fc3d844b8a9e9c2] [msgID=439923135112544257] [nodeID=22] [channels="[by-dev-rootcoord-dml_113_439922264092151365v1]"] [error="err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/task_policies.go:132] ["failed to do query with node"] [traceID=fc3d844b8a9e9c2] [nodeID=22] [dmlChannels="[by-dev-rootcoord-dml_113_439922264092151365v1]"] [error="code: NotReadyServe, error: err: rpc error: code = Canceled desc = context canceled\n, /go/src/github.com/milvus-io/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:296 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:254 github.com/milvus-io/milvus/internal/distributed/querynode/client.(Client).Search\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:513 github.com/milvus-io/milvus/internal/proxy.(searchTask).searchShard\n/go/src/github.com/milvus-io/milvus/internal/proxy/task_policies.go:131 github.com/milvus-io/milvus/internal/proxy.mergeRoundRobinPolicy.func1\n/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit\n"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/task_policies.go:159] ["retry another query node with round robin"] [traceID=fc3d844b8a9e9c2] [Nexts="{\"by-dev-rootcoord-dml_112_439922264092151365v0\":22,\"by-dev-rootcoord-dml_113_439922264092151365v1\":16}"] [2023/03/07 07:29:11.945 +00:00] [WARN] [proxy/impl.go:2839] ["Search failed to WaitToFinish"] [traceID=fc3d844b8a9e9c2] [error="proxy TaskCondition context Done"] [role=proxy] [msgID=439923135112544257] [db=] [collection=deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] [partitions="[]"] [dsl="int64 >= 0"] [len(PlaceholderGroup)=44] [OutputFields="[int64]"] [search_params="[{\"key\":\"anns_field\",\"value\":\"binary_vector\"},{\"key\":\"topk\",\"value\":\"10\"},{\"key\":\"metric_type\",\"value\":\"HAMMING\"},{\"key\":\"params\",\"value\":\"{\\"nprobe\\":10}\"},{\"key\":\"round_decimal\",\"value\":\"-1\"},{\"key\":\"offset\",\"value\":\"0\"}]"] [travel_timestamp=0] [guarantee_timestamp=0]
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job; https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_for_release_cron/detail/deploy_test_for_release_cron/492/pipeline
log: artifacts-pulsar-cluster-reinstall-492-server-logs.tar.gz artifacts-pulsar-cluster-reinstall-492-pytest-logs.tar.gz
Anything else?
No response