Closed: zhuwenxing closed this issue 1 year ago
May be related to https://github.com/milvus-io/milvus/issues/20970
Unsubscribe failed with a "pulsar connection closed" error. We need a strategy for handling the case where reconnecting to Pulsar fails. @congqixia
/assign
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test/detail/chaos-test/3286/pipeline/ Pulsar recovery takes over 1 min. I set the idle time to 3 min and the chaos test passed. @zhuwenxing
Also, Milvus's etcd lease timeout is 1 min, so the idle time should be greater than this to make sure the system has recovered after chaos. /cc @xiaofan-luan /cc @congqixia
So we need a suitable idle timeout for Milvus? Any suggestions? @yah01
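For illustration only, here is a minimal sketch of the constraint described above: the idle time after chaos must exceed the 1 min etcd lease timeout. Note that this is a polling variant, not what the chaos test does (the test waits a fixed idle time). The names `wait_for_recovery` and `is_healthy` are hypothetical and not part of Milvus or the test suite; the 60 s and 180 s values are the ones quoted in this thread.

```python
import time
from typing import Callable

ETCD_LEASE_TIMEOUT_S = 60   # etcd lease timeout quoted in this thread
IDLE_TIME_S = 180           # idle time chosen in this thread


def wait_for_recovery(
    is_healthy: Callable[[], bool],          # hypothetical health probe
    idle_time_s: float = IDLE_TIME_S,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_healthy() until it returns True or idle_time_s elapses."""
    # The wait window must be longer than the lease timeout; otherwise
    # stale sessions may not have expired yet when the check runs.
    if idle_time_s <= ETCD_LEASE_TIMEOUT_S:
        raise ValueError("idle time must exceed the etcd lease timeout")
    deadline = clock() + idle_time_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

In a test, `is_healthy` could wrap a cheap request against the cluster; that wiring is left out here because it depends on the test harness.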
2 minutes should be enough to recover
I have changed the idle time to 3 min
/assign @zhuwenxing Please help verify it now that the idle time has been updated.
/unassign
It is still reproduced, but the error message has changed.
[2022-12-12T04:59:17.064Z] [2022-12-12 04:59:16 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.10011674241569275, 0.026484448898056547, 0.1174656922074973, 0.14838325856782653, 0.037432259910172905, 0.0819583382143737, 0.044623756846596356, 0.08764375063613665, 0.033402024552142126, 0.07062440934806637, 0.05836438474772751, 0.06159803316382118, 0.11255150425240625, 0.10276470323752888, 0......, kwargs: {} (api_request.py:56)
[2022-12-12T04:59:17.064Z] [2022-12-12 04:59:16 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=3, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 3, node id = 12)>, <Time:{'RPC start': '2022-12-12 04:59:16.418005', 'RPC error': '2022-12-12 04:59:16.783941'}> (decorators.py:108)
[2022-12-12T04:59:17.064Z] [2022-12-12 04:59:16 - ERROR - ci_test]: Traceback (most recent call last):
[2022-12-12T04:59:17.065Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2022-12-12T04:59:17.065Z] res = func(*args, **_kwargs)
[2022-12-12T04:59:17.065Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2022-12-12T04:59:17.065Z] return func(*arg, **kwargs)
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search
[2022-12-12T04:59:17.065Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2022-12-12T04:59:17.065Z] raise e
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2022-12-12T04:59:17.065Z] return func(*args, **kwargs)
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2022-12-12T04:59:17.065Z] ret = func(self, *args, **kwargs)
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2022-12-12T04:59:17.065Z] raise e
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-12-12T04:59:17.065Z] return func(self, *args, **kwargs)
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search
[2022-12-12T04:59:17.065Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests
[2022-12-12T04:59:17.065Z] raise pre_err
[2022-12-12T04:59:17.065Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 429, in _execute_search_requests
[2022-12-12T04:59:17.065Z] raise MilvusException(response.status.error_code, response.status.reason)
[2022-12-12T04:59:17.065Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=3, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 3, node id = 12)>
[2022-12-12T04:59:17.065Z] (api_request.py:39)
[2022-12-12T04:59:17.065Z] [2022-12-12 04:59:16 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=3, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 3, node id = 12)> (api_request.py:40)
chaos type: pod-kill
image tag: master-20221212-e977e014
target pod: pulsar
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/467/pipeline
log: artifacts-pulsar-pod-kill-467-server-logs.tar.gz artifacts-pulsar-pod-kill-467-pytest-logs.tar.gz
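When triaging many failed runs, it can help to pull the two node IDs out of the "target node id not match" message in the log above. The following is a rough sketch that assumes the message keeps the exact `target id = N, node id = M` wording seen in this log; `parse_node_mismatch` is a hypothetical helper, not a pymilvus function.

```python
import re
from typing import Optional, Tuple

# Matches the tail of the error message as it appears in the log above.
_MISMATCH = re.compile(r"target id = (\d+), node id = (\d+)")


def parse_node_mismatch(message: str) -> Optional[Tuple[int, int]]:
    """Return (target_id, node_id) if the message reports a mismatch."""
    m = _MISMATCH.search(message)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


msg = (
    "fail to search on all shard leaders, err=fail to Search, "
    "QueryNode ID=3, reason=QueryNode 12 can't serve, recovering: "
    "target node id not match target id = 3, node id = 12"
)
print(parse_node_mismatch(msg))  # (3, 12)
```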
/unassign /assign @yah01
It is also reproduced in the 2.2 branch.

chaos type: pod-failure
image tag: 2.2.0-20221212-184d3c35
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/405/pipeline
log: artifacts-querynode-pod-failure-405-server-logs.tar.gz artifacts-querynode-pod-failure-405-pytest-logs.tar.gz

chaos type: pod-kill
image tag: 2.2.0-20221212-184d3c35
target pod: etcd
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/395/pipeline
log: artifacts-etcd-pod-kill-395-server-logs.tar.gz artifacts-etcd-pod-kill-395-pytest-logs.tar.gz
@zhuwenxing does this still reproduce?
It is not reproduced in the master branch but is still reproduced in the 2.2 branch. @yah01 @congqixia Is there a cherry-pick PR for 2.2?
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:15 - INFO - ci_test]: index info: [{'collection': 'Hello_Milvus', 'field': 'varchar', 'index_name': 'test_SKUVzBYh', 'index_param': {'index_type': 'Trie'}}, {'collection': 'Hello_Milvus', 'field': 'float_vector', 'index_name': 'test_dOcvvIOR', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_data_persistence.py:64)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:15 - DEBUG - ci_test]: (api_request) : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:15 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:15 - INFO - ci_test]: [test][2022-12-19T03:21:15Z] [0.00528159s] Hello_Milvus load -> None (wrapper.py:30)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:15 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.12213514804040805, 0.1380733812164912, 0.09017636293140217, 0.1075027498853586, 0.028142317306075682, 0.01582499737221483, 0.12412341708114152, 0.03861188170183319, 0.036823224957653694, 0.06920357325807211, 0.14432174786304863, 0.0016837899327997307, 0.11182058393475187, 0.1409779737635982, 0......., kwargs: {} (api_request.py:56)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:16 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 7, node id = 12)>, <Time:{'RPC start': '2022-12-19 03:21:15.734553', 'RPC error': '2022-12-19 03:21:16.129307'}> (decorators.py:108)
[2022-12-19T03:21:16.450Z] [2022-12-19 03:21:16 - ERROR - ci_test]: Traceback (most recent call last):
[2022-12-19T03:21:16.450Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2022-12-19T03:21:16.450Z] res = func(*args, **_kwargs)
[2022-12-19T03:21:16.450Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2022-12-19T03:21:16.450Z] return func(*arg, **kwargs)
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 609, in search
[2022-12-19T03:21:16.451Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2022-12-19T03:21:16.451Z] raise e
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2022-12-19T03:21:16.451Z] return func(*args, **kwargs)
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2022-12-19T03:21:16.451Z] ret = func(self, *args, **kwargs)
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2022-12-19T03:21:16.451Z] raise e
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-12-19T03:21:16.451Z] return func(self, *args, **kwargs)
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 470, in search
[2022-12-19T03:21:16.451Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 439, in _execute_search_requests
[2022-12-19T03:21:16.451Z] raise pre_err
[2022-12-19T03:21:16.451Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 430, in _execute_search_requests
[2022-12-19T03:21:16.451Z] raise MilvusException(response.status.error_code, response.status.reason)
[2022-12-19T03:21:16.451Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 7, node id = 12)>
[2022-12-19T03:21:16.451Z] (api_request.py:39)
[2022-12-19T03:21:16.451Z] [2022-12-19 03:21:16 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=7, reason=QueryNode 12 can't serve, recovering: target node id not match target id = 7, node id = 12)> (api_request.py:40)
[2022-12-19T03:21:16.451Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2022-12-19T03:21:16.451Z] =========================== short test summary info ============================
[2022-12-19T03:21:16.451Z] FAILED testcases/test_data_persistence.py::TestDataPersistence::test_milvus_default - AssertionError
[2022-12-19T03:21:16.451Z] ============================== 1 failed in 4.26s ===============================
chaos type: pod-failure
image tag: 2.2.0-20221216-1aa7a9a8
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/550/pipeline
log: artifacts-querynode-pod-failure-550-server-logs.tar.gz artifacts-querynode-pod-failure-550-pytest-logs.tar.gz
/assign @zhuwenxing The fix for 2.2 has been merged.
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:34 - DEBUG - ci_test]: (api_request) : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:34 - DEBUG - ci_test]: (api_response) : None (api_request.py:31)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:34 - INFO - ci_test]: [test][2022-12-22T21:22:34Z] [0.00430286s] Checker__MAieYr3G load -> None (wrapper.py:30)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:34 - DEBUG - ci_test]: (api_request) : [Collection.search] args: [[[0.1034426524979644, 0.052573964560210795, 0.09321532694772285, 0.13742976129167736, 0.0874793833159194, 0.14936034785769273, 0.11336207724346294, 0.0998346228709992, 0.10479114113346255, 0.11088699606972784, 0.13533603435245778, 0.0014681036676781197, 0.0717271918451279, 0.04021748196965942, 0.03......, kwargs: {} (api_request.py:56)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:35 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=2, reason=QueryNode 0 can't serve, recovering: target node id not match target id = 2, node id = 0)>, <Time:{'RPC start': '2022-12-22 21:22:34.321138', 'RPC error': '2022-12-22 21:22:35.130073'}> (decorators.py:108)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:35 - ERROR - ci_test]: Traceback (most recent call last):
[2022-12-22T21:25:06.617Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2022-12-22T21:25:06.617Z] res = func(*args, **_kwargs)
[2022-12-22T21:25:06.617Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2022-12-22T21:25:06.617Z] return func(*arg, **kwargs)
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 609, in search
[2022-12-22T21:25:06.617Z] res = conn.search(self._name, data, anns_field, param, limit, expr,
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2022-12-22T21:25:06.617Z] raise e
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2022-12-22T21:25:06.617Z] return func(*args, **kwargs)
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2022-12-22T21:25:06.617Z] ret = func(self, *args, **kwargs)
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2022-12-22T21:25:06.617Z] raise e
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2022-12-22T21:25:06.617Z] return func(self, *args, **kwargs)
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 470, in search
[2022-12-22T21:25:06.617Z] return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 439, in _execute_search_requests
[2022-12-22T21:25:06.617Z] raise pre_err
[2022-12-22T21:25:06.617Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 430, in _execute_search_requests
[2022-12-22T21:25:06.617Z] raise MilvusException(response.status.error_code, response.status.reason)
[2022-12-22T21:25:06.617Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=2, reason=QueryNode 0 can't serve, recovering: target node id not match target id = 2, node id = 0)>
[2022-12-22T21:25:06.617Z] (api_request.py:39)
[2022-12-22T21:25:06.617Z] [2022-12-22 21:22:35 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=2, reason=QueryNode 0 can't serve, recovering: target node id not match target id = 2, node id = 0)> (api_request.py:40)
chaos type: pod-failure
image tag: master-20221222-98088e3b
target pod: pulsar
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/734/pipeline
log: artifacts-pulsar-pod-failure-734-server-logs.tar.gz artifacts-pulsar-pod-failure-734-pytest-logs.tar.gz
@yah01 Please take a look. It is still reproduced in master.
The QueryNode reports that the subscription was re-subscribed and then panics. Upgrading the Pulsar SDK may help: https://github.com/milvus-io/milvus/pull/21456
Verified and passed with 2.2.0-20230202-161725a6
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
All test cases pass.
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/369/pipeline
log: artifacts-pulsar-pod-kill-369-server-logs.tar.gz artifacts-pulsar-pod-kill-369-pytest-logs.tar.gz
Anything else?
No response