milvus-io / milvus


[Bug]: query failed with error `service internal error: target version mismatch` after querynode pod failure chaos test #37902

Open zhuwenxing opened 1 day ago

zhuwenxing commented 1 day ago

Is there an existing issue for this?

Environment

- Milvus version: master-20241120-7ba85504-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-11-20T09:13:47.476Z] <name>: Checker__HyvbZlCy

[2024-11-20T09:13:47.476Z] <description>: 

[2024-11-20T09:13:47.476Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}......  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:40 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:43 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:43 - DEBUG - ci_test]: (api_request)  : [Collection.compact] args: [False, 180], kwargs: {} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:43 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:44 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:51 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:51 - INFO - ci_test]: assert create collection: 0.013302326202392578, init_entities: 194352 (test_all_collections_after_chaos.py:49)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:54 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'int64': -3000, 'float': 0.1309169, 'varchar': 'gocyi', 'text': 'Too goal close go. Discussion many hot practice former. Full risk notice chance bit seat.\nReport music pressure cut nature. Doctor by mind according issue under middle.\nAlone cold rule.', 'json_field': {'name': 'Meghan Gutierrez',......, kwargs: {'timeout': 180} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:55 - DEBUG - ci_test]: (api_response) : (insert count: 2000, delete count: 0, upsert count: 0, timestamp: 454058011668250625, success count: 2000, err count: 0  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:55 - INFO - ci_test]: assert insert: 1.4796831607818604 (test_all_collections_after_chaos.py:57)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:10:55 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:12 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - INFO - ci_test]: assert flush: 16.213988065719604, entities: 196352 (test_all_collections_after_chaos.py:67)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - INFO - ci_test]: index info: [{'collection': 'Checker__HyvbZlCy', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'text', 'index_name': 'text', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'text_sparse_emb', 'index_name': 'text_sparse_emb', 'index_param': {'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'BM25', 'params': {'bm25_k1': 1.5, 'bm25_b': 0.75}}}] (test_all_collections_after_chaos.py:71)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - INFO - ci_test]: index info: [{'collection': 'Checker__HyvbZlCy', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'text_sparse_emb', 'index_name': 'text_sparse_emb', 'index_param': {'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'BM25', 'params': {'bm25_k1': 1.5, 'bm25_b': 0.75}}}, {'collection': 'Checker__HyvbZlCy', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__HyvbZlCy', 'field': 'text', 'index_name': 'text', 'index_param': {'index_type': 'INVERTED'}}] (test_all_collections_after_chaos.py:86)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:31 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.038520660424703645, 0.02706883160875182, 0.11491212846102333, 0.1251422882317869, 0.15336852226710182, 0.05243946725633598, 0.04315610567137004, 0.019213663000552206, 0.038231163385241455, 0.1427314018911522, 0.034942312377116536, 0.10038952282770104, 0.13653709172576833, 0.006019849111047318, ......, kwargs: {} (api_request.py:62)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:52 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 454057490543884941, channel: by-dev-rootcoord-dml_8_454057490543884941v1,  current target version: 1732093296790336169, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_8_454057490543884941v1])>, <Time:{'RPC start': '2024-11-20 09:11:31.582876', 'RPC error': '2024-11-20 09:11:52.599095'}> (decorators.py:140)

[2024-11-20T09:13:47.476Z] [2024-11-20 09:11:52 - ERROR - ci_test]: Traceback (most recent call last):

[2024-11-20T09:13:47.476Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-11-20T09:13:47.476Z]     res = func(*args, **_kwargs)

[2024-11-20T09:13:47.476Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-11-20T09:13:47.476Z]     return func(*arg, **kwargs)

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 801, in search

[2024-11-20T09:13:47.476Z]     resp = conn.search(

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler

[2024-11-20T09:13:47.476Z]     raise e from e

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler

[2024-11-20T09:13:47.476Z]     return func(*args, **kwargs)

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler

[2024-11-20T09:13:47.476Z]     return func(self, *args, **kwargs)

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler

[2024-11-20T09:13:47.476Z]     raise e from e

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-11-20T09:13:47.476Z]     return func(*args, **kwargs)

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 806, in search

[2024-11-20T09:13:47.476Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 747, in _execute_search

[2024-11-20T09:13:47.476Z]     raise e from e

[2024-11-20T09:13:47.476Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 736, in _execute_search

[2024-11-20T09:13:47.477Z]     check_status(response.status)

[2024-11-20T09:13:47.477Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status

[2024-11-20T09:13:47.477Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-11-20T09:13:47.477Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 454057490543884941, channel: by-dev-rootcoord-dml_8_454057490543884941v1,  current target version: 1732093296790336169, leader version: 0: channel not available[channel=by-dev-rootcoord-dml_8_454057490543884941v1])>

[2024-11-20T09:13:47.477Z]  (api_request.py:45)

[2024-11-20T09:13:47.477Z] [2024-11-20 09:11:52 - ERROR - ci_test]: (api_response) : <MilvusException: (code=503, message=failed to search: service internal error: target version mismatch, collection: 454057490543884941, channel: by-dev-rootcoord-dml_8_454057490543884941v1,  current target version: 1732093296790336169, leader version: 0: channel not available[channel=by-dev-rootcoor...... (api_request.py:46)

[2024-11-20T09:13:47.477Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
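
For triage, the identifiers embedded in the error message can be extracted with a regular expression. The following is a minimal sketch: the function name `parse_target_version_mismatch` and the pattern are assumptions derived only from the message text in the log above, not from any Milvus or pymilvus API, and the message format may differ between builds.

```python
import re
from typing import Dict, Optional

# Pattern derived from the message text in the log above. This is an
# assumption about the format, not something Milvus guarantees.
_MISMATCH_RE = re.compile(
    r"target version mismatch, "
    r"collection: (?P<collection>\d+), "
    r"channel: (?P<channel>[^,\s]+),\s+"
    r"current target version: (?P<target_version>\d+), "
    r"leader version: (?P<leader_version>\d+)"
)


def parse_target_version_mismatch(message: str) -> Optional[Dict[str, object]]:
    """Return the fields of a 'target version mismatch' message, or None."""
    match = _MISMATCH_RE.search(message)
    if match is None:
        return None
    return {
        "collection": int(match.group("collection")),
        "channel": match.group("channel"),
        "target_version": int(match.group("target_version")),
        "leader_version": int(match.group("leader_version")),
    }
```

Applied to the message in this report, the sketch yields a `leader_version` of 0 next to a non-zero `target_version`, which is the mismatch the error describes.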

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19549/pipeline

log: artifacts-querynode-pod-failure-19549-server-logs.tar.gz

[2024-11-20T09:10:07.229Z] + kubectl get pods -o wide

[2024-11-20T09:10:07.230Z] + grep querynode-pod-failure-19549

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-etcd-0                                1/1     Running            0                33m     10.104.18.193   4am-node25   <none>           <none>

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-etcd-1                                1/1     Running            0                33m     10.104.34.13    4am-node37   <none>           <none>

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-etcd-2                                1/1     Running            0                33m     10.104.26.154   4am-node32   <none>           <none>

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-milvus-datanode-b9bb7b756-52zkd       1/1     Running            3 (32m ago)      33m     10.104.16.254   4am-node21   <none>           <none>

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-milvus-datanode-b9bb7b756-nxvml       1/1     Running            3 (32m ago)      33m     10.104.4.128    4am-node11   <none>           <none>

[2024-11-20T09:10:07.487Z] querynode-pod-failure-19549-milvus-indexnode-547cb684f-92fjr      1/1     Running            3 (32m ago)      33m     10.104.30.73    4am-node38   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-indexnode-547cb684f-sbm9r      1/1     Running            3 (32m ago)      33m     10.104.32.127   4am-node39   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-indexnode-547cb684f-t5c5g      1/1     Running            3 (32m ago)      33m     10.104.20.180   4am-node22   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-mixcoord-565dd7b6b7-ttbrv      1/1     Running            3 (32m ago)      33m     10.104.32.125   4am-node39   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-proxy-55f46d47-9x4xd           1/1     Running            3 (32m ago)      33m     10.104.32.126   4am-node39   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-querynode-77c888d8f9-54vls     1/1     Running            7 (9m18s ago)    33m     10.104.14.162   4am-node18   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-querynode-77c888d8f9-8qnft     1/1     Running            7 (9m4s ago)     33m     10.104.17.171   4am-node23   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-milvus-querynode-77c888d8f9-ztqpq     1/1     Running            8 (8m16s ago)    33m     10.104.4.127    4am-node11   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-minio-0                               1/1     Running            0                33m     10.104.18.191   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-minio-1                               1/1     Running            0                33m     10.104.34.12    4am-node37   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-minio-2                               1/1     Running            0                33m     10.104.26.153   4am-node32   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-minio-3                               1/1     Running            0                33m     10.104.19.138   4am-node28   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-bookie-0                       1/1     Running            0                33m     10.104.18.192   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-bookie-1                       1/1     Running            0                33m     10.104.34.14    4am-node37   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-bookie-init-4vmng              0/1     Completed          0                33m     10.104.18.176   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-broker-0                       1/1     Running            0                33m     10.104.18.178   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-proxy-0                        1/1     Running            0                33m     10.104.14.163   4am-node18   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-pulsar-init-jtpts              0/1     Completed          0                33m     10.104.18.175   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-recovery-0                     1/1     Running            0                33m     10.104.18.177   4am-node25   <none>           <none>

[2024-11-20T09:10:07.488Z] querynode-pod-failure-19549-pulsar-zookeeper-0                    1/1     Running            0                33m     10.104.18.190   4am-node25   <none>           <none>
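
The restart counts in the pod listing above can be summarized per component with a small script. This is a sketch under assumptions: the helper name `restart_counts` is hypothetical, and it assumes the column order `NAME READY STATUS RESTARTS ...` and the Jenkins timestamp prefix seen in this listing. Because the release name itself contains `querynode`, the component filter has to be the more specific `milvus-querynode`.

```python
import re
from typing import Dict


def restart_counts(kubectl_output: str, component: str) -> Dict[str, int]:
    """Map pod name to restart count for pods whose name contains `component`."""
    counts: Dict[str, int] = {}
    for line in kubectl_output.splitlines():
        # Drop a leading Jenkins timestamp such as "[2024-11-20T09:10:07.488Z] ".
        line = re.sub(r"^\[[^\]]+\]\s*", "", line).strip()
        fields = line.split()
        # Assumed columns: NAME READY STATUS RESTARTS ...
        if len(fields) < 4 or component not in fields[0]:
            continue
        if fields[3].isdigit():
            counts[fields[0]] = int(fields[3])
    return counts
```

Calling `restart_counts(listing, "milvus-querynode")` on the listing above returns only the three querynode pods with their restart counts, which makes it easy to compare against the restart counts of a later listing.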

Anything else?

No response

zhuwenxing commented 1 day ago

This error also occurred in https://github.com/milvus-io/milvus/issues/37765. However, it could not be reproduced during verification, so that issue was closed. This new issue has been opened to track the error.

The key difference is that in https://github.com/milvus-io/milvus/issues/37765 the querynode restarted during verification, whereas in this issue only the error was reported, with no indication of a restart.

yanliang567 commented 1 day ago

/assign @weiliu1031
/unassign

yanliang567 commented 1 day ago

@liliu-z please take a look at this issue as well

weiliu1031 commented 1 day ago

should be fixed by #37909