
[Bug]: Query failed with error `fail to Query, QueryNode ID = 19, reason=target node id not match target id = 19, node id = 27` after reinstallation #23594

zhuwenxing closed this issue 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230420-935d79c9
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

Pytest failure from the CI run (log timestamps 2023-04-20T10:37:51Z stripped for readability):

```
self = <pymilvus.client.grpc_handler.GrpcHandler object at 0x7fe3bdc3ff40>
collection_name = 'deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_not_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000'
expr = 'int64 in [0, 1]', output_fields = ['int64'], partition_names = None
timeout = None
kwargs = {'check_task': 'check_query_not_empty', 'guarantee_timestamp': 0, 'schema': {'auto_id': False, 'consistency_level': 0,...R: 21>}, {'description': '', 'name': 'binary_vector', 'params': {'dim': 128}, 'type': <DataType.BINARY_VECTOR: 100>}]}}
collection_schema = {'auto_id': False, 'consistency_level': 0, 'description': '', 'fields': [{'auto_id': False, 'description': '', 'is_pri...AR: 21>}, {'description': '', 'name': 'binary_vector', 'params': {'dim': 128}, 'type': <DataType.BINARY_VECTOR: 100>}]}
consistency_level = 0
request = collection_name: "deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_...ta_size_3000"
expr: "int64 in [0, 1]"
output_fields: "int64"
query_params {
  key: "ignore_growing"
  value: "False"
}

future = <_MultiThreadedRendezvous of RPC that terminated with:
 status = StatusCode.OK
 details = ""
>
response = status {
  error_code: UnexpectedError
  reason: "fail to query on all shard leaders, err=fail to Query, QueryNode ID = 19, reason=target node id not match target id = 19, node id = 27"
}

    @retry_on_rpc_failure()
    def query(self, collection_name, expr, output_fields=None, partition_names=None, timeout=None, **kwargs):
        if output_fields is not None and not isinstance(output_fields, (list,)):
            raise ParamError(message="Invalid query format. 'output_fields' must be a list")
        collection_schema = kwargs.get("schema", None)
        if not collection_schema:
            collection_schema = self.describe_collection(collection_name, timeout)
        consistency_level = collection_schema["consistency_level"]
        # overwrite the consistency level defined when user created the collection
        consistency_level = get_consistency_level(kwargs.get("consistency_level", consistency_level))

        ts_utils.construct_guarantee_ts(consistency_level, collection_name, kwargs)
        request = Prepare.query_request(collection_name, expr, output_fields, partition_names, **kwargs)

        future = self._stub.Query.future(request, timeout=timeout)
        response = future.result()
        if response.status.error_code == Status.EMPTY_COLLECTION:
            return []
        if response.status.error_code != Status.SUCCESS:
>           raise MilvusException(response.status.error_code, response.status.reason)
E           pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to query on all shard leaders, err=fail to Query, QueryNode ID = 19, reason=target node id not match target id = 19, node id = 27)>
```
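For reference, a minimal pymilvus sketch of the kind of client call that reaches this code path (the connection parameters and shortened collection name below are placeholders, not the actual CI test code):

```python
from pymilvus import connections, Collection

# Placeholder connection parameters; the CI cluster uses its own host/port.
connections.connect(alias="default", host="127.0.0.1", port="19530")

# Shortened placeholder for the deploy_test_* collection from the CI run;
# the collection is assumed to already exist and be loaded.
collection = Collection("deploy_test_index_type_BIN_IVF_FLAT")

# Same expression and output field as the failing request. The server-side
# error surfaces here as a MilvusException raised from GrpcHandler.query().
res = collection.query(expr="int64 in [0, 1]", output_fields=["int64"])
print(res)
```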

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

Failed job: deploy_test_cron/666
Logs: artifacts-pulsar-cluster-reinstall-666-server-logs.tar.gz, artifacts-pulsar-cluster-reinstall-666-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

smellthemoon commented 1 year ago

/unassign @jiaoew1991
/assign

smellthemoon commented 1 year ago

It seems that Pulsar is being used with subscriptions in exclusive mode, which means only one consumer can be active on a subscription, yet two consumers were created. Could you share the Pulsar settings, or a way to retrieve them, so I can verify this conclusion? @zhuwenxing
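As an illustration (not the project's own tooling), the subscription type and the number of active consumers can be read from the Pulsar admin REST API's topic stats; the broker address and topic name below are assumptions and would need to be replaced with the values from the test cluster:

```python
import requests

# Assumed Pulsar admin endpoint and an example Milvus DML channel topic.
ADMIN = "http://pulsar-broker:8080"
TOPIC = "persistent/public/default/by-dev-rootcoord-dml_0"

# Topic stats list each subscription with its type
# (Exclusive / Shared / Failover / Key_Shared) and its connected consumers.
stats = requests.get(f"{ADMIN}/admin/v2/{TOPIC}/stats").json()

for name, sub in stats.get("subscriptions", {}).items():
    print(name, sub.get("type"), "consumers:", len(sub.get("consumers", [])))
```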

smellthemoon commented 1 year ago

Tried to clean up the subscriptions, but it failed. This is believed to be a network problem between Milvus and Pulsar. Does this problem reproduce consistently? @zhuwenxing

xiaofan-luan commented 1 year ago

> Tried to clean up the subscriptions, but it failed. This is believed to be a network problem between Milvus and Pulsar. Does this problem reproduce consistently? @zhuwenxing

We should consider using shared mode for all subscriptions rather than exclusive mode. We also need to figure out a way to clean up subscriptions when a server dies. I thought there was already a mechanism to clean up expired subscriptions, right?
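For manual cleanup, a hedged sketch of dropping a stale subscription through the Pulsar admin REST API; the endpoint, topic, and subscription name are assumptions, and Pulsar rejects the request while a consumer is still connected to the subscription:

```python
import requests

# Assumed values; replace with the real broker address, topic, and subscription.
ADMIN = "http://pulsar-broker:8080"
TOPIC = "persistent/public/default/by-dev-rootcoord-dml_0"
SUBSCRIPTION = "stale-querynode-subscription"

# DELETE .../subscription/{name} removes the subscription on the broker.
resp = requests.delete(f"{ADMIN}/admin/v2/{TOPIC}/subscription/{SUBSCRIPTION}")
print(resp.status_code, resp.text)
```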

zhuwenxing commented 1 year ago

> Does this problem reproduce consistently?

@smellthemoon No, it does not reproduce consistently.

smellthemoon commented 1 year ago

Likely a network problem between Milvus and Pulsar.

/assign @zhuwenxing
/unassign