milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Query failed with error `code=1, message=No fields returned` after upgrade from v2.2.14 to 2.2.0-20231113-9654d323 #28405

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:v2.2.14 to 2.2.0-20231113-9654d323
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


self = <pymilvus.client.grpc_handler.GrpcHandler object at 0x7ff7087d2eb0>
collection_name = 'deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000'
expr = 'int64 in [0, 1]', output_fields = ['int64'], partition_names = None
timeout = None
kwargs = {'check_task': 'check_query_not_empty', 'schema': {'auto_id': False, 'consistency_level': 0, 'description': '', 'field...R: 21>}, {'description': '', 'name': 'binary_vector', 'params': {'dim': 128}, 'type': <DataType.BINARY_VECTOR: 100>}]}}
request = collection_name: "deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_...wing"
  value: "False"
}
query_params {
  key: "reduce_stop_for_best"
  value: "False"
}
use_default_consistency: true

future = <_MultiThreadedRendezvous of RPC that terminated with:
 status = StatusCode.OK
 details = ""
>
response = status {
  error_code: UnexpectedError
  reason: "fail to query on all shard leaders, err=All attempts results:\nattem...failed, reason query shard(channel)  by-dev-rootcoord-dml_0_445614781037316785v0  does not exist\n err %!w(<nil>)\n"
}

num_fields = 0

    @retry_on_rpc_failure()
    def query(
        self,
        collection_name: str,
        expr: str,
        output_fields: Optional[List[str]] = None,
        partition_names: Optional[List[str]] = None,
        timeout: Optional[float] = None,
        **kwargs,
    ):
        if output_fields is not None and not isinstance(output_fields, (list,)):
            raise ParamError(message="Invalid query format. 'output_fields' must be a list")
        request = Prepare.query_request(
            collection_name, expr, output_fields, partition_names, **kwargs
        )

        future = self._stub.Query.future(request, timeout=timeout)
        response = future.result()
        if response.status.code == Status.EMPTY_COLLECTION:
            return []
        if response.status.code != Status.SUCCESS:
            raise MilvusException(
                response.status.code, response.status.reason, response.status.error_code
            )

        num_fields = len(response.fields_data)
        # check has fields
        if num_fields == 0:
>           raise MilvusException(message="No fields returned")
E           pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=No fields returned)>
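
For context, the failure surfaces from an ordinary query call in the pymilvus client. A minimal client-side sketch of the call that hits this error, assuming a reachable Milvus deployment and that the collection above is already loaded (the host/port are placeholders, not the CI environment's values):

```python
from pymilvus import connections, Collection

# Placeholder endpoint; the CI job uses its own cluster deployment.
connections.connect(alias="default", host="127.0.0.1", port="19530")

collection = Collection(
    "deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_"
    "segment_status_only_growing_is_string_indexed_is_string_indexed_"
    "replica_number_2_is_deleted_is_deleted_data_size_3000"
)

# Same expression and output fields as the failing test.
# With the bug present, this raises:
#   MilvusException: (code=1, message=No fields returned)
res = collection.query(expr="int64 in [0, 1]", output_fields=["int64"])
print(res)
```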

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1575/pipeline

log: artifacts-kafka-cluster-upgrade-1575-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-1575-server-first-deployment-logs.tar.gz

Anything else?

collection name: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000

yanliang567 commented 1 year ago

/assign @jaime0815 /unassign

jaime0815 commented 1 year ago

/assign @weiliu1031

bigsheeper commented 1 year ago

balancer triggers a channel task:

[2023/11/13 15:07:39.793 +00:00] [INFO] [balance/utils.go:98] ["Create Channel task"] [collection=445614781037316785] [replica=445615151646769279] [channel=by-dev-rootcoord-dml_0_445614781037316785v0] [From=-1] [To=15]

channel was watched successfully:

[2023/11/13 15:07:40.300 +00:00] [INFO] [querynode/impl.go:391] ["successfully watchDmChannelsTask"] [collectionID=445614781037316785] [nodeID=15] [channels="[by-dev-rootcoord-dml_0_445614781037316785v0]"]

However, a query to node 22 failed because the query shard does not exist:

[2023/11/13 15:07:44.738 +00:00] [ERROR] [querynode/shard_cluster.go:896] ["Query 22 failed, reason query shard(channel)  by-dev-rootcoord-dml_0_445614781037316785v0  does not exist\n err %!w(<nil>)"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:896\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).queryWithDmlChannel\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1158\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Query.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1243\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75"]
weiliu1031 commented 1 year ago

balancer triggers a channel-release task:

[2023/11/13 15:07:39.793 +00:00] [INFO] [balance/utils.go:98] ["Create Channel task"] [collection=445614781037316785] [replica=445615151646769279] [channel=by-dev-rootcoord-dml_0_445614781037316785v0] [From=-1] [To=15]

channel was released successfully:

[2023/11/13 15:07:44.288 +00:00] [INFO] [querynode/impl.go:459] ["unsubDmChannelTask WaitToFinish done"] [traceID=fc78855fd009402] [collectionID=445614781037316785] [channel=by-dev-rootcoord-dml_0_445614781037316785v0]

This led to the query failing because the query shard does not exist:

[2023/11/13 15:07:44.738 +00:00] [ERROR] [querynode/shard_cluster.go:896] ["Query 22 failed, reason query shard(channel)  by-dev-rootcoord-dml_0_445614781037316785v0  does not exist\n err %!w(<nil>)"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:896\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).queryWithDmlChannel\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1158\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Query.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1243\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75"]

So, why would the balancer trigger a channel-release task? @weiliu1031

The search/query failures were caused by the channel having been unwatched, but this was not triggered by a balance task; the scenario is more complicated than balancing:

  1. In the first load of the collection with 2 replicas, querycoord loaded channelv0 on two querynodes.
  2. In the first release, it only released one of the loaded channelv0 instances, then reported that releasing the collection succeeded.
  3. In the second release, it did nothing because the collection was not loaded.
  4. In the second load of the collection with 1 replica, querycoord loaded channelv0 on another node, so there were two channelv0 instances in the same replica. Querycoord then chose the old one to release, and some search/query requests failed after the channel was released, because the proxy only updates its shard-leader cache every 3 seconds and keeps sending search/query requests in the meantime (see the sketch below).

The channel leaking after the collection is released is unexpected and is still under investigation.
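
To make the sequence concrete, the steps above roughly correspond to the following client-side calls. This is an illustrative pymilvus sketch, not the actual CI test code; the collection name and endpoint are placeholders, and the comments describe the server-side behavior reported in this issue:

```python
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")   # placeholder endpoint
collection = Collection("deploy_test_collection")      # placeholder collection name

# 1. First load with 2 replicas: querycoord watches channelv0 on two querynodes.
collection.load(replica_number=2)

# 2. First release: per the analysis, only one of the loaded channelv0 instances is
#    actually released, yet the release is reported as successful.
collection.release()

# 3. Second release: effectively a no-op, since the collection is considered not loaded.
collection.release()

# 4. Second load with 1 replica: channelv0 is watched on another node, so two
#    channelv0 instances end up in the same replica; the old one is then released,
#    and queries routed via the proxy's stale shard-leader cache fail for a few
#    seconds until the cache refreshes (about every 3 seconds).
collection.load(replica_number=1)
res = collection.query(expr="int64 in [0, 1]", output_fields=["int64"])
```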

bigsheeper commented 1 year ago

The target node ID does not match: expected 15, actual 22 (which did not subscribe to the channel).

[2023/11/13 15:07:47.893 +00:00] [WARN] [proxy/task_query.go:404] ["invalid shard leaders cache, updating shardleader caches and retry query"] [traceID=31629654ae7227d] [error="code: UnexpectedError, error: fail to Query, QueryNode ID = 15, reason=Query 22 failed, reason query shard(channel)  by-dev-rootcoord-dml_0_445614781037316785v0  does not exist\n err %!w(<nil>)"]
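
The warning above comes from the proxy invalidating its shard-leader cache and retrying when a node reports that the query shard does not exist. The real logic is Go code in proxy/task_query.go; the following is only a rough Python sketch of that retry-on-stale-cache pattern, with all names invented for illustration:

```python
import time

def query_with_leader_cache(cache, execute_query, expr, retries=3, refresh_interval=3.0):
    """Retry a query when the cached shard leaders turn out to be stale.

    `cache` and `execute_query` are hypothetical stand-ins for the proxy's
    shard-leader cache and its per-leader query RPC fan-out.
    """
    for _ in range(retries):
        leaders = cache.get_shard_leaders()      # possibly stale leader view
        try:
            return execute_query(leaders, expr)  # fan out to the cached leaders
        except RuntimeError:                     # e.g. "query shard ... does not exist"
            cache.invalidate()                   # drop the stale cache entry
            time.sleep(refresh_interval)         # wait for the cache to refresh (~3s)
    raise RuntimeError("fail to query on all shard leaders")
```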
weiliu1031 commented 1 year ago

The target node ID does not match: expected 15, actual 22 (which did not subscribe to the channel).

[2023/11/13 15:07:47.893 +00:00] [WARN] [proxy/task_query.go:404] ["invalid shard leaders cache, updating shardleader caches and retry query"] [traceID=31629654ae7227d] [error="code: UnexpectedError, error: fail to Query, QueryNode ID = 15, reason=Query 22 failed, reason query shard(channel)  by-dev-rootcoord-dml_0_445614781037316785v0  does not exist\n err %!w(<nil>)"]

Node 22 is a worker, which has a leaked channel.

weiliu1031 commented 1 year ago

Concurrent segment release caused the query shard service to be released before the channel was released.
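
In other words, because the per-channel query shard service and the channel itself are torn down concurrently, there is a window in which the node still appears as a shard leader but can no longer serve the shard. A toy sketch of that window (illustration only; this is not Milvus code and all names are invented):

```python
import threading
import time

shard_service_alive = True  # per-channel query shard service on the node
channel_watched = True      # the DML channel subscription on the node

def release_concurrently():
    """Buggy ordering: drop the shard service before unwatching the channel."""
    global shard_service_alive, channel_watched
    shard_service_alive = False
    time.sleep(0.1)          # window during which incoming queries fail
    channel_watched = False

def handle_query():
    if not shard_service_alive:
        raise RuntimeError("query shard(channel) does not exist")
    return []

t = threading.Thread(target=release_concurrently)
t.start()
time.sleep(0.05)             # a query arrives inside the window
try:
    handle_query()
except RuntimeError as err:
    print(err)               # -> query shard(channel) does not exist
t.join()
```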

yanliang567 commented 1 year ago

Fixed in v2.2.15.