milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [hybrid_search] Unexpected error when "reqs" is an empty list for the interface "hybrid_search" #32288

Open binbinlv opened 2 months ago

binbinlv commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4 latest
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka): all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.1rc10
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Unexpected error when "reqs" is an empty list for the interface "hybrid_search"

(code=1, message=Unexpected error, message=<list index out of range>)> (api_request.py:46)

Expected Behavior

Report a Milvus error rather than an unexpected error.

When nq is set to (maximum + 1), it reports the error "(code=65535, message=nq [16385] is invalid, nq (number of search vector per search request) should be in range [1, 16384], but got 16385)".

It would be better to return the same kind of error when nq = 0 (smaller than the minimum value, 1).

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L1)
    @pytest.mark.parametrize("nq", [0, 16385])
    def test_hybrid_search_normal_over_max_nq(self, nq):
        """
        target: test hybrid search with nq outside the valid range
        method: create connection, collection, insert and hybrid search with nq = 0 or nq = 16385
        expected: raise an exception that reports the valid nq range
        """
        # 1. initialize collection with data
        collection_w = self.init_collection_general(prefix, True)[0]
        # 2. extract vector field name
        vector_name_list = cf.extract_vector_field_name_list(collection_w)
        vector_name_list.append(ct.default_float_vec_field_name)
        # 3. prepare search params
        req_list = []
        weights = [1]
        vectors = cf.gen_vectors_based_on_vector_type(nq, default_dim, "FLOAT_VECTOR")
        # 4. get hybrid search req list
        for i in range(len(vector_name_list)):
            search_param = {
                "data": vectors,
                "anns_field": vector_name_list[i],
                "param": {"metric_type": "COSINE"},
                "limit": default_limit,
                "expr": "int64 > 0"}
            req = AnnSearchRequest(**search_param)
            req_list.append(req)
        # 5. hybrid search
        err_msg = "nq (number of search vector per search request) should be in range [1, 16384]"
        collection_w.hybrid_search(req_list, WeightedRanker(*weights), default_limit,
                                   check_task=CheckTasks.err_res,
                                   check_items={"err_code": 65535,
                                                "err_msg": err_msg})

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 2 months ago

/assign @czs007

yanliang567 commented 2 months ago

/unassign

czs007 commented 2 months ago

  File "/home/czs/pymilvus/pymilvus/decorators.py", line 143, in handler
    return func(*args, **kwargs)
  File "/home/czs/pymilvus/pymilvus/decorators.py", line 182, in handler
    return func(self, *args, **kwargs)
  File "/home/czs/pymilvus/pymilvus/decorators.py", line 124, in handler
    raise e from e
  File "/home/czs/pymilvus/pymilvus/decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "/home/czs/pymilvus/pymilvus/client/grpc_handler.py", line 820, in hybrid_search
    search_request = Prepare.search_requests_with_expr(
  File "/home/czs/pymilvus/pymilvus/client/prepare.py", line 612, in search_requests_with_expr
    elif isinstance(data[0], bytes):
IndexError: list index out of range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "hybrid_search3.py", line 44, in <module>
    hybrid_res = hello_milvus.hybrid_search(req_list, WeightedRanker(*weights), default_limit, output_fields=["random"])
  File "/home/czs/pymilvus/pymilvus/orm/collection.py", line 936, in hybrid_search
    resp = conn.hybrid_search(
  File "/home/czs/pymilvus/pymilvus/decorators.py", line 165, in handler
    raise MilvusException(message=f"Unexpected error, message=<{e!s}>") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Unexpected error, message=<list index out of range>)>

This exception is raised on the client side (pymilvus).
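
The traceback ends at `data[0]` in `prepare.py`, so an empty `data` list is indexed before any range check runs. Until the client reports this case cleanly, a caller can guard against it before invoking `hybrid_search`. The sketch below is a hypothetical pre-flight check and is not part of pymilvus: `validate_hybrid_inputs` and `MAX_NQ` are illustrative names, and the bound 16384 is taken from the server error message quoted earlier in this issue. It works on plain lists of vectors, so it runs without a Milvus connection.

```python
from typing import Sequence

# Upper bound assumed from the server error message quoted in this issue.
MAX_NQ = 16384


def validate_hybrid_inputs(per_request_data: Sequence[Sequence]) -> None:
    """Raise ValueError early instead of hitting IndexError inside the client.

    per_request_data holds, for each planned AnnSearchRequest, the list of
    query vectors that would be passed as its "data" argument.
    """
    if len(per_request_data) == 0:
        raise ValueError("reqs must contain at least one search request")
    for i, data in enumerate(per_request_data):
        nq = len(data)
        if not 1 <= nq <= MAX_NQ:
            raise ValueError(
                f"request {i}: nq [{nq}] is invalid, "
                f"should be in range [1, {MAX_NQ}]"
            )
```

Calling `validate_hybrid_inputs([vectors, vectors])` just before building the `AnnSearchRequest` objects would turn both the empty-list case and the nq = 0 case into a readable error.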

stale[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

xxxfzxxx commented 1 week ago

I got the same error. How can I resolve it?

xiaofan-luan commented 1 week ago

@czs007 is this still an issue? I'm assuming this is because some parameter is wrong?

xxxfzxxx commented 1 week ago

search_param_dense = {
    "data": dense_embeddings,
    "anns_field": "dense_vector",
    "param": {
        "metric_type": "COSINE",
        "params": {"nprobe": 10}
    },
    "limit": 100  # TODO hybrid search bug https://github.com/milvus-io/milvus/issues/32288
}
search_param_sparse = {
    "data": sparse_embeddings,
    "anns_field": "sparse_vector",
    "param": {
        "metric_type": "IP",
        "params": {"nprobe": 10}
    },
    "limit": 100  # TODO
}

I used to set the limit to col.num_entities, which was 24007, and it says the range should be within [1, 16384].

czs007 commented 1 week ago

@xxxfzxxx please try the latest 2.4.4 pymilvus

For a search, the limit indeed cannot exceed 16384.
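
When the desired limit comes from something like `col.num_entities`, one way to stay inside the allowed range is to clamp it before building the search parameters. This is a minimal sketch: `clamp_limit` and `MAX_LIMIT` are hypothetical names, and the value 16384 is assumed from the range shown in the error message quoted earlier in this issue.

```python
# Assumed from the range [1, 16384] in the error message quoted in this issue.
MAX_LIMIT = 16384


def clamp_limit(requested: int, max_limit: int = MAX_LIMIT) -> int:
    """Return a search limit that lies inside [1, max_limit]."""
    if requested < 1:
        raise ValueError("limit must be at least 1")
    return min(requested, max_limit)
```

For example, `clamp_limit(24007)` returns 16384, while `clamp_limit(100)` is returned unchanged.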

xxxfzxxx commented 1 week ago

Is there a reason?

czs007 commented 1 week ago

@xxxfzxxx The conventional search lacks an iterative interface. We incorporate a limit constraint to avoid returning an excessive amount of data at once, thus preventing OOM (Out of Memory) errors.

Has the issue mentioned in the error message been resolved after upgrading pymilvus?

xxxfzxxx commented 1 week ago

"Has the issue mentioned in the error message been resolved after upgrading pymilvus?"

NO.

xxxfzxxx commented 1 week ago

I don't understand the difference between limit: 10 and limit: 1000. You will eventually calculate the similarity scores across all entities and select the top 10 or the top 1000 anyway. Why does the limit matter here?

I am using hybrid search, and I would like to search across all entities and find the top k with WeightedRanker(0.4, 0.6). But if I set the limit to 10, the retrieved "sparse" entities sometimes do not overlap with the "dense" entities. How should I handle this case?
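
To make the overlap question concrete, here is a simplified model of weighted fusion. It is a sketch for illustration only and not Milvus's actual `WeightedRanker` implementation: `weighted_fuse` is a hypothetical helper, scores are assumed to be already normalized to [0, 1], and an entity missing from one candidate list is assumed to contribute 0 for that list.

```python
from typing import Dict, List


def weighted_fuse(
    result_lists: List[Dict[str, float]],
    weights: List[float],
    limit: int,
) -> List[str]:
    """Fuse per-request results by a weighted sum of scores.

    Each dict maps an entity id to a normalized score in [0, 1].
    An entity that is absent from a list contributes 0 for that list.
    """
    if len(result_lists) != len(weights):
        raise ValueError("need exactly one weight per result list")
    fused: Dict[str, float] = {}
    for scores, weight in zip(result_lists, weights):
        for entity_id, score in scores.items():
            fused[entity_id] = fused.get(entity_id, 0.0) + weight * score
    # Highest fused score first; ties are broken by id for determinism.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity_id for entity_id, _ in ranked[:limit]]


sparse_hits = {"a": 0.9, "b": 0.8}
dense_hits = {"c": 0.95, "b": 0.7}
top = weighted_fuse([sparse_hits, dense_hits], [0.4, 0.6], limit=2)
```

In this toy run, `top` is `["b", "c"]`: entity `b` wins because it appears in both lists, while `a` and `c` are scored from one list only. Under this model, a small per-request candidate list makes such one-sided scores more common. If each `AnnSearchRequest` accepts its own `limit` that is larger than the final limit passed to `hybrid_search` (an assumption to verify against the pymilvus version in use), fetching more candidates per request gives the ranker more overlapping entities to work with.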

xiaofan-luan commented 1 week ago

  1. The larger the topk, the worse the performance.
  2. Milvus splits data into small segments, each holding roughly 100k to 1M rows. If topk gets close to the segment size, the index is no longer effective.

You can use range search to get more vectors.
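
As a rough illustration of the range search suggestion, the sketch below builds the `param` dict for a search request. The `radius` and `range_filter` keys are the range-search parameters described in the Milvus documentation; the helper `range_search_param` itself is hypothetical, and the boundary semantics in the comments should be checked against the Milvus version in use. How range search interacts with `limit` is not covered here.

```python
def range_search_param(metric_type: str, radius: float, range_filter: float) -> dict:
    """Build a search "param" dict that restricts results to a score range."""
    if metric_type in ("IP", "COSINE"):
        # Larger score means more similar: keep radius < score <= range_filter.
        if not radius < range_filter:
            raise ValueError("for IP/COSINE, radius must be smaller than range_filter")
    elif metric_type == "L2":
        # Smaller distance means closer: keep range_filter <= distance < radius.
        if not range_filter < radius:
            raise ValueError("for L2, range_filter must be smaller than radius")
    else:
        raise ValueError(f"unsupported metric_type: {metric_type}")
    return {
        "metric_type": metric_type,
        "params": {"radius": radius, "range_filter": range_filter},
    }


# Example: keep only hits whose cosine similarity is above 0.5 and at most 0.9.
dense_param = range_search_param("COSINE", radius=0.5, range_filter=0.9)
```

The resulting dict would take the place of the `"param"` value in a search request such as `search_param_dense` above.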