milvus-io / milvus-lite

A lightweight version of Milvus
Apache License 2.0

[Bug]: the limit parameter of hybrid_search() does not have any effect #189

Closed: tombolano closed this issue 2 weeks ago

tombolano commented 2 months ago

I am using Milvus Lite 2.4.8 and pymilvus 2.4.4.

This bug can be reproduced with the example code from the Milvus hybrid search documentation (https://milvus.io/docs/multi-vector-search.md). Here is the complete code; the only changes I made were to have the connect call use a local file and to set the index type to HNSW:

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType, AnnSearchRequest, WeightedRanker
import random

# Connect to Milvus
connections.connect(uri="test.db")

# Create schema
fields = [
    FieldSchema(name="film_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="filmVector", dtype=DataType.FLOAT_VECTOR, dim=5), # Vector field for film vectors
    FieldSchema(name="posterVector", dtype=DataType.FLOAT_VECTOR, dim=5)] # Vector field for poster vectors

schema = CollectionSchema(fields=fields,enable_dynamic_field=False)

# Create collection
collection = Collection(name="test_collection", schema=schema)

# Create index for each vector field
index_params = {
    "metric_type": "L2",
    "index_type": "HNSW",
    "params": {"nlist": 128},
}

collection.create_index("filmVector", index_params)
collection.create_index("posterVector", index_params)

# Generate random entities to insert
entities = []

for _ in range(1000):
    # generate random values for each field in the schema
    film_id = random.randint(1, 1000)
    film_vector = [ random.random() for _ in range(5) ]
    poster_vector = [ random.random() for _ in range(5) ]

    # create a dictionary for each entity
    entity = {
        "film_id": film_id,
        "filmVector": film_vector,
        "posterVector": poster_vector
    }

    # add the entity to the list
    entities.append(entity)

collection.insert(entities)

# Create ANN search request 1 for filmVector
query_filmVector = [[0.8896863042430693, 0.370613100114602, 0.23779315077113428, 0.38227915951132996, 0.5997064603128835]]

search_param_1 = {
    "data": query_filmVector, # Query vector
    "anns_field": "filmVector", # Vector field name
    "param": {
        "metric_type": "L2", # Must match the metric type used when creating the index on this field
        "params": {"nprobe": 10}
    },
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_1 = AnnSearchRequest(**search_param_1)

# Create ANN search request 2 for posterVector
query_posterVector = [[0.02550758562349764, 0.006085637357292062, 0.5325251250159071, 0.7676432650114147, 0.5521074424751443]]
search_param_2 = {
    "data": query_posterVector, # Query vector
    "anns_field": "posterVector", # Vector field name
    "param": {
        "metric_type": "L2", # Must match the metric type used when creating the index on this field
        "params": {"nprobe": 10}
    },
    "limit": 2 # Number of search results to return in this AnnSearchRequest
}
request_2 = AnnSearchRequest(**search_param_2)

# Use WeightedRanker to combine results with specified weights
# Assign a weight of 0.8 to the filmVector search and 0.2 to the posterVector search
rerank = WeightedRanker(0.8, 0.2)

# Store these two requests as a list in `reqs`
reqs = [request_1, request_2]

# Before conducting hybrid search, load the collection into memory.
collection.load()

res = collection.hybrid_search(
    reqs, # List of AnnSearchRequests created in step 1
    rerank, # Reranking strategy specified in step 2
    limit=2 # Number of final search results to return
)

print(res)

The hybrid_search call passes limit=2, so res should contain 2 entities, and the output should resemble the following, as stated in the docs:

["['id: 844, distance: 0.006047376897186041, entity: {}', 'id: 876, distance: 0.006422005593776703, entity: {}']"]

However, when I run the code, res contains 4 entities and the output is similar to the following:

data: ["['id: 942, distance: 0.7877938151359558, entity: {}', 'id: 753, distance: 0.7855104804039001, entity: {}', 'id: 778, distance: 0.19718925654888153, entity: {}', 'id: 540, distance: 0.19272640347480774, entity: {}']"], cost: 0
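
Note that 4 is the sum of the two per-request limits (2 + 2), which would be consistent with the per-request results being combined without the final cut to the top-level limit, although that is only a guess. For comparison, the behaviour I expected can be sketched in plain Python. This is not Milvus's implementation: `weighted_merge`, the `(id, score)` tuple shape, the made-up scores, and the assumption that scores are already normalized so that higher is better are all illustrative.

```python
from collections import defaultdict


def weighted_merge(result_lists, weights, limit):
    """Hypothetical sketch: merge per-request (id, score) lists, keep the top `limit`."""
    if len(result_lists) != len(weights):
        raise ValueError("one weight is required per result list")
    if limit < 1:
        raise ValueError("limit must be at least 1")

    # Sum the weighted score of each id across all per-request result lists.
    combined = defaultdict(float)
    for hits, weight in zip(result_lists, weights):
        for entity_id, score in hits:
            combined[entity_id] += weight * score

    # Rank by combined score, best first, then apply the final limit.
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up normalized scores for illustration; they are not taken from the run above.
film_hits = [(942, 0.98), (753, 0.97)]
poster_hits = [(778, 0.99), (540, 0.96)]

top = weighted_merge([film_hits, poster_hits], weights=[0.8, 0.2], limit=2)
print([entity_id for entity_id, _ in top])
```

The final slice is the step that appears to be missing: with limit=2 the merged list should never hold more than 2 entries, however many requests are passed.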

I tried changing the number of queries and the value of the limit parameter. Setting limit=0 raises an error, but any other value has no effect on the result.
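
Until the limit is honoured, one possible stopgap is to trim the hits on the client. A minimal sketch, assuming the hits for a query come back ordered best-first as in the output above; `trim_hits` is a hypothetical helper, not part of pymilvus.

```python
from itertools import islice


def trim_hits(hits, limit):
    """Hypothetical helper: keep only the first `limit` items of an already-ranked iterable."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return list(islice(hits, limit))


# Ids and scores copied from the 4-hit output shown in this report.
hits = [
    (942, 0.7877938151359558),
    (753, 0.7855104804039001),
    (778, 0.19718925654888153),
    (540, 0.19272640347480774),
]

trimmed = trim_hits(hits, 2)
print(trimmed)
```

With pymilvus this would presumably be applied per query, for example `trim_hits(res[0], 2)`. That usage is an assumption on my part and relies on the per-query result being iterable.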

codingjaguar commented 2 weeks ago

@junjiejiangjjj Is this fixed? If so, please close the issue.