milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: MilvusCollectionHybridSearchRetriever error when query words not in BM25SparseEmbedding corpus #35803

Open longyunfeigu opened 2 months ago

longyunfeigu commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.4.0
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=fail to search on QueryNode 111: worker(111) query failed: Assert "static_cast(field_meta.get_data_type()) == static_cast(info.type())" at /go/src/github.com/milvus-io/milvus/internal/core/src/query/Plan.cpp:48 => vector type must be the same, field sparse_vector - type VECTOR_SPARSE_FLOAT, search info type VECTOR_FLOAT)>

Expected Behavior

No response

Steps To Reproduce

from langchain_milvus.retrievers import MilvusCollectionHybridSearchRetriever
from langchain_milvus.utils.sparse import BM25SparseEmbedding
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    WeightedRanker,
    connections,
)

texts = [
    "In 'The Whispering Walls' by Ava Moreno, a young journalist named Sophia uncovers a decades-old conspiracy hidden within the crumbling walls of an ancient mansion, where the whispers of the past threaten to destroy her own sanity.",
    "In 'The Last Refuge' by Ethan Blackwood, a group of survivors must band together to escape a post-apocalyptic wasteland, where the last remnants of humanity cling to life in a desperate bid for survival.",
    "In 'The Memory Thief' by Lila Rose, a charismatic thief with the ability to steal and manipulate memories is hired by a mysterious client to pull off a daring heist, but soon finds themselves trapped in a web of deceit and betrayal.",
    "In 'The City of Echoes' by Julian Saint Clair, a brilliant detective must navigate a labyrinthine metropolis where time is currency, and the rich can live forever, but at a terrible cost to the poor.",
    "In 'The Starlight Serenade' by Ruby Flynn, a shy astronomer discovers a mysterious melody emanating from a distant star, which leads her on a journey to uncover the secrets of the universe and her own heart.",
    "In 'The Shadow Weaver' by Piper Redding, a young orphan discovers she has the ability to weave powerful illusions, but soon finds herself at the center of a deadly game of cat and mouse between rival factions vying for control of the mystical arts.",
    "In 'The Lost Expedition' by Caspian Grey, a team of explorers ventures into the heart of the Amazon rainforest in search of a lost city, but soon finds themselves hunted by a ruthless treasure hunter and the treacherous jungle itself.",
    "In 'The Clockwork Kingdom' by Augusta Wynter, a brilliant inventor discovers a hidden world of clockwork machines and ancient magic, where a rebellion is brewing against the tyrannical ruler of the land.",
    "In 'The Phantom Pilgrim' by Rowan Welles, a charismatic smuggler is hired by a mysterious organization to transport a valuable artifact across a war-torn continent, but soon finds themselves pursued by deadly assassins and rival factions.",
    "In 'The Dreamwalker's Journey' by Lyra Snow, a young dreamwalker discovers she has the ability to enter people's dreams, but soon finds herself trapped in a surreal world of nightmares and illusions, where the boundaries between reality and fantasy blur.",
]

sparse_embedding_func = BM25SparseEmbedding(corpus=texts)
CONNECTION_URI = "http://localhost:19530"

dense_embedding_func = OpenAIEmbeddings()
connections.connect(uri=CONNECTION_URI)
pk_field = "doc_id"
dense_field = "dense_vector"
sparse_field = "sparse_vector"
text_field = "text"
fields = [
    FieldSchema(
        name=pk_field,
        dtype=DataType.VARCHAR,
        is_primary=True,
        auto_id=True,
        max_length=100,
    ),
    FieldSchema(name=dense_field, dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name=sparse_field, dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name=text_field, dtype=DataType.VARCHAR, max_length=65_535),
]
schema = CollectionSchema(fields=fields, enable_dynamic_field=False)
collection = Collection(
    name="in4", schema=schema
)

dense_index = {"index_type": "FLAT", "metric_type": "IP"}
collection.create_index("dense_vector", dense_index)
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
collection.create_index("sparse_vector", sparse_index)
collection.flush()

entities = []
for text in texts:
    entity = {
        dense_field: dense_embedding_func.embed_documents([text])[0],
        sparse_field: sparse_embedding_func.embed_documents([text])[0],
        text_field: text,
    }
    entities.append(entity)
collection.insert(entities)
collection.load()

sparse_search_params = {"metric_type": "IP"}
dense_search_params = {"metric_type": "IP", "params": {}}
retriever = MilvusCollectionHybridSearchRetriever(
    collection=collection,
    rerank=WeightedRanker(0.5, 0.5),
    anns_fields=[dense_field, sparse_field],
    field_embeddings=[dense_embedding_func, sparse_embedding_func],
    field_search_params=[dense_search_params, sparse_search_params],
    top_k=3,
    text_field=text_field,
)

from pprint import pprint

pprint(retriever.invoke("who are you?"))
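
As the title says, the failure shows up when none of the query words occur in the BM25 corpus. The sketch below is a toy illustration of that condition and can be run without Milvus. `ToySparseEncoder`, its tokenizer, and its weighting are hypothetical stand-ins, not the `BM25SparseEmbedding` implementation. It only shows that a vocabulary-based sparse encoder can return an empty vector for such a query. Whether that empty vector is what causes the search request to be typed as VECTOR_FLOAT is an assumption that this report does not confirm.

```python
import math
import re
from collections import Counter


class ToySparseEncoder:
    """Toy BM25-style encoder (hypothetical, for illustration only)."""

    def __init__(self, corpus):
        tokenized = [self._tokenize(doc) for doc in corpus]
        vocab = sorted({tok for doc in tokenized for tok in doc})
        # Map each corpus token to a fixed sparse dimension.
        self.index = {tok: i for i, tok in enumerate(vocab)}
        n_docs = len(tokenized)
        doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
        # BM25-style IDF; the leading 1 keeps every weight positive.
        self.idf = {
            tok: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for tok, df in doc_freq.items()
        }

    @staticmethod
    def _tokenize(text):
        # Assumption: lowercase alphanumeric tokens, no stemming or stopwords.
        return re.findall(r"[a-z0-9]+", text.lower())

    def embed_query(self, text):
        # Tokens that never appeared in the corpus contribute nothing.
        vec = {}
        for tok in self._tokenize(text):
            if tok in self.index:
                vec[self.index[tok]] = self.idf[tok]
        return vec


corpus = [
    "a young journalist uncovers a decades-old conspiracy",
    "a brilliant detective must navigate a labyrinthine metropolis",
]
encoder = ToySparseEncoder(corpus)

in_vocab = encoder.embed_query("brilliant journalist")
out_of_vocab = encoder.embed_query("who are you?")

print(in_vocab)      # two non-zero entries
print(out_of_vocab)  # empty dict: no query token is in the corpus
```

If the real sparse embedding behaves the same way, checking the query embedding for emptiness before running the hybrid search would be one way to confirm or rule out this explanation.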

Milvus Log

No response

Anything else?

No response

yanliang567 commented 2 months ago

/assign @zhengbuqian
/unassign