run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Hybrid search for Milvus vector store fails #15081

Open dudeperf3ct opened 3 months ago

dudeperf3ct commented 3 months ago

Bug Description

Hybrid search with the Milvus vector store is not working.

Version

0.10.58

Steps to Reproduce

Here is the code I am using.

Data ingestion

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_files=[file_path])
documents = reader.load_data()

Dense Milvus vector store

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri="./milvus_dense.db",
    dim=EMBED_DIM,
    overwrite=True,
    enable_sparse=False,
    similarity_metric=DENSE_DISTANCE_METRICS,
)

embed_model = HuggingFaceEmbedding(model_name=DENSE_MODEL, device=DEVICE)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
)

dense_retriever = index.as_retriever(similarity_top_k=TOP_K)

The above code works correctly and I can retrieve the closest documents to the query.
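
For reference, this is how I query the retriever (the query string is just a placeholder):

nodes = dense_retriever.retrieve("example query about the document")
for node in nodes:
    print(node.score, node.node.get_content()[:100])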

Hybrid/sparse Milvus vector store

vector_store = MilvusVectorStore(
    uri="./milvus_sparse.db",
    dim=EMBED_DIM,
    overwrite=True,
    enable_sparse=True,
    hybrid_ranker="WeightedRanker",
    hybrid_ranker_params={"weights": [0.0, 1.0]},
)

embed_model = HuggingFaceEmbedding(model_name=DENSE_MODEL, device=DEVICE)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
)

This fails with the following error:

File ~/.cache/pypoetry/virtualenvs/rag-search-retrieval-hVw0f43b-py3.10/lib/python3.10/site-packages/pymilvus/client/entity_helper.py:311, in pack_field_value_to_field_data(field_value, field_data, field_info)
    309         raise ParamError(message="invalid input for sparse float vector: expect 1 row")
    310     if not entity_is_sparse_matrix(field_value):
--> 311         raise ParamError(message="invalid input for sparse float vector")
    312     field_data.vectors.sparse_float_vector.contents.append(
    313         sparse_rows_to_proto(field_value).contents[0]
    314     )
    315 elif field_type == DataType.VARCHAR:

ParamError: <ParamError: (code=1, message=invalid input for sparse float vector)>

I tested the example code for Milvus hybrid vector search; it works for that dataset.

I am not sure why the default sparse embedding function is not working here.
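
If I read the integration source correctly, the default sparse function used when enable_sparse=True is BGEM3SparseEmbeddingFunction from llama_index.vector_stores.milvus.utils (treat the class name as an assumption). A minimal sketch to inspect what it produces for a test sentence:

from llama_index.vector_stores.milvus.utils import BGEM3SparseEmbeddingFunction

sparse_fn = BGEM3SparseEmbeddingFunction()
rows = sparse_fn.encode_documents(["a short test sentence"])
# Each row should be a dict mapping token ids to weights, e.g. {1037: 0.12, ...}
print(type(rows), rows[0])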

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 3 months ago

Hey @dudeperf3ct! I'm here to help you with any bugs, questions, or contributions you have related to this repository. Let me know how I can assist you!

The error ParamError: (code=1, message=invalid input for sparse float vector) is likely due to the sparse embedding field not being correctly configured or the sparse embedding function not being properly provided.

To fix this, ensure that the enable_sparse flag is set to True and that a valid sparse_embedding_function is provided. Additionally, make sure that the collection schema includes the sparse_embedding_field with the correct data type (SPARSE_FLOAT_VECTOR).

Here's an example of how to correctly set up the MilvusVectorStore with sparse embeddings:

from typing import List

from FlagEmbedding import BGEM3FlagModel
from llama_index.vector_stores.milvus import MilvusVectorStore
from llama_index.vector_stores.milvus.utils import BaseSparseEmbeddingFunction

class ExampleEmbeddingFunction(BaseSparseEmbeddingFunction):
    def __init__(self):
        self.model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)

    def encode_queries(self, queries: List[str]):
        outputs = self.model.encode(
            queries,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def encode_documents(self, documents: List[str]):
        outputs = self.model.encode(
            documents,
            return_dense=False,
            return_sparse=True,
            return_colbert_vecs=False,
        )["lexical_weights"]
        return [self._to_standard_dict(output) for output in outputs]

    def _to_standard_dict(self, raw_output):
        result = {}
        for k in raw_output:
            result[int(k)] = raw_output[k]
        return result

# Setup MilvusVectorStore with sparse embeddings enabled
vector_store = MilvusVectorStore(
    dim=1536,
    collection_name="your_collection_name",
    uri="http://milvus_address:port",
    token="your_milvus_token_here",
    overwrite=True,
    enable_sparse=True,
    sparse_embedding_function=ExampleEmbeddingFunction(),  # Provide your sparse embedding function
    hybrid_ranker="WeightedRanker",  # or "RRFRanker"
    hybrid_ranker_params={"weights": [1.0, 1.0]}  # or {"k": 60} for RRFRanker
)

Ensure that the ExampleEmbeddingFunction class implements the necessary methods to convert text to a sparse embedding. This setup should resolve the ParamError and allow the hybrid search to function correctly [1][2].
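
As a minimal usage sketch on top of that store (the documents and the query string are placeholders): hybrid mode runs both the dense and sparse search and fuses the results with the configured ranker.

from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# "hybrid" requires enable_sparse=True on the vector store
retriever = index.as_retriever(vector_store_query_mode="hybrid", similarity_top_k=5)
nodes = retriever.retrieve("your query here")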

To continue talking to Dosu, mention @dosu.

202030481266 commented 1 month ago

Same issue here. I solved this by not using the splitter function.

milvus_simple_vector_index = VectorStoreIndex.from_documents(
    documents=[
        Document(
            text=source_docx_text,
            metadata={"file_name": docx_client.docx_file_path},
        )
    ],
    storage_context=storage_context,
)

But I wonder why it fails with nodes or split documents.
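
For reference, a minimal sketch of the variant that fails for me (the chunk sizes are arbitrary):

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Explicitly split the documents into nodes before indexing;
# this is the path that triggers the ParamError for me.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context)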

asisle commented 1 week ago

I am experiencing the same with split documents, but using LangChain rather than llama_index, so this definitely looks like an issue with the Milvus native SDK. Did you find a workaround for chunked documents? The full documents are just a bit too long, I am afraid.
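
One unverified guess: a chunk whose sparse encoding comes back empty may be what pymilvus rejects as an invalid sparse float vector. A sketch to check this, assuming the BGE-M3 sparse function discussed above (both the class name and the empty-row hypothesis are assumptions):

from llama_index.vector_stores.milvus.utils import BGEM3SparseEmbeddingFunction

sparse_fn = BGEM3SparseEmbeddingFunction()
chunks = [node.get_content() for node in nodes]  # however your chunks are produced
rows = sparse_fn.encode_documents(chunks)
# An empty dict would mean a chunk produced no sparse entries at all,
# a candidate cause of "invalid input for sparse float vector" (assumption).
empty = [i for i, row in enumerate(rows) if not row]
print("chunks with empty sparse vectors:", empty)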