milvus-io / milvus-model

The embedding/reranking model zoo helps users convert their unstructured data into embeddings
Apache License 2.0
22 stars 16 forks

Question about the BM25 implementation approach #46

Open qianxianyang opened 1 week ago

qianxianyang commented 1 week ago

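Hello, when implementing BM25, Milvus first turns each document into an embedding (treating the current document as the Query and the remaining documents as the Doc). For the real Query, the embedding vector is obtained from IDF, and the inner product of the two vectors is used as the similarity. This differs somewhat from the original BM25 formula. Could you explain the motivation for this implementation, and how the different implementations compare in performance? BM25 formula: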

\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}

codingjaguar commented 1 week ago
[Image: Screenshot 2024-11-13 at 18 15 26]

Basically, at search time the vector weights can be chosen carefully so that the dot product of the query sparse vector and the document sparse vector is equivalent to the BM25 equation.
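One way to read this claim, using the symbols of the formula quoted in the question, is the decomposition below. The per-term weights $d_t$ and $q_t$ and the query count $c(t, Q)$ (how many times term $t$ occurs among $q_1, \dots, q_n$) are notation introduced here for illustration; this is a sketch of the idea, not a description of Milvus internals.

```latex
\begin{aligned}
d_t &= \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}, \\
q_t &= c(t, Q) \cdot \text{IDF}(t), \\
\langle \mathbf{q}, \mathbf{d} \rangle &= \sum_{t} q_t \, d_t
  = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}
  = \text{BM25}(D, Q)
\end{aligned}
```

Because $q_t = 0$ for every term that does not occur in the query, the sum over the whole vocabulary collapses to the sum over query terms, which is the original formula.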

Milvus 2.5, which will be released in a week, adds native BM25 support and accepts text as input (so users don't need to compute the document and query vectors themselves).

xiaofan-luan commented 1 week ago

@qianxianyang Milvus 2.5 integrates BM25 natively, cheers!

xiaofan-luan commented 1 week ago

Essentially, the corpus vector captures TF, while the query vector captures the query's TF and IDF.
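A minimal Python sketch can check this numerically: it builds a TF-based sparse vector per document and a TF-and-IDF-based sparse vector for the query, then compares their dot product with BM25 computed directly from the formula. The toy corpus, the whitespace tokenizer, the values `K1 = 1.5` and `B = 0.75`, the smoothed IDF variant, and all function names are assumptions for illustration, not Milvus's actual implementation.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # commonly used BM25 parameters (assumed here)

corpus = [
    "artificial intelligence was founded as an academic discipline",
    "alan turing was the first person to conduct research in ai",
    "turing was born in london and raised in southern england",
]
docs = [text.split() for text in corpus]  # naive whitespace tokenizer (assumption)
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(term for d in docs for term in set(d))  # document frequency per term


def idf(term):
    # Smoothed IDF; one of several variants in common use
    n = df.get(term, 0)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def doc_vector(doc):
    # Document side: saturated, length-normalized term frequency only
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return {t: f * (K1 + 1) / (f + norm) for t, f in tf.items()}


def query_vector(query):
    # Query side: term count in the query times IDF
    return {t: c * idf(t) for t, c in Counter(query).items()}


def dot(q, d):
    # Sparse dot product over the terms the two vectors share
    return sum(w * d[t] for t, w in q.items() if t in d)


def bm25_direct(query, doc):
    # BM25 computed term by term from the formula, for comparison
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return sum(idf(t) * tf[t] * (K1 + 1) / (tf[t] + norm) for t in query)


query = "turing ai research".split()
for text, doc in zip(corpus, docs):
    via_vectors = dot(query_vector(query), doc_vector(doc))
    assert math.isclose(via_vectors, bm25_direct(query, doc))
    print(f"{via_vectors:.4f}  {text}")
```

Note that the document vectors depend on `avgdl`, a corpus-level statistic, so in this sketch they would need recomputing whenever the corpus changes.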

xiaofan-luan commented 1 week ago

We will soon publish performance benchmark results for Milvus 2.5 versus ES. Based on the current test results, we see roughly a 2-3x performance improvement.

qianxianyang commented 1 day ago

Thumbs up, looking forward to the test results.

codingjaguar commented 57 minutes ago

We have released Milvus 2.5 beta with the full text search feature available (https://github.com/milvus-io/milvus/releases/tag/v2.5.0-beta). The detailed documentation will be released soon, but here is a snippet:

from pymilvus import MilvusClient, DataType, Function, FunctionType

# Connect to a running Milvus 2.5 instance.
# The URI is an assumption (a default local deployment); adjust it to your setup.
client = MilvusClient(uri="http://localhost:19530")

schema = MilvusClient.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
# enable_analyzer=True lets Milvus tokenize the raw text
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

bm25_function = Function(
    name="text_bm25_emb",  # Function name
    input_field_names=["text"],  # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"],  # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

index_params = MilvusClient.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX",
    metric_type="BM25",
)

# create_collection, insert, and search are instance methods, so call them on the client
client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params,
)

# Only raw text is inserted; the BM25 function fills the "sparse" field
client.insert('demo', [
    {'text': 'Artificial intelligence was founded as an academic discipline in 1956.'},
    {'text': 'Alan Turing was the first person to conduct substantial research in AI.'},
    {'text': 'Born in Maida Vale, London, Turing was raised in southern England.'},
])

search_params = {
    'params': {'drop_ratio_search': 0.6},
}

# The query is passed as raw text as well
client.search(
    collection_name='demo',
    data=['Who started AI research?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params,
)

Feel free to check it out!