qianxianyang opened 1 week ago
Basically, at search time the scores can be computed so that the dot product of the query sparse vector and the document sparse vector is equivalent to the BM25 equation.
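This factorization can be sketched in plain Python. The corpus, tokenization, and parameter values below are made up for illustration: the document vector absorbs the length-normalized, saturated term frequency, and the query vector absorbs the IDF, so their dot product reproduces the BM25 score.

```python
import math

# Toy corpus (pre-tokenized); k1 and b are the usual BM25 defaults.
k1, b = 1.2, 0.75
docs = [["ai", "research", "history"], ["turing", "ai", "research"]]
avgdl = sum(len(d) for d in docs) / len(docs)
N = len(docs)

def idf(term):
    n = sum(term in d for d in docs)  # number of docs containing the term
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def doc_vector(doc):
    # Saturated, length-normalized TF per distinct term.
    return {t: doc.count(t) * (k1 + 1) /
               (doc.count(t) + k1 * (1 - b + b * len(doc) / avgdl))
            for t in set(doc)}

def query_vector(query):
    # The query vector only needs the IDF of each distinct term.
    return {t: idf(t) for t in set(query)}

def dot(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def bm25(query, doc):
    # Classic BM25, computed directly for comparison.
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        score += idf(t) * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

q = ["ai", "research"]
for d in docs:
    assert abs(dot(query_vector(q), doc_vector(d)) - bm25(q, d)) < 1e-9
```

The asserts check that the dot product and the direct BM25 computation agree on every document.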
Milvus 2.5, which will be released in a week, adds native BM25 support and accepts raw text as input (so users don't need to compute the document and query vectors themselves).
@qianxianyang Milvus 2.5 natively integrates BM25, cheers!
Essentially, the corpus vectors capture TF, while the query vector captures the query's TF and IDF.
We will soon publish benchmark results comparing Milvus 2.5 with ES; based on current testing, there is roughly a 2-3x performance improvement.
Nice, looking forward to the benchmark results.
We have released Milvus 2.5 beta with the full text search feature available (https://github.com/milvus-io/milvus/releases/tag/v2.5.0-beta). The detailed documentation will be released soon, but here is a snippet:
from pymilvus import MilvusClient, DataType, Function, FunctionType
schema = MilvusClient.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)
bm25_function = Function(
name="text_bm25_emb", # Function name
input_field_names=["text"], # Name of the VARCHAR field containing raw text data
output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)
index_params = MilvusClient.prepare_index_params()
index_params.add_index(
field_name="sparse",
index_type="AUTOINDEX",
metric_type="BM25"
)
# create_collection, insert, and search are instance methods, so connect a client first
client = MilvusClient(uri="http://localhost:19530")  # adjust the URI for your deployment
client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)
client.insert('demo', [
    {'text': 'Artificial intelligence was founded as an academic discipline in 1956.'},
    {'text': 'Alan Turing was the first person to conduct substantial research in AI.'},
    {'text': 'Born in Maida Vale, London, Turing was raised in southern England.'},
])
search_params = {
    'params': {'drop_ratio_search': 0.6},
}
client.search(
    collection_name='demo',
    data=['Who started AI research?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)
Feel free to check it out!
Hi, my understanding is that when implementing BM25, Milvus embeds each document by treating the current document as the Query and the remaining documents as the Docs. At search time, the real query's embedding vector is obtained via IDF, and the final similarity is the inner product of the two vectors. This still differs from the original BM25 formula. Could you explain the motivation for this implementation, and how the performance of the different implementations compares? (BM25 formula image attached)
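For reference, the standard Okapi BM25 scoring function that the comment above refers to (the original post attached it as an image):

```latex
\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \ln\!\left(\frac{N - n(t) + 0.5}{n(t) + 0.5} + 1\right)
```

where $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, $\mathrm{avgdl}$ is the average document length over the corpus, $N$ is the number of documents, $n(t)$ is the number of documents containing $t$, and $k_1$, $b$ are free parameters (commonly $k_1 \approx 1.2$, $b \approx 0.75$).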