yidasanqian opened 2 months ago
# The stored content is Chinese text describing inorganic pre-coated panels:
# an environmentally friendly board material that is fire-resistant, antibacterial,
# corrosion-resistant, and easy to clean, with good durability and no harmful emissions.
entity["content"] = """
无机预涂板是一种环保板材。无机预涂板通常采用防火、抗菌、耐腐蚀和易清洁等,能够有效提高建筑物的装修质量和性能。\n以下是无机预涂板的环保特点:\n无机材料:无机预涂板基板采用无石棉硅酸钙板,不含有害的有机物,不会释放有害气体,不会对室内空气质量造成污染。\n绿色环保:无机预涂板符合绿色环保要求,不含有害物质,是一种绿色环保的装饰材料。\n耐久性:无机预涂板具有良好的耐久性,不易腐烂、老化、脆化和变形,使用寿命长,不会频繁更换,减少资源浪费。\n总之,无机预涂板是一种环保板材,符合绿色环保要求,对室内空气质量和人体健康无害,同时具有不错的装饰效果和耐久性。
"""
"bm25_msmarco_v1.json" was fitted on an English corpus only; you need to fit the parameters on your own documents. Here is a code example:
from pymilvus import MilvusClient, DataType
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer

# Build a Chinese analyzer and fit BM25 statistics on your own corpus
analyzer = build_default_analyzer(language="zh")
docs = [
    "无机预涂板是一种具有优良性能的环保材料,常被应用于防火、抗菌、耐化学腐蚀等领域。",
    "无机预涂板以其卓越的耐火性、抗菌性和易维护性,被广泛应用于各类建筑场景。",
    "无机预涂板拥有防火、耐腐蚀、易清洁等特点,成为现代建筑中环保材料的首选。",
    "无机预涂板兼具环保和实用性,具有防火、抗菌、耐酸碱等多种优异性能。",
    "无机预涂板由于其出色的耐火性能、抗菌功能和环保特性,广泛应用于医院、实验室等场所。"
]
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(docs)

docs_embeddings = bm25_ef.encode_documents(docs)
query = '无机预涂板有耐火性吗?'
query_embeddings = bm25_ef.encode_queries([query])

client = MilvusClient(uri='test.db')
schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)
schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)

client.create_collection(collection_name="test_sparse_vector", schema=schema)

# Create a sparse inverted index; sparse vectors use the inner-product (IP) metric
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)
client.create_index(collection_name="test_sparse_vector", index_params=index_params)

# Insert each document with its sparse embedding (row i of the sparse matrix)
for i in range(len(docs)):
    entity = {'sparse_vector': docs_embeddings[[i]], 'text': docs[i]}
    client.insert(collection_name="test_sparse_vector", data=entity)

search_params = {
    "metric_type": "IP",
    "params": {}
}
results = client.search(
    collection_name="test_sparse_vector",
    data=query_embeddings[[0]],
    output_fields=['text'],
    search_params=search_params,
)
print(results)
Documents are dynamically added to Milvus and number more than 1 million. Do I have to do a full fit over all documents every time I execute a BM25 query?
Although it is mathematically correct for BM25 to be fitted on all inserted documents, a more practical approach is to save your parameters after fitting on a large number of texts, and then load these saved parameters at query time to avoid refitting.
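The pymilvus `BM25EmbeddingFunction` exposes `save()` and `load()` for this purpose. The underlying idea, persisting the fitted corpus statistics so queries can be encoded without touching the corpus again, can be sketched in plain Python. The `Bm25Params` class and its JSON file format below are illustrative only, not the library's actual format:

```python
import json
import math
from dataclasses import dataclass, field

@dataclass
class Bm25Params:
    """Illustrative container for fitted BM25 corpus statistics."""
    df: dict = field(default_factory=dict)  # term -> number of docs containing it
    num_docs: int = 0                       # corpus size
    avgdl: float = 0.0                      # average document length

    def fit(self, tokenized_docs):
        # Fit over the whole corpus: count document frequencies and lengths
        total_len = 0
        for tokens in tokenized_docs:
            total_len += len(tokens)
            for term in set(tokens):
                self.df[term] = self.df.get(term, 0) + 1
        self.num_docs = len(tokenized_docs)
        self.avgdl = total_len / max(self.num_docs, 1)

    def idf(self, term):
        # Standard BM25 inverse document frequency
        n = self.df.get(term, 0)
        return math.log((self.num_docs - n + 0.5) / (n + 0.5) + 1.0)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"df": self.df, "num_docs": self.num_docs,
                       "avgdl": self.avgdl}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            d = json.load(f)
        return cls(df=d["df"], num_docs=d["num_docs"], avgdl=d["avgdl"])
```

With this pattern you fit once offline, call `save()`, and at query time a fresh process only needs `load()` before encoding queries.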
These documents take up about 32 GB of memory. I need to load them all into memory, then execute fit(), and finally call save(), right? Do I need to do this process every time I add a document? Is there a way to incrementally update the parameters?
Yes. Currently there is no incremental update for BM25, and it is planned. Also, Milvus will support native BM25, so please stay tuned.
code:
traceback output:
What's the reason? How to solve it?