milvus-io / milvus-model

The embedding/reranking model zoo helps users convert their unstructured data into embeddings
Apache License 2.0
18 stars 12 forks

invalid input for sparse float vector #35

Open yidasanqian opened 1 week ago

yidasanqian commented 1 week ago

Code:

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction

analyzer = build_default_analyzer(language="zh")
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.load("D:/Downloads/bm25_msmarco_v1.json")

def test():
  entities = [....]
  for entity in entities:
    docs_embeddings = bm25_ef.encode_documents([entity["content"]])
    # Convert the csr_array row to a {dimension: value} dict for the sparse vector field
    sparse_vector = {int(idx): float(val) for idx, val in zip(docs_embeddings[0].indices, docs_embeddings[0].data)}
    entity["content_sparse"] = sparse_vector
  res = client.upsert(collection_name=INDEX_NAME, data=entities)
  return res["ids"]

Traceback output:

  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\Develop\CodeProjects\mkb\src\tests\mkb\storage\build_1k_data_for_milvus.py", line 52, in append_rag_eval_entry
    res = client.upsert(collection_name=INDEX_NAME, data=entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 276, in upsert
    raise ex from ex
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 272, in upsert
    res = conn.upsert_rows(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 148, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 144, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 183, in handler
    return func(self, *args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 123, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 715, in upsert_rows
    request = self._prepare_row_upsert_request(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 696, in _prepare_row_upsert_request
    return Prepare.row_upsert_param(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 461, in row_upsert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 389, in _parse_row_request
    entity_helper.pack_field_value_to_field_data(v, field_data, field_info)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\entity_helper.py", line 361, in pack_field_value_to_field_data
    raise ParamError(message="invalid input for sparse float vector")
pymilvus.exceptions.ParamError: <ParamError: (code=1, message=invalid input for sparse float vector)>

What is the cause, and how can I solve it?

yidasanqian commented 1 week ago

entity["content"] = """
无机预涂板是一种环保板材。无机预涂板通常采用防火、抗菌、耐腐蚀和易清洁等,能够有效提高建筑物的装修质量和性能。\n以下是无机预涂板的环保特点:\n无机材料:无机预涂板基板采用无石棉硅酸钙板,不含有害的有机物,不会释放有害气体,不会对室内空气质量造成污染。\n绿色环保:无机预涂板符合绿色环保要求,不含有害物质,是一种绿色环保的装饰材料。\n耐久性:无机预涂板具有良好的耐久性,不易腐烂、老化、脆化和变形,使用寿命长,不会频繁更换,减少资源浪费。\n总之,无机预涂板是一种环保板材,符合绿色环保要求,对室内空气质量和人体健康无害,同时具有不错的装饰效果和耐久性。
"""

wxywb commented 1 week ago

"bm25_msmarco_v1.json" is only for English corpora; you need to fit the parameters on your own documents. Here is a code example:

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus import MilvusClient, DataType

analyzer = build_default_analyzer(language="zh")

docs = [
    "无机预涂板是一种具有优良性能的环保材料,常被应用于防火、抗菌、耐化学腐蚀等领域。",
    "无机预涂板以其卓越的耐火性、抗菌性和易维护性,被广泛应用于各类建筑场景。",
    "无机预涂板拥有防火、耐腐蚀、易清洁等特点,成为现代建筑中环保材料的首选。",
    "无机预涂板兼具环保和实用性,具有防火、抗菌、耐酸碱等多种优异性能。",
    "无机预涂板由于其出色的耐火性能、抗菌功能和环保特性,广泛应用于医院、实验室等场所。"
]

bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(docs)

docs_embeddings = bm25_ef.encode_documents(docs)

query = '无机预涂板有耐火性吗?'

query_embeddings = bm25_ef.encode_queries([query])

client = MilvusClient(uri='test.db')

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)

# Describe the index on the sparse vector field
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)

client.create_collection(collection_name="test_sparse_vector", schema=schema)

# Create index
client.create_index(collection_name="test_sparse_vector", index_params=index_params)

# Load the collection before searching (a Milvus server requires this)
client.load_collection(collection_name="test_sparse_vector")

search_params = {
    "metric_type": "IP",
    "params": {}
}
# Insert one entity per document; docs_embeddings[[i]] keeps row i as a 1 x N sparse matrix
for i in range(len(docs)):
    entity = {'sparse_vector': docs_embeddings[[i]], 'text': docs[i]}
    client.insert(collection_name="test_sparse_vector", data=entity)

results = client.search(
    collection_name="test_sparse_vector",
    data=query_embeddings[[0]],
    output_fields=['text'],
    search_params=search_params,
)
print(results)
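
One likely explanation for the original error, stated as an assumption rather than a confirmed diagnosis: if none of the Chinese tokens appear in the vocabulary stored in the English parameter file, the encoded row has no nonzero values, the converted dict is empty, and the pymilvus version in the traceback may reject an empty sparse row. The sketch below shows a guard that catches this before calling upsert. `to_sparse_row` is a hypothetical helper name, not part of pymilvus.

```python
def to_sparse_row(indices, values):
    """Build the {dimension: weight} dict for one sparse vector, rejecting empty rows."""
    row = {int(i): float(v) for i, v in zip(indices, values)}
    if not row:
        # Nothing matched the fitted vocabulary, so there is nothing to store.
        raise ValueError("sparse vector is empty: no token matched the fitted vocabulary")
    return row


# A row with two nonzero dimensions converts cleanly.
print(to_sparse_row([3, 17], [0.42, 1.30]))

# An all-zero row (no indices, no values) is caught before it reaches upsert.
try:
    to_sparse_row([], [])
except ValueError as err:
    print(f"skipped: {err}")
```

In the reporter's loop, the two arguments would be `docs_embeddings[0].indices` and `docs_embeddings[0].data`.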

yidasanqian commented 1 week ago

Documents are added to Milvus dynamically, and there are more than 1 million of them. Do I have to refit on all documents every time I run a BM25 query?

wxywb commented 1 week ago

Strictly speaking, BM25 should be fitted on all inserted documents. A more practical approach is to fit once on a large number of texts, save the resulting parameters, and then load them at query time to avoid refitting.
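
To illustrate the fit-once, load-later idea, here is a minimal standard-library sketch. It is not the milvus-model implementation or its file format: `fit_stats` and `idf` are hypothetical names, the corpus is a pre-tokenized toy example, and the IDF is the Lucene-style variant, chosen here as an assumption. With the library itself, the corresponding calls are the `fit`, `save`, and `load` methods mentioned in this thread.

```python
import json
import math
import os
import tempfile
from collections import Counter


def fit_stats(tokenized_docs):
    """Collect the corpus statistics BM25 needs: size, average length, document frequencies."""
    df = Counter()
    total_len = 0
    for tokens in tokenized_docs:
        total_len += len(tokens)
        df.update(set(tokens))  # count each term once per document
    n_docs = len(tokenized_docs)
    return {"n_docs": n_docs, "avgdl": total_len / n_docs, "df": dict(df)}


def idf(term, params):
    """Lucene-style BM25 IDF; a term never seen during fitting gets df = 0."""
    n, df = params["n_docs"], params["df"].get(term, 0)
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


# Offline: fit once on the corpus and persist the statistics.
corpus = [["fire", "resistant", "board"], ["antibacterial", "board"], ["easy", "clean"]]
path = os.path.join(tempfile.mkdtemp(), "bm25_params.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(fit_stats(corpus), f)

# Query time: load the saved statistics instead of refitting.
with open(path, encoding="utf-8") as f:
    params = json.load(f)

# Rarer terms receive a larger weight than common ones.
query_weights = {t: idf(t, params) for t in ["fire", "board"]}
print(query_weights)
```

The saved file only needs to be regenerated when the corpus has changed enough that the stored statistics no longer represent it.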

yidasanqian commented 1 week ago

These documents take up about 32 GB of memory. I need to load them all into memory, then execute fit, and finally call save, right? Do I need to do this process every time I add a document? Is there a way to incrementally update the parameters?

wxywb commented 1 week ago

Yes. Incremental updates for BM25 are not supported at the moment, but they are planned. Milvus will also support BM25 natively; please stay tuned.
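
For context on why incremental updates are feasible in principle: the statistics BM25 depends on (document count, total length, per-term document frequency) are all simple sums, so they can be accumulated one document at a time without holding the corpus in memory. The sketch below is hypothetical and is not a milvus-model feature; `RunningBM25Stats` and `stream_docs` are illustrative names, and the documents are pre-tokenized toy data.

```python
import math
from collections import Counter


class RunningBM25Stats:
    """Accumulates BM25 corpus statistics one document at a time."""

    def __init__(self):
        self.n_docs = 0
        self.total_len = 0
        self.df = Counter()

    def update(self, tokens):
        """Fold one tokenized document into the running statistics."""
        self.n_docs += 1
        self.total_len += len(tokens)
        self.df.update(set(tokens))  # count each term once per document

    @property
    def avgdl(self):
        return self.total_len / self.n_docs if self.n_docs else 0.0

    def idf(self, term):
        """Lucene-style BM25 IDF computed from the current counts."""
        df = self.df.get(term, 0)
        return math.log((self.n_docs - df + 0.5) / (df + 0.5) + 1.0)


def stream_docs():
    """Stand-in for reading tokenized documents from disk in batches."""
    yield ["fire", "resistant", "board"]
    yield ["antibacterial", "board"]


stats = RunningBM25Stats()
for tokens in stream_docs():
    stats.update(tokens)

# A document that arrives later only touches the counters.
stats.update(["easy", "clean", "board"])
print(stats.n_docs, round(stats.avgdl, 3), stats.df["board"])
```

One caveat with any scheme like this: vectors already stored were encoded with the older statistics, so they gradually drift from what a full refit would produce until they are re-encoded.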