Open gfoxlin opened 6 months ago
Using hybrid retrieval, building the file index is very slow.
Hey there, @gfoxlin! I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer. Let's tackle this together! 🤖
To address the issues you're encountering with Milvus when working with different file sizes and using hybrid retrieval, follow these steps:

1. **Ensure collection creation and loading:** For small text files, make sure the collection is loaded after creation. A collection must be loaded before it can be queried.
2. **Handle dynamic field length:** For larger PDF files, the error indicates that the length of the dynamic field exceeds the maximum allowed length. Ensure that the dynamic field does not exceed 65,535 characters.
3. **Enable hybrid retrieval:** Hybrid retrieval requires Milvus 2.4.0 or later and a properly configured sparse embedding function.
Here is the relevant code to address these issues:

```python
# Create the collection if it does not exist
if collection_name not in self.client.list_collections():
    if dim is None:
        raise ValueError("Dim argument required for collection creation.")
    if self.enable_sparse is False:
        self._milvusclient.create_collection(
            collection_name=collection_name,
            dimension=dim,
            primary_field_name=MILVUS_ID_FIELD,
            vector_field_name=embedding_field,
            id_type="string",
            metric_type=self.similarity_metric,
            max_length=65_535,
            consistency_level=consistency_level,
        )
    else:
        try:
            _ = DataType.SPARSE_FLOAT_VECTOR
        except Exception as e:
            logger.error(
                "Hybrid retrieval is only supported in Milvus 2.4.0 or later."
            )
            raise NotImplementedError(
                "Hybrid retrieval requires Milvus 2.4.0 or later."
            ) from e
        self._create_hybrid_index(collection_name)

if self.enable_sparse is False:
    self._collection = Collection(
        collection_name, using=self._milvusclient._using
    )
else:
    host, port = extract_host_port(uri)
    connections.connect("default", host=host, port=port)
    self._collection = Collection(collection_name)

self._create_index_if_required()

self.enable_sparse = enable_sparse
if self.enable_sparse is True and sparse_embedding_function is None:
    logger.warning("Sparse embedding function is not provided, using default.")
    self.sparse_embedding_function = get_defualt_sparse_embedding_function()
elif self.enable_sparse is True and sparse_embedding_function is not None:
    self.sparse_embedding_function = sparse_embedding_function

logger.debug(f"Successfully created a new collection: {self.collection_name}")
```
A hybrid index is created when `enable_sparse` is set to `True`, and any existing collection is dropped and recreated when `overwrite` is set to `True`.

To optimize performance for large PDF files, configure the `SentenceWindowNodeParser` with appropriate parameters:
```python
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,                # smaller window size for less metadata
    show_progress=True,           # show progress for large documents
    include_metadata=False,       # exclude unnecessary metadata
    include_prev_next_rel=False,  # exclude prev/next relationships
)
```
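As a rough back-of-envelope check (a sketch added for illustration, not part of the original answer), the "window" metadata stores the original sentence plus `window_size` sentences on each side, so the stored text grows linearly with the window:

```python
def window_chars(avg_sentence_len: int, window_size: int) -> int:
    """Approximate characters stored in the "window" metadata field:
    the original sentence plus window_size sentences on each side."""
    return (2 * window_size + 1) * avg_sentence_len

# e.g. 400-char sentences with window_size=2 store about 2000 chars per node
print(window_chars(400, 2))
```

This is why a smaller `window_size` (and smaller chunks) keeps the per-node metadata well under the Milvus field limit.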
For hybrid retrieval, ensure that the Milvus version is 2.4.0 or later and that the sparse embedding function is properly configured. You can refer to the Milvus Hybrid Index Demo for a practical example of using Milvus with hybrid retrieval [1][2][3].
By following these steps and best practices, you should be able to resolve the issues and optimize the performance of your Milvus setup.
To continue the conversation, mention @dosu.
@dosu How do I fix the error "the length (79145) of dynamic field exceeds max length (65536)"? Which params can limit the output length?
To fix the error "the length (79145) of dynamic field exceeds max length (65536)" in Milvus, you need to adjust the `max_length` parameter of the dynamic field so that it does not exceed the maximum allowed length of 65,535. Here is the relevant part of the code where you need to make sure `max_length` is set correctly:
```python
def __init__(
    self,
    uri: str = "http://localhost:19530",
    token: str = "",
    collection_name: str = "llamalection",
    dim: Optional[int] = None,
    embedding_field: str = DEFAULT_EMBEDDING_KEY,
    doc_id_field: str = DEFAULT_DOC_ID_KEY,
    similarity_metric: str = "IP",
    consistency_level: str = "Strong",
    overwrite: bool = False,
    text_key: Optional[str] = None,
    index_config: Optional[dict] = None,
    search_config: Optional[dict] = None,
    **kwargs: Any,
) -> None:
    """Init params."""
    import_err_msg = (
        "`pymilvus` package not found, please run `pip install pymilvus`"
    )
    try:
        import pymilvus  # noqa
    except ImportError:
        raise ImportError(import_err_msg)
    from pymilvus import Collection, MilvusClient

    self.collection_name = collection_name
    self.dim = dim
    self.embedding_field = embedding_field
    self.doc_id_field = doc_id_field
    self.consistency_level = consistency_level
    self.overwrite = overwrite
    self.text_key = text_key
    self.index_config: Dict[str, Any] = index_config.copy() if index_config else {}
    self.search_config: Dict[str, Any] = (
        search_config.copy() if search_config else {}
    )

    if similarity_metric.lower() in ("ip",):
        self.similarity_metric = "IP"
    elif similarity_metric.lower() in ("l2", "euclidean"):
        self.similarity_metric = "L2"

    self.milvusclient = MilvusClient(
        uri=uri,
        token=token,
        **kwargs,
    )

    if self.overwrite and self.collection_name in self.client.list_collections():
        self.milvusclient.drop_collection(self.collection_name)

    if self.collection_name not in self.client.list_collections():
        if self.dim is None:
            raise ValueError("Dim argument required for collection creation.")
        self.milvusclient.create_collection(
            collection_name=self.collection_name,
            dimension=self.dim,
            primary_field_name=MILVUS_ID_FIELD,
            vector_field_name=self.embedding_field,
            id_type="string",
            metric_type=self.similarity_metric,
            max_length=65_535,  # Ensure this value is 65535 or less
            consistency_level=self.consistency_level,
        )

    self.collection = Collection(
        self.collection_name, using=self.milvusclient._using
    )
    self._create_index_if_required()

    logger.debug(f"Successfully created a new collection: {self.collection_name}")
```
Ensure that the `max_length` parameter for the `id` field is set to 65535 or less to avoid exceeding the maximum allowed length [1][2].
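Adjusting the schema alone is not enough: the data itself must also fit. As an illustrative workaround (a hypothetical helper, not code from this thread), each node's text could be clipped to Milvus's 65,535-character string limit before insertion:

```python
MILVUS_MAX_VARCHAR_LEN = 65_535  # hard limit for a string field in Milvus

def clip_to_milvus_limit(text: str, max_len: int = MILVUS_MAX_VARCHAR_LEN) -> str:
    """Truncate text so it fits within a Milvus string field."""
    return text if len(text) <= max_len else text[:max_len]

print(len(clip_to_milvus_limit("a" * 80_000)))  # → 65535
```

Truncation loses data, though, so splitting into smaller chunks (as discussed below in the thread) is usually the better fix.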
Hi @gfoxlin, as for Question 2, 65535 is the hard limit of a string in Milvus. Therefore, we should ensure that the length of each TextNode does not exceed this limit.
I don't know the details of your PDF, but I guess the text of some nodes you obtained has already exceeded this limit. You can try to print the text length of each node as below:
```python
...
nodes = node_parser.get_nodes_from_documents(documents)
for i, node in enumerate(nodes):
    print(f"Text length of node {i}: {len(node.text)}")
...
```
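The same check can be sketched as a small self-contained helper (names here are illustrative, not from the thread):

```python
def oversized_indexes(texts: list[str], limit: int = 65_535) -> list[int]:
    """Return the indexes of texts that would exceed a Milvus string field."""
    return [i for i, t in enumerate(texts) if len(t) > limit]

print(oversized_indexes(["short", "x" * 70_000]))  # → [1]
```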
If that's the case, there might be some problem with your PDF when using the default `sentence_splitter`. You could do some analysis on your PDF and try out alternative splitters by explicitly specifying one:
```python
node_parser = SentenceWindowNodeParser.from_defaults(
    ...
    sentence_splitter=<YOUR-SPLITTER-HERE>,
    ...
)
```
Hi @RussellLuo, I changed the code like this, but it still didn't work:
```python
node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=2,
    sentence_splitter=SentenceSplitter(chunk_size=400, chunk_overlap=20),
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)
```
```
Loading Embedder...
Parsing nodes:   0%|▎         | 1/237 [00:00<00:00, 5447.15it/s]
Traceback (most recent call last):
  File "/home/mishulin/llm-gen-report/app/core/rag/loader.py", line 149, in
```
To make `node_parser.get_nodes_from_documents()` work properly, you need to pass a list of `Document` objects (rather than a list of `str`) as the first parameter. For convenience, you can leverage `SimpleDirectoryReader`:

```python
documents = SimpleDirectoryReader("path/to/directory").load_data()
```
@RussellLuo Thanks! I can't find any problem in the code myself; please help me check it all. Is there anything wrong with the code below?
```python
# The models have been downloaded to the local 'models' folder:
# 1. embedding model = "models/bge-small-zh-v1.5"
# 2. BGEM3FlagModel("models/bge-m3", use_fp16=False)

documents = SimpleDirectoryReader("datasets/").load_data()

milvus_vector_store = MilvusVectorStore(
    uri=CFG.milvus_uri,
    collection_name=CFG.collection_name,
    dim=512,
    overwrite=True,
    enable_sparse=True,
    # as in the llama-index Milvus example
    sparse_embedding_function=ExampleEmbeddingFunction(),
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
)

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,
    # Last question: as you suggested, I added this line, then got the error
    # (AttributeError: 'str' object has no attribute 'id_')
    sentence_splitter=SentenceSplitter(chunk_size=400, chunk_overlap=20),
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)
for i, node in enumerate(nodes):
    print(f"Text length of node {i}: {len(node.text)}")
    if len(node.text) > 65535:
        # none exceed the limit
        break

storage_context = StorageContext.from_defaults(vector_store=milvus_vector_store)
index = VectorStoreIndex(nodes, embed_model=embed_model, storage_context=storage_context, show_progress=True)
```
@gfoxlin Try this instead:
```python
node_parser = SentenceWindowNodeParser.from_defaults(
    ...
    # 1. This parameter only accepts a function whose signature is `(str) -> list[str]`
    # 2. Try a smaller `chunk_size`, since the final length Milvus gets is N times greater than `chunk_size`
    sentence_splitter=SentenceSplitter(chunk_size=100, chunk_overlap=20).split_text,
    ...
)
```
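To illustrate the required signature (this naive splitter is a hypothetical stand-in for demonstration, not the llama-index `SentenceSplitter`), any plain function from `str` to `list[str]` works:

```python
import re

def naive_sentence_splitter(text: str) -> list[str]:
    """A minimal (str) -> list[str] splitter: break on sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(naive_sentence_splitter("One. Two? Three!"))  # → ['One.', 'Two?', 'Three!']
```

Passing a `SentenceSplitter` *object* instead of a function with this signature is what triggers the `AttributeError: 'str' object has no attribute 'id_'` seen above.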
The reason for the second comment is complicated. Milvus hybrid retrieval (i.e. `enable_sparse=True`) will enable the dynamic field.

As shown above, this feature causes all undefined fields (i.e. fields whose names are not "id", "embedding" or "sparse_embedding") to be merged into a single dynamic field. Therefore, it is possible for the length of this merged field to exceed 65535.
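A rough way to see this (an illustrative sketch, not code from the thread): since the undeclared fields are merged into one JSON dynamic field, its length is approximately the serialized size of all the extra metadata combined:

```python
import json

def approx_dynamic_field_len(extra_metadata: dict) -> int:
    """Approximate length of Milvus's merged JSON dynamic field."""
    return len(json.dumps(extra_metadata, ensure_ascii=False))

# individually under the limit, but the merged field exceeds 65535
meta = {"window": "x" * 60_000, "original_sentence": "y" * 10_000}
print(approx_dynamic_field_len(meta) > 65_535)  # → True
```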
I encountered the exact same problem as you.
I encountered the same problem; @RussellLuo's answer is very helpful, thanks a lot.
I also got this error when saving nodes produced by LlamaParse with LlamaParseJsonNodeParser() into a Milvus database created by MilvusVectorStore().
How to resolve this if I still want to keep using Milvus database?
I have the same issue, what is the solution?
I just posted a question which is also similar to this problem. On my end I am not working with texts, I am working with images. https://github.com/run-llama/llama_index/issues/16763
Hi @bubl-ai, I have checked and it seems similar to my issue. The strange thing is that it doesn't happen in Milvus Lite, but it does happen in Milvus; I suppose that is because Lite usually runs on a laptop and there is no issue with the size. My first approach was reducing the size of the metadata, but it didn't work, and I thought that maybe there was some parameter configuration involved; I will have to try again.
I'm still also considering two possible workarounds:
Bug Description
Question 1: Using a small txt file: Milvus calls the function `_create_hybrid_index()`, but `self._collection.load()` is never called, so the collection can't be retrieved.
Question 2: Using a slightly bigger PDF file: calling the create-index function `VectorStoreIndex(nodes, embed_model=embed_model, transformers=transformers, storage_context=storage_context)` reports the error: `MilvusException: (code=1100, message=the length (79145) of dynamic field exceeds max length (65536): invalid parameter[expected=valid length dynamic field][actual=length exceeds max length`
Version
0.10.38
Steps to Reproduce
milvus
create index
Relevant Logs/Tracebacks
No response