run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]:<MilvusException:(code=1100,message=the length (79145) of dynamic field exceeds max length (65536): invalid parameter[expected=valid length dynamic field][actual=length exceeds max length #13752

Open gfoxlin opened 1 month ago

gfoxlin commented 1 month ago

Bug Description

Question 1: With a small txt file, the Milvus vector store calls '_create_hybrid_index()', but 'self._collection.load()' is never called on the collection, so the collection cannot be queried. Question 2: With a somewhat larger PDF file, creating the index with 'VectorStoreIndex(nodes,embed_model=embed_model, transformers=transformers,storage_context=storage_context)' fails with this error: <MilvusException:(code=1100,message=the length (79145) of dynamic field exceeds max length (65536): invalid parameter[expected=valid length dynamic field][actual=length exceeds max length

Version

0.10.38

Steps to Reproduce

milvus

milvus_vector_store = MilvusVectorStore(
    uri=CFG.milvus_uri,
    collection_name=CFG.collection_name,
    dim=512,
    overwrite=args.overwrite,
    enable_sparse=True,
    sparse_embedding_function=ExampleEmbeddingFunction(),
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
)

# node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)
nodes = node_parser.get_nodes_from_documents(documents)
print("Node length = " + str(len(nodes)))  # Output: Node length = 932

create index

index = VectorStoreIndex(nodes,embed_model=embed_model, transformers=transformers,storage_context=storage_context)

Relevant Logs/Tracebacks

No response

gfoxlin commented 1 month ago

Also, hybrid retrieval over the file index is very, very slow.

dosubot[bot] commented 1 month ago

Hey there, @gfoxlin! I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer. Let's tackle this together! 🤖

To address the issues you're encountering with Milvus when working with different file sizes and using hybrid retrieval, follow these steps:

Issue 1: Small Text Files

For small text files, ensure that the collection is properly loaded after creation. The collection must be loaded to be retrievable.

Issue 2: Larger PDF Files

For larger PDF files, the error indicates that the length of the dynamic field exceeds the maximum allowed length. You need to ensure that the dynamic field length does not exceed 65536.

Steps to Resolve

  1. Ensure Collection Creation and Loading:

    • Ensure the collection is created with the required dimensions and fields.
    • Load the collection after creation.
  2. Handle Dynamic Field Length:

    • Ensure that the dynamic field length does not exceed the maximum length of 65536.
  3. Enable Hybrid Retrieval:

    • Ensure hybrid retrieval is only enabled if Milvus version is 2.4.0 or later.
    • Use the appropriate sparse embedding function if hybrid retrieval is enabled.

Here is the relevant code to address these issues:

# Create the collection if it does not exist
if collection_name not in self.client.list_collections():
    if dim is None:
        raise ValueError("Dim argument required for collection creation.")
    if self.enable_sparse is False:
        self._milvusclient.create_collection(
            collection_name=collection_name,
            dimension=dim,
            primary_field_name=MILVUS_ID_FIELD,
            vector_field_name=embedding_field,
            id_type="string",
            metric_type=self.similarity_metric,
            max_length=65_535,
            consistency_level=consistency_level,
        )
    else:
        try:
            _ = DataType.SPARSE_FLOAT_VECTOR
        except Exception as e:
            logger.error(
                "Hybrid retrieval is only supported in Milvus 2.4.0 or later."
            )
            raise NotImplementedError(
                "Hybrid retrieval requires Milvus 2.4.0 or later."
            ) from e
        self._create_hybrid_index(collection_name)

if self.enable_sparse is False:
    self._collection = Collection(
        collection_name, using=self._milvusclient._using
    )
else:
    host, port = extract_host_port(uri)
    connections.connect("default", host=host, port=port)
    self._collection = Collection(collection_name)

self._create_index_if_required()

self.enable_sparse = enable_sparse
if self.enable_sparse is True and sparse_embedding_function is None:
    logger.warning("Sparse embedding function is not provided, using default.")
    self.sparse_embedding_function = get_defualt_sparse_embedding_function()
elif self.enable_sparse is True and sparse_embedding_function is not None:
    self.sparse_embedding_function = sparse_embedding_function
else:
    pass

logger.debug(f"Successfully created a new collection: {self.collection_name}")

Additional Best Practices

Example Configuration for SentenceWindowNodeParser

To optimize performance for large PDF files, configure the SentenceWindowNodeParser with appropriate parameters:

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,  # Smaller window size for less metadata
    # show_progress is not accepted here; pass show_progress=True to get_nodes_from_documents() instead
    include_metadata=False,  # Exclude unnecessary metadata
    include_prev_next_rel=False  # Exclude prev/next relationships
)
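
To see why a smaller window_size means less metadata, here is a minimal pure-Python sketch of how a sentence window grows. It only approximates the parser's behaviour, assuming the window is the current sentence plus up to window_size neighbours on each side, joined with spaces. build_windows is a hypothetical helper, not part of LlamaIndex.

```python
from typing import List


def build_windows(sentences: List[str], window_size: int) -> List[str]:
    """Return, for each sentence, the text of its surrounding window.

    Hypothetical helper: approximates a sentence-window parser by joining
    the current sentence with up to `window_size` neighbours on each side.
    """
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


# Stand-in data: ten short sentences.
sentences = [f"Sentence number {n}." for n in range(10)]

for size in (1, 2, 3):
    longest = max(len(w) for w in build_windows(sentences, size))
    print(f"window_size={size}: longest window has {longest} characters")
```

Each window holds up to 2 * window_size + 1 sentences, so the window text stored per node grows roughly linearly with window_size (and with the length of each "sentence" the splitter returns).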

Hybrid Retrieval Performance

For hybrid retrieval, ensure that the Milvus version is 2.4.0 or later and that the sparse embedding function is properly configured. You can refer to the Milvus Hybrid Index Demo for a practical example of using Milvus with hybrid retrieval [1][2][3].
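
A minimal sketch of the version check described above, written in plain Python. parse_version and supports_hybrid are hypothetical helpers; the version string is assumed to look like "2.4.0" or "v2.4.1" and would have to be obtained from your own deployment (for example via pymilvus's utility.get_server_version(), which is not called here).

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as 'v2.4.1' or '2.3' into a tuple of ints.

    Any suffix after a '-' (for example '-dev') is ignored, and missing
    components are padded with zeros so that '2.4' compares equal to '2.4.0'.
    """
    core = version.strip().lstrip("vV").split("-")[0]
    parts = []
    for piece in core.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_hybrid(version: str, minimum: Tuple[int, ...] = (2, 4, 0)) -> bool:
    """Return True if `version` is at least `minimum` (2.4.0 by default)."""
    return parse_version(version) >= minimum


for candidate in ("2.3.9", "v2.4.0", "2.4.1-dev"):
    print(candidate, supports_hybrid(candidate))
```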

By following these steps and best practices, you should be able to resolve the issues and optimize the performance of your Milvus setup.

About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

gfoxlin commented 1 month ago

@dosu How to fix the error: the length (79145) of dynamic field exceeds max length (65536)? Which params can limit the length output?

dosubot[bot] commented 1 month ago

To fix the error "the length (79145) of dynamic field exceeds max length (65536)" in Milvus, you need to adjust the max_length parameter of the dynamic field to ensure it does not exceed the maximum allowed length of 65536.

Here is the relevant part of the code where you need to make sure the max_length is set correctly:

def __init__(
        self,
        uri: str = "http://localhost:19530",
        token: str = "",
        collection_name: str = "llamalection",
        dim: Optional[int] = None,
        embedding_field: str = DEFAULT_EMBEDDING_KEY,
        doc_id_field: str = DEFAULT_DOC_ID_KEY,
        similarity_metric: str = "IP",
        consistency_level: str = "Strong",
        overwrite: bool = False,
        text_key: Optional[str] = None,
        index_config: Optional[dict] = None,
        search_config: Optional[dict] = None,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        import_err_msg = (
            "`pymilvus` package not found, please run `pip install pymilvus`"
        )
        try:
            import pymilvus  # noqa
        except ImportError:
            raise ImportError(import_err_msg)

        from pymilvus import Collection, MilvusClient

        self.collection_name = collection_name
        self.dim = dim
        self.embedding_field = embedding_field
        self.doc_id_field = doc_id_field
        self.consistency_level = consistency_level
        self.overwrite = overwrite
        self.text_key = text_key
        self.index_config: Dict[str, Any] = index_config.copy() if index_config else {}
        self.search_config: Dict[str, Any] = (
            search_config.copy() if search_config else {}
        )

        if similarity_metric.lower() in ("ip"):
            self.similarity_metric = "IP"
        elif similarity_metric.lower() in ("l2", "euclidean"):
            self.similarity_metric = "L2"

        self.milvusclient = MilvusClient(
            uri=uri,
            token=token,
            **kwargs,
        )

        if self.overwrite and self.collection_name in self.client.list_collections():
            self.milvusclient.drop_collection(self.collection_name)

        if self.collection_name not in self.client.list_collections():
            if self.dim is None:
                raise ValueError("Dim argument required for collection creation.")
            self.milvusclient.create_collection(
                collection_name=self.collection_name,
                dimension=self.dim,
                primary_field_name=MILVUS_ID_FIELD,
                vector_field_name=self.embedding_field,
                id_type="string",
                metric_type=self.similarity_metric,
                max_length=65_535,  # Ensure this value is 65535 or less
                consistency_level=self.consistency_level,
            )

        self.collection = Collection(
            self.collection_name, using=self.milvusclient._using
        )
        self._create_index_if_required()

        logger.debug(f"Successfully created a new collection: {self.collection_name}")

Ensure that the max_length parameter for the id field is set to 65535 or less to avoid exceeding the maximum allowed length [1][2].

RussellLuo commented 1 month ago

Hi @gfoxlin, as for Question 2, 65535 is the hard limit of a string in Milvus. Therefore, we should ensure that the length of each TextNode does not exceed this limit.

I don't know the details of your PDF, but I guess the text in some of the nodes you obtained already exceeds 65535. You can print the text length of each node as shown below:

...

nodes = node_parser.get_nodes_from_documents(documents)
for i, node in enumerate(nodes):
    print(f"Text length of node {i}: {len(node.text)}")

...
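
As far as I know, Milvus counts string limits in bytes rather than characters, so len(node.text) can understate the size of non-ASCII text such as Chinese. A minimal sketch that reports which texts are over the limit by UTF-8 size; find_oversized is a hypothetical helper, and the texts list is stand-in data for [node.text for node in nodes].

```python
from typing import List, Tuple

MAX_BYTES = 65535  # string limit quoted in this thread


def find_oversized(texts: List[str], limit: int = MAX_BYTES) -> List[Tuple[int, int]]:
    """Return (index, byte_length) for every text whose UTF-8 size exceeds `limit`."""
    oversized = []
    for index, text in enumerate(texts):
        size = len(text.encode("utf-8"))
        if size > limit:
            oversized.append((index, size))
    return oversized


# Stand-in data; with LlamaIndex this would be [node.text for node in nodes].
texts = ["short sentence", "中" * 30000, "a" * 30000]

for index, size in find_oversized(texts):
    print(f"Text {index} is {size} bytes, over the {MAX_BYTES}-byte limit")
```

In the stand-in data, the second and third texts have the same number of characters, but only the Chinese one is flagged, because each of its characters takes three bytes in UTF-8.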

If that's the case, there might be some problem with your PDF when using the default sentence_splitter:

https://github.com/run-llama/llama_index/blob/038d5105b684e5286b5771e7722ad3a9e3e8ec75/llama-index-core/llama_index/core/node_parser/text/sentence_window.py#L70

You could analyze your PDF and try alternative splitters by specifying one explicitly:

node_parser = SentenceWindowNodeParser.from_defaults(
    ...
    sentence_splitter=<YOUR-SPLITTER-HERE>,
    ...
)

gfoxlin commented 1 month ago

Hi @RussellLuo, I changed the code as shown below, but it still doesn't work.

        node_parser = SentenceWindowNodeParser.from_defaults(
            # how many sentences on either side to capture
            window_size=2,
            sentence_splitter=SentenceSplitter(chunk_size=400, chunk_overlap=20),
            # the metadata key that holds the window of surrounding sentences
            window_metadata_key="window",
            # the metadata key that holds the original sentence
            original_text_metadata_key="original_sentence",
        )

Error info

Loading Embedder...
Parsing nodes: 0%|▎ | 1/237 [00:00<00:00, 5447.15it/s]
Traceback (most recent call last):
  File "/home/mishulin/llm-gen-report/app/core/rag/loader.py", line 149, in <module>
    data_loader.ingest(embed_model=embed_model, llm=llm)
  File "/home/mishulin/llm-gen-report/app/core/rag/loader.py", line 98, in ingest
    nodes = node_parser.get_nodes_from_documents(documents,show_progress=True)
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 129, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/text/sentence_window.py", line 96, in _parse_nodes
    nodes = self.build_window_nodes_from_documents([node])
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/text/sentence_window.py", line 108, in build_window_nodes_from_documents
    text_splits = self.sentence_splitter(text)
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 157, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 124, in get_nodes_from_documents
    doc_id_to_document = {doc.id_: doc for doc in documents}
  File "/home/mishulin/.conda/envs/llamaindex310/lib/python3.10/site-packages/llama_index/core/node_parser/interface.py", line 124, in <dictcomp>
    doc_id_to_document = {doc.id_: doc for doc in documents}
AttributeError: 'str' object has no attribute 'id_'

RussellLuo commented 1 month ago

To make node_parser.get_nodes_from_documents() work properly, you need to pass a list of Document objects (rather than a list of str) as the first parameter. For convenience, you can use SimpleDirectoryReader:

documents = SimpleDirectoryReader("path/to/directory").load_data()

gfoxlin commented 1 month ago

@RussellLuo Thanks! I can't find any problem in my code. Could you please help me check all of it? Is anything wrong with it?

# The models have been downloaded to the local 'models' folder:
# 1. embedding model = "models/bge-small-zh-v1.5"
# 2. BGEM3FlagModel("models/bge-m3", use_fp16=False)

documents = SimpleDirectoryReader("datasets/").load_data()

milvus_vector_store = MilvusVectorStore(
    uri=CFG.milvus_uri,
    collection_name=CFG.collection_name,
    dim=512,
    overwrite=True,
    enable_sparse=True,
    # same as in the llama-index Milvus example
    sparse_embedding_function=ExampleEmbeddingFunction(),
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
)

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,
    # Added this line as you suggested last time; it then raises
    # AttributeError: 'str' object has no attribute 'id_'
    sentence_splitter=SentenceSplitter(chunk_size=400, chunk_overlap=20),
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)
for i, node in enumerate(nodes):
    print(f"Text length of node {i}: {len(node.text)}")
    if len(node.text) > 65535:
        # never reached: no node exceeds the limit
        break
storage_context = StorageContext.from_defaults(vector_store=milvus_vector_store)
index = VectorStoreIndex(nodes, embed_model=embed_model, storage_context=storage_context, show_progress=True)

RussellLuo commented 1 month ago

@gfoxlin Try this instead:

node_parser = SentenceWindowNodeParser.from_defaults(
    ...
    # 1. This parameter only accepts a function whose signature is `(str) -> list[str]`
    # 2. Try a smaller `chunk_size`, since the length Milvus finally receives is N times `chunk_size`
    sentence_splitter=SentenceSplitter(chunk_size=100, chunk_overlap=20).split_text,
    ...
)
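
To illustrate the first comment: any plain function with the signature `(str) -> list[str]` fits that parameter. Below is a minimal sketch of such a function in pure Python. simple_sentence_splitter and its max_chars cap are hypothetical, not part of LlamaIndex, and the punctuation-based rule is a deliberately naive assumption.

```python
import re
from typing import List


def simple_sentence_splitter(text: str, max_chars: int = 200) -> List[str]:
    """Split text into sentences, then cap each piece at `max_chars` characters.

    Hypothetical example of a `(str) -> list[str]` callable. It splits after
    ASCII and CJK sentence-ending punctuation and drops empty pieces.
    """
    pieces = re.split(r"(?<=[.!?。！？])\s*", text)
    result = []
    for piece in pieces:
        piece = piece.strip()
        # Hard-wrap anything still too long, e.g. text without punctuation.
        for start in range(0, len(piece), max_chars):
            result.append(piece[start:start + max_chars])
    return result


for sentence in simple_sentence_splitter("Hello world. How are you? Fine"):
    print(repr(sentence))
```

A function like this could be passed as sentence_splitter=simple_sentence_splitter. The hard cap matters for PDFs whose extracted text has little or no sentence punctuation, where a purely punctuation-based splitter can return one very long "sentence".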

The reason for the second comment is more involved. Milvus hybrid retrieval (i.e. enable_sparse=True) enables the dynamic field:

https://github.com/run-llama/llama_index/blob/03bb37fa2334a9a02c47363f64980602ac747d22/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L567-L584

As shown above, this feature will cause all undefined fields (i.e. fields whose names are not "id", "embedding" or "sparse_embedding") to be merged into a single dynamic field. Therefore, it is possible for the length of this merged field to exceed 65535.
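
A rough way to estimate this before inserting is sketched below. It assumes the undefined fields are serialized together as one JSON object, which may not match the exact encoding Milvus uses; dynamic_field_size and the field names in the stand-in entity are hypothetical.

```python
import json
from typing import Any, Dict

DEFINED_FIELDS = {"id", "embedding", "sparse_embedding"}  # names quoted in this thread
MAX_DYNAMIC_FIELD_BYTES = 65536  # limit taken from the error message


def dynamic_field_size(entity: Dict[str, Any]) -> int:
    """Estimate the byte size of everything that would land in the dynamic field.

    Assumption: all fields outside DEFINED_FIELDS are merged into a single
    JSON object; the real encoding used by Milvus may differ slightly.
    """
    extra = {key: value for key, value in entity.items() if key not in DEFINED_FIELDS}
    return len(json.dumps(extra, ensure_ascii=False).encode("utf-8"))


# Stand-in entity: a short sentence whose window metadata is much larger.
entity = {
    "id": "node-1",
    "embedding": [0.0] * 512,
    "text": "One sentence.",
    "window": "x" * 70000,
    "original_sentence": "One sentence.",
}

size = dynamic_field_size(entity)
if size > MAX_DYNAMIC_FIELD_BYTES:
    print(f"Dynamic field would be about {size} bytes, over the {MAX_DYNAMIC_FIELD_BYTES} limit")
```

This shows how a node can fail even when len(node.text) is small: the text, the window and any other metadata all count toward the same merged field.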

weiweizwc98 commented 4 weeks ago

I encountered the exact same problem as you.

zydmtaichi commented 2 weeks ago

I encountered the same problem. @RussellLuo's answer is very helpful, thanks a lot.