run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: SentenceWindowNodeParser ignores max_length of embed model #14148

Open Bearsaerker opened 2 months ago

Bearsaerker commented 2 months ago


Question

I have a problem with the SentenceWindowNodeParser and I think it has to do with my configuration, though I'm not entirely sure whether it's a bug. I increased the window size to 14 so that large chunks are passed into the LLM context, but some passages are very big. I therefore tried to set max_length to 2000 tokens on the embed model, because otherwise I get an error that the input token count is exceeded.

This is my little snippet in which I tried to accomplish this.

from transformers import AutoTokenizer

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=14,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()

Settings.embed_model = HuggingFaceEmbedding(
    model_name="/home/luis/Dokumente/ChatWithDoc English/Models/HuggingFace/bge-m3",
    use_fp16=True,
    embed_batch_size=2,
    max_length=2000,
)
# Settings.embed_model = OptimumEmbedding(folder_name="/home/luis/Dokumente/ChatWithDoc English/Models/Onnx/bge-m3")
Settings.text_splitter = text_splitter
Settings.tokenizer = AutoTokenizer.from_pretrained(
    "/home/luis/Dokumente/ChatWithDoc English/Models/Tokenizer/llama"
)

Edit: The problem is that some chunks are still 2000+ tokens long, which triggers the error that there are too many tokens. Just to clarify again: is there something wrong with my setup?
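
For reference, here is roughly how the oversized windows can be detected after parsing (a quick sketch; `documents` is a placeholder for whatever documents were loaded):

# Sketch: measure the token length of each node's window metadata.
tokenizer = AutoTokenizer.from_pretrained(
    "/home/luis/Dokumente/ChatWithDoc English/Models/Tokenizer/llama"
)
nodes = node_parser.get_nodes_from_documents(documents)
for node in nodes:
    window_text = node.metadata["window"]
    n_tokens = len(tokenizer.encode(window_text, add_special_tokens=False))
    if n_tokens > 2000:
        print(f"window exceeds budget: {n_tokens} tokens")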

dosubot[bot] commented 2 months ago

The issue you're encountering with the SentenceWindowNodeParser and the max_length parameter in the HuggingFaceEmbedding model is likely due to the way the SentenceWindowNodeParser handles large passages. The SentenceWindowNodeParser splits the document into nodes, each containing a window of surrounding sentences based on the window_size parameter. However, it excludes the window metadata and original text metadata from the embedding and LLM processes to avoid exceeding the max_length parameter.

Here's a breakdown of the relevant configurations and considerations:

  1. Window Size: Setting a large window_size (e.g., 14) can result in very large chunks, which might exceed the max_length parameter of the embedding model.

  2. Excluding Metadata: The SentenceWindowNodeParser excludes the window metadata and original text metadata from embedding and LLM processes to manage the token count effectively.

  3. Max Length: The max_length parameter in the HuggingFaceEmbedding model specifies the maximum length of the input. If this length is exceeded, it can cause errors.

Here is the relevant code snippet from the SentenceWindowNodeParser:

# exclude window metadata from embed and llm
node.excluded_embed_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)
node.excluded_llm_metadata_keys.extend(
    [self.window_metadata_key, self.original_text_metadata_key]
)

To address your issue, ensure that the max_length parameter is appropriately set and that the SentenceWindowNodeParser is configured to handle large passages without exceeding this limit. You might need to adjust the window_size or further split the text to manage the token count effectively.
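
For example, a smaller window or a pre-split of very long passages keeps each window within budget (a minimal sketch; the values shown are illustrative, not recommendations):

from llama_index.core.node_parser import SentenceSplitter, SentenceWindowNodeParser

# A smaller window bounds how many sentences are concatenated per node.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=5,  # illustrative; tune against the embed model's max_length
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# Pre-splitting long passages keeps any single "sentence" from dominating the window.
text_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)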

By carefully configuring these parameters, you can manage large passages and avoid exceeding the max_length parameter [1][2].


Bearsaerker commented 2 months ago

@dosu Did I misunderstand, then? Is max_length only related to the embed model's maximum input length, and not to the sentence window's maximum length? That would explain why the chunks are still 2000+ tokens: the setting only applies to the embed model, not to the length of the sentence window node.

dosubot[bot] commented 2 months ago

Yes, you are correct. The max_length parameter in the HuggingFaceEmbedding model only controls the maximum length of the input for the embedding model itself. It does not directly limit the length of the sentence window nodes created by the SentenceWindowNodeParser.

The SentenceWindowNodeParser does not enforce a maximum token length on the nodes it generates. It simply concatenates the windowed sentences and stores them in the metadata without checking the length against any maximum limit [1].

To ensure that the nodes do not exceed the max_length parameter, you would need to implement additional logic to split or truncate the nodes accordingly before passing them to the embedding model. This could involve adjusting the window_size or adding a step to check and enforce the token length of each node.
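
For example, the window metadata could be truncated to a token budget after parsing and before embedding. Here is a minimal sketch, assuming a HuggingFace tokenizer that matches the embed model (the model id and the budget of 2000 are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")  # assumption: tokenizer matching the embed model
MAX_WINDOW_TOKENS = 2000  # mirrors the max_length set on the embed model

def truncate_window(node, key="window"):
    # Re-encode the window and keep only the first MAX_WINDOW_TOKENS tokens.
    ids = tokenizer.encode(node.metadata[key], add_special_tokens=False)
    if len(ids) > MAX_WINDOW_TOKENS:
        node.metadata[key] = tokenizer.decode(ids[:MAX_WINDOW_TOKENS])
    return node

# `nodes` is a placeholder for the output of the node parser.
nodes = [truncate_window(n) for n in nodes]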


Bearsaerker commented 2 months ago

Unfortunately I really don't know how GitHub works, but I implemented a solution for my problem. I guess this would be a pull request? The size of the sentence window is now also dependent on the configured max token size. To make this completely irrelevant if you only want to use the sentence window as before, you could set a blatantly high token count so that it is effectively ignored. But so far I like this very much. Just as a matter of convention, how would I suggest a code change on GitHub @logan-markewich? I'm really very new to this lol. This is the new implementation of sentence_window.py:

"""Simple node parser."""

from typing import Any, Callable, List, Optional, Sequence

from transformers import AutoTokenizer
from llama_index.core.bridge.pydantic import Field
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.node_parser.interface import NodeParser
from llama_index.core.node_parser.node_utils import (
    build_nodes_from_splits,
    default_id_func,
)
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer
from llama_index.core.schema import BaseNode, Document
from llama_index.core.utils import get_tqdm_iterable

DEFAULT_WINDOW_SIZE = 3
DEFAULT_WINDOW_METADATA_KEY = "window"
DEFAULT_OG_TEXT_METADATA_KEY = "original_text"
DEFAULT_WINDOW_TOKEN_SIZE = 2000

class SentenceWindowNodeParser(NodeParser):
    """Sentence window node parser.

    Splits a document into Nodes, with each node being a sentence.
    Each node contains a window from the surrounding sentences in the metadata.

    Args:
        sentence_splitter (Optional[Callable]): splits text into sentences
        include_metadata (bool): whether to include metadata in nodes
        include_prev_next_rel (bool): whether to include prev/next relationships
    """

    sentence_splitter: Callable[[str], List[str]] = Field(
        default_factory=split_by_sentence_tokenizer,
        description="The text splitter to use when splitting documents.",
        exclude=True,
    )
    window_size: int = Field(
        default=DEFAULT_WINDOW_SIZE,
        description="The number of sentences on each side of a sentence to capture.",
        gt=0,
    )
    window_metadata_key: str = Field(
        default=DEFAULT_WINDOW_METADATA_KEY,
        description="The metadata key to store the sentence window under.",
    )
    original_text_metadata_key: str = Field(
        default=DEFAULT_OG_TEXT_METADATA_KEY,
        description="The metadata key to store the original sentence in.",
    )
    tokenizer: AutoTokenizer = Field(
        # NOTE: the default downloads this specific model's tokenizer from HF.
        default_factory=lambda: AutoTokenizer.from_pretrained(
            "mlabonne/NeuralDaredevil-8B-abliterated"
        ),
        description="The tokenizer to use for counting tokens.",
        exclude=True,
    )
    window_token_size: int = Field(
        default=DEFAULT_WINDOW_TOKEN_SIZE,
        description="The maximum token size for the window.",
        gt=0,
    )

    @classmethod
    def class_name(cls) -> str:
        return "SentenceWindowNodeParser"

    @classmethod
    def from_defaults(
        cls,
        sentence_splitter: Optional[Callable[[str], List[str]]] = None,
        window_size: int = DEFAULT_WINDOW_SIZE,
        window_token_size: int = DEFAULT_WINDOW_TOKEN_SIZE,
        window_metadata_key: str = DEFAULT_WINDOW_METADATA_KEY,
        original_text_metadata_key: str = DEFAULT_OG_TEXT_METADATA_KEY,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
        callback_manager: Optional[CallbackManager] = None,
        id_func: Optional[Callable[[int, Document], str]] = None,
    ) -> "SentenceWindowNodeParser":
        callback_manager = callback_manager or CallbackManager([])
        sentence_splitter = sentence_splitter or split_by_sentence_tokenizer()
        id_func = id_func or default_id_func

        return cls(
            sentence_splitter=sentence_splitter,
            window_size=window_size,
            window_token_size=window_token_size,
            window_metadata_key=window_metadata_key,
            original_text_metadata_key=original_text_metadata_key,
            include_metadata=include_metadata,
            include_prev_next_rel=include_prev_next_rel,
            callback_manager=callback_manager,
            id_func=id_func,
        )

    def _parse_nodes(
        self,
        nodes: Sequence[BaseNode],
        show_progress: bool = False,
        **kwargs: Any,
    ) -> List[BaseNode]:
        """Parse document into nodes."""
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")

        for node in nodes_with_progress:
            nodes = self.build_window_nodes_from_documents([node])
            all_nodes.extend(nodes)

        return all_nodes

    def build_window_nodes_from_documents(
        self, documents: Sequence[Document]
    ) -> List[BaseNode]:
        """Build window nodes from documents."""
        all_nodes: List[BaseNode] = []
        for doc in documents:
            text = doc.text
            text_splits = self.sentence_splitter(text)
            nodes = build_nodes_from_splits(
                text_splits,
                doc,
                id_func=self.id_func,
            )

            # Add a window to each node, stopping once the token budget is hit
            for i, node in enumerate(nodes):
                window_nodes = []
                window_token_count = 0
                start = max(0, i - self.window_size)
                end = min(i + self.window_size + 1, len(nodes))
                for j in range(start, end):
                    window_node = nodes[j]
                    # Count the candidate sentence's tokens before admitting it
                    window_node_tokens = self.tokenizer(
                        window_node.text, return_tensors="pt"
                    )
                    window_token_count += window_node_tokens.input_ids.size(-1)
                    if window_token_count > self.window_token_size:
                        # Budget exhausted: truncate the window on the right
                        break
                    window_nodes.append(window_node)

                node.metadata[self.window_metadata_key] = " ".join(
                    [n.text for n in window_nodes]
                )
                node.metadata[self.original_text_metadata_key] = node.text

                # Exclude window metadata from embed and LLM
                node.excluded_embed_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )
                node.excluded_llm_metadata_keys.extend(
                    [self.window_metadata_key, self.original_text_metadata_key]
                )

            all_nodes.extend(nodes)
        return all_nodes
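
For reference, this is how the modified parser would be used (a small usage sketch; the values just mirror my setup above):

# Hypothetical usage: the window stops growing once it reaches
# window_token_size tokens, regardless of window_size.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=14,
    window_token_size=2000,  # matches the embed model's max_length
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)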