run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Sentence Splitter with hyperlinks #14805

Open truski22 opened 1 month ago

truski22 commented 1 month ago

Question

I am implementing a sentence splitter for texts. If the text contains hyperlinks, the splitter behaves differently when run inside an IngestionPipeline than when run outside the pipeline like this:

file_splitted = splitter.split_text_metadata_aware(text, node.get_metadata_str())

The hyperlinks come out differently, but the rest of the text does not. I'm confused by this behavior.

dosubot[bot] commented 1 month ago

Hey @truski22! 👋 I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer. Let's squash those bugs together!

To ensure consistent hyperlink handling both inside and outside the IngestionPipeline when using the splitter.split_text_metadata_aware method, you need to customize the tokenizer or the splitting functions to recognize and preserve hyperlinks.

Here's an example of how you might modify the SentenceSplitter class to handle hyperlinks consistently:

  1. Custom Tokenizer: Modify the tokenizer to treat hyperlinks as single tokens.
  2. Custom Split Functions: Ensure that the split functions do not break hyperlinks.
import re
from typing import List, Optional, Callable
from llama_index.core.node_parser.text.sentence import SentenceSplitter
from llama_index.core.callbacks.base import CallbackManager
from llama_index.core.schema import Document

def custom_tokenizer(text: str) -> List[str]:
    """Custom tokenizer that treats hyperlinks as single tokens."""
    # Regex to match bare hyperlinks
    hyperlink_regex = r'https?://\S+|www\.\S+'
    tokens: List[str] = []
    pos = 0
    # Walk through the text, emitting each hyperlink as exactly one token
    # and whitespace-splitting everything in between.
    for match in re.finditer(hyperlink_regex, text):
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens

class CustomSentenceSplitter(SentenceSplitter):
    def __init__(
        self,
        chunk_size: int = 1024,
        chunk_overlap: int = 200,
        tokenizer: Optional[Callable] = custom_tokenizer,
        callback_manager: Optional[CallbackManager] = None,
        separator: str = " ",
        paragraph_separator: str = "\n\n\n",
        secondary_chunking_regex: str = "[^,.;。？！]+[,.;。？！]?",
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
        id_func: Optional[Callable[[int, Document], str]] = None,
    ):
        super().__init__(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            tokenizer=tokenizer,
            callback_manager=callback_manager,
            separator=separator,
            paragraph_separator=paragraph_separator,
            secondary_chunking_regex=secondary_chunking_regex,
            include_metadata=include_metadata,
            include_prev_next_rel=include_prev_next_rel,
            id_func=id_func,
        )

    def _split(self, text: str, chunk_size: int) -> List[str]:
        """Override _split to ensure hyperlinks are not broken."""
        if len(self._tokenizer(text)) <= chunk_size:
            return [text]

        for split_fn in self._split_fns:
            splits = split_fn(text)
            if len(splits) > 1:
                break

        new_splits = []
        for split in splits:
            split_len = len(self._tokenizer(split))
            if split_len <= chunk_size:
                new_splits.append(split)
            else:
                # recursively split
                new_splits.extend(self._split(split, chunk_size=chunk_size))
        return new_splits

# Usage (`text` and `metadata_str` are assumed to be defined elsewhere)
splitter = CustomSentenceSplitter()
chunks = splitter.split_text_metadata_aware(text, metadata_str)

In this example, the custom_tokenizer function uses a regular expression to identify hyperlinks and treat them as single tokens. The CustomSentenceSplitter class then uses this tokenizer to ensure that hyperlinks are not broken during the splitting process. This approach ensures consistent handling of hyperlinks both inside and outside the IngestionPipeline [1][2][3][4].
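As a quick sanity check, a hyperlink-aware tokenizer along these lines can be exercised on its own, with no LlamaIndex dependency. This is a standalone sketch; the regex and the function name are illustrative, not the library's own tokenizer:

```python
import re
from typing import List

def hyperlink_aware_tokenizer(text: str) -> List[str]:
    """Split on whitespace, but emit each bare URL as exactly one token."""
    hyperlink_regex = r"https?://\S+|www\.\S+"
    tokens: List[str] = []
    pos = 0
    for match in re.finditer(hyperlink_regex, text):
        # Words before the hyperlink, split on whitespace
        tokens.extend(text[pos:match.start()].split())
        # The entire hyperlink as a single token
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens

tokens = hyperlink_aware_tokenizer("see https://example.com/a b c")
```

Because the URL is one token, a chunk-size budget counted with this tokenizer can never land a split boundary in the middle of a link.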

To continue talking to Dosu, mention @dosu.

truski22 commented 1 month ago

@dosu Why doesn't the problem appear with the IngestionPipeline? The problem shows up when I move the splitter out of the IngestionPipeline.

dosubot[bot] commented 1 month ago

The problem with hyperlink display occurring only outside the IngestionPipeline is likely due to the transformations applied within the IngestionPipeline. Specifically, the SentenceSplitter and HuggingFaceEmbedding transformations might be altering the text in a way that affects hyperlink handling.

To ensure consistent behavior for hyperlink handling both inside and outside the IngestionPipeline, you need to ensure that the transformations applied within the IngestionPipeline do not alter the hyperlinks. You can achieve this by modifying or adding custom transformations that preserve hyperlinks.

Here is an example of how you might modify the IngestionPipeline to include a custom transformation that preserves hyperlinks:

from llama_index.core.ingestion.pipeline import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.schema import Document

class HyperlinkPreservingSplitter(SentenceSplitter):
    def split_text(self, text: str) -> list:
        # Custom logic to split text while preserving hyperlinks
        # would go here; this placeholder just delegates to the parent.
        return super().split_text(text)

# Example usage
documents = [
    Document(text="This is a test document with a hyperlink: http://example.com", doc_id="1"),
]

pipeline = IngestionPipeline(
    transformations=[
        HyperlinkPreservingSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-m3"),
    ],
    docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents)

By using a custom HyperlinkPreservingSplitter, you can ensure that hyperlinks are preserved during the ingestion process, leading to consistent behavior both inside and outside the IngestionPipeline [1][2].

To continue talking to Dosu, mention @dosu.

truski22 commented 1 month ago

That doesn't solve my problem at all. For example, this is the original text:

[Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada) •[Crear una cuenta](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)•[Acceder](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)Herra[mientas](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol) personales  |

After running the splitter with these two lines:

text = node.byte_content.decode(encoding=node.encoding) if isinstance(node,RawDocument) else node.text

file_splitted = splitter.split_text_metadata_aware(text, node.get_metadata_str())

the text is transformed to this:

[Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada) •[Crear un](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol)[a cuenta](https://es.wikipedia.org/w/index.php?title=Especial:Crear_una_cuenta&returnto=Copa+Mundial+de+F%C3%BAtbol)•[Acceder](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol)Herra[mientas](https://es.wikipedia.org/w/index.php?title=Especial:Entrar&returnto=Copa+Mundial+de+F%C3%BAtbol) personales  |

There are a few differences, and I don't understand why they appear. Another example. Original:

• [Artículo](https://es.wikipedia.org/wiki/Copa_Mundial_de_F%C3%BAtbol)

After the splitter:

• [Artículo](https://es.wikipedia.org/wiki/Discusi%C3%B3n:Copa_Mundial_de_F%C3%BAtbol)
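One detail worth noting about the examples above: the text uses markdown-style links, `[label](url)`, where a bare-URL regex such as `https?://\S+` would also swallow the closing parenthesis and would not protect the `[label]` part at all, so a split boundary can still land inside a link. A tokenizer that keeps the whole markdown link atomic would avoid that. The following is a standalone sketch under that assumption, not LlamaIndex API; the regex and names are illustrative:

```python
import re
from typing import List

# Matches a complete markdown link, e.g.
# [Buscar](https://es.wikipedia.org/wiki/Wikipedia:Portada)
MD_LINK = re.compile(r"\[[^\]]*\]\([^)\s]+\)")

def tokenize_keeping_md_links(text: str) -> List[str]:
    """Whitespace-tokenize, but keep each [label](url) link as one token."""
    tokens: List[str] = []
    pos = 0
    for m in MD_LINK.finditer(text):
        # Plain words before the link
        tokens.extend(text[pos:m.start()].split())
        # The whole link, label and URL together, as one token
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(text[pos:].split())
    return tokens

tokens = tokenize_keeping_md_links(
    "• [Artículo](https://es.wikipedia.org/wiki/Copa_Mundial) y más"
)
```

If chunk boundaries never fall inside a `[label](url)` unit, labels cannot be cut in half and a label from one link cannot end up re-paired with the URL of a neighboring link when overlapping chunks are reassembled.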