langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
92.96k stars 14.92k forks source link

RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false #23394

Open loperamos opened 3 months ago

loperamos commented 3 months ago

Checked other resources

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello\sworld"

    # Expected behaviour, keeping regular space
    splitter_no_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=True,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_no_keep.split_text("Hello world")[0] == r"Hello world"

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain library to split a test using regex separators. I expect the output strings to contain the original separators, but what happens is that when using the keep_separator flag as False it uses the regex value instead of the original separator.

Possible code pointer where the problem might be coming from: libs/text-splitters/langchain_text_splitters/character.py#L98

System Info

langchain==0.2.5 langchain-core==0.2.9 langchain-text-splitters==0.2.1

Platform: Apple M1 Pro macOS: 14.5 (23F79)

python version: Python 3.12.3

loperamos commented 3 months ago

Adding some post clarification, I believe that the expected behaviour should be as the following code snippet, since I would expect we want to keep the chunk exactly as the original text. The keep separator flag should only affect when the separator is at the begining of a chunk.

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello world"
wulifu2hao commented 3 months ago

I drafted a fix for it and will open an pull request once I have the time to add some test cases tomorrow