RecursiveCharacterTextSplitter uses regex value instead of original separator when merging and keep_separator is false

loperamos commented 3 months ago

Checked other resources

[X] I added a very descriptive title to this issue.
[X] I searched the LangChain documentation with the integrated search.
[X] I used the GitHub search to find a similar question and didn't find it.
[X] I am sure that this is a bug in LangChain rather than my code.
[X] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello\sworld"

    # Expected behaviour, keeping regular space
    splitter_no_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=True,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_no_keep.split_text("Hello world")[0] == r"Hello world"

Error Message and Stack Trace (if applicable)

No response

Description

I am trying to use the langchain library to split a test using regex separators. I expect the output strings to contain the original separators, but what happens is that when using the keep_separator flag as False it uses the regex value instead of the original separator.

Possible code pointer where the problem might be coming from: libs/text-splitters/langchain_text_splitters/character.py#L98

System Info

langchain==0.2.5 langchain-core==0.2.9 langchain-text-splitters==0.2.1

Platform: Apple M1 Pro macOS: 14.5 (23F79)

python version: Python 3.12.3

loperamos commented 3 months ago

Adding some post clarification, I believe that the expected behaviour should be as the following code snippet, since I would expect we want to keep the chunk exactly as the original text. The keep separator flag should only affect when the separator is at the begining of a chunk.

if __name__ == "__main__":
    # Wrong behaviour, using \s instead of regular space
    splitter_keep = RecursiveCharacterTextSplitter(
        separators=[r"\s"],
        keep_separator=False,
        is_separator_regex=True,
        chunk_size=15,
        chunk_overlap=0,
        strip_whitespace=False)
    assert splitter_keep.split_text("Hello world")[0] == r"Hello world"

wulifu2hao commented 3 months ago

I drafted a fix for it and will open an pull request once I have the time to add some test cases tomorrow

langchain-ai / langchain