Open loperamos opened 3 months ago
Adding some post clarification, I believe that the expected behaviour should be as the following code snippet, since I would expect we want to keep the chunk exactly as the original text. The keep separator flag should only affect when the separator is at the begining of a chunk.
if __name__ == "__main__":
# Wrong behaviour, using \s instead of regular space
splitter_keep = RecursiveCharacterTextSplitter(
separators=[r"\s"],
keep_separator=False,
is_separator_regex=True,
chunk_size=15,
chunk_overlap=0,
strip_whitespace=False)
assert splitter_keep.split_text("Hello world")[0] == r"Hello world"
I drafted a fix for it and will open an pull request once I have the time to add some test cases tomorrow
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
I am trying to use the
langchain
library to split a test using regex separators. I expect the output strings to contain the original separators, but what happens is that when using thekeep_separator
flag asFalse
it uses the regex value instead of the original separator.Possible code pointer where the problem might be coming from: libs/text-splitters/langchain_text_splitters/character.py#L98
System Info
langchain==0.2.5 langchain-core==0.2.9 langchain-text-splitters==0.2.1
Platform: Apple M1 Pro macOS: 14.5 (23F79)
python version: Python 3.12.3