Closed — antoniolanza1996 closed this issue 7 months ago
🤖
Hello @antoniolanza1996! Good to see you again. Thanks for diving deep into LangChain and coming up with such detailed observations. Let's get to the bottom of this!
Thank you for bringing this issue to our attention. The problem you're experiencing with the `RecursiveCharacterTextSplitter` in LangChain v0.0.334 is due to the way the `start_index` is calculated in the `create_documents` method.

In the current implementation, the `start_index` is calculated by calling the `find` method on the original text with the chunk as the argument. `find` returns the first occurrence of the chunk in the text, which can lead to incorrect `start_index` values if the chunk appears more than once in the text.
Here is the problematic code:
```python
def create_documents(
    self, texts: List[str], metadatas: Optional[List[dict]] = None
) -> List[Document]:
    """Create documents from a list of texts."""
    _metadatas = metadatas or [{}] * len(texts)
    documents = []
    for i, text in enumerate(texts):
        index = -1
        for chunk in self.split_text(text):
            metadata = copy.deepcopy(_metadatas[i])
            if self._add_start_index:
                index = text.find(chunk, index + 1)
                metadata["start_index"] = index
            new_doc = Document(page_content=chunk, metadata=metadata)
            documents.append(new_doc)
    return documents
```
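To see the failure concretely, here is a minimal standalone sketch of the same `find`-based loop. The text and chunks below are made up for illustration, not actual splitter output: when a later chunk's text also occurs earlier in the original text, `find` latches onto the earlier occurrence.

```python
# Minimal repro of the find-based start_index logic (illustrative data).
text = "abc abc def abc"
chunks = ["abc abc def", "abc"]  # the second chunk really starts at index 12

index = -1
starts = []
for chunk in chunks:
    # str.find returns the FIRST occurrence at or after index + 1
    index = text.find(chunk, index + 1)
    starts.append(index)

print(starts)  # [0, 4] -- the second chunk is wrongly placed inside the first
```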
To resolve this issue, you can modify the `create_documents` method to calculate the `start_index` from the cumulative length of the chunks instead of using the `find` method. This will ensure that the `start_index` is calculated for each chunk, even if the chunk appears more than once in the text.

Here is the modified code:
```python
def create_documents(
    self, texts: List[str], metadatas: Optional[List[dict]] = None
) -> List[Document]:
    """Create documents from a list of texts."""
    _metadatas = metadatas or [{}] * len(texts)
    documents = []
    for i, text in enumerate(texts):
        index = 0
        for chunk in self.split_text(text):
            metadata = copy.deepcopy(_metadatas[i])
            if self._add_start_index:
                metadata["start_index"] = index
                index += len(chunk)
            new_doc = Document(page_content=chunk, metadata=metadata)
            documents.append(new_doc)
    return documents
```
This modification ensures that the `start_index` is calculated for each chunk, preventing overlaps in the text passages.
Please note that this is a workaround and not an official fix. The LangChain team will need to address this issue in a future release.
I hope this helps! If you have any other questions, feel free to ask.
Nice catch on the bug: it was indeed the wrong use of `find`...

However, the provided solution is wrong (consider also that the `strip_whitespace` parameter can strip whitespace from the chunks, so cumulative chunk lengths no longer match offsets in the original text).
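For illustration, here is a minimal sketch (with made-up chunks) of why summing chunk lengths breaks down as soon as consecutive chunks share text, which is exactly what `chunk_overlap` produces; stripped whitespace shifts the offsets in the same way.

```python
# Illustrative data: with chunk overlap, consecutive chunks share text,
# so cumulative lengths overshoot the real offsets.
text = "aaa bbb ccc"
chunks = ["aaa bbb", "bbb ccc"]  # "bbb" is the shared overlap

index = 0
starts = []
for chunk in chunks:
    starts.append(index)
    index += len(chunk)

print(starts)  # [0, 7] -- but "bbb ccc" actually starts at index 4
```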
A possible solution has been pushed in PR #16583
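Not the exact PR code, but a sketch of the general idea: keep `text.find`, and start the search from an offset derived from the previous chunk's position and length minus the configured overlap, so duplicate occurrences that appear earlier in the text are skipped. The helper `start_indexes` and its parameters are hypothetical names for illustration.

```python
# Sketch of the overlap-aware approach (hypothetical helper, not the PR code):
# start the search past the previous chunk, backed off by the overlap.
def start_indexes(text, chunks, chunk_overlap=0):
    indexes = []
    index = 0
    previous_chunk_len = 0
    for chunk in chunks:
        offset = index + previous_chunk_len - chunk_overlap
        index = text.find(chunk, max(0, offset))
        indexes.append(index)
        previous_chunk_len = len(chunk)
    return indexes

# The duplicate-chunk case from above is now resolved correctly:
print(start_indexes("abc abc def abc", ["abc abc def", "abc"]))  # [0, 12]
```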
PR #16583 has been merged. This issue is now fixed.
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
I'm trying to use `RecursiveCharacterTextSplitter` with `add_start_index=True`, but I found some texts where the `start_index` is wrong. For example:

- with `text1` in the code, the 6th passage has (4412, 4418), but it is overlapped with the 5th passage, which has (4088, 5111)... this is wrong
- with a modified `text1` str (i.e. `text2`), the 6th passage now has (5112, 5119) and it's correct

System Info

langchain 0.0.334 with python 3.8