Closed preemoDez closed 5 months ago
🤖
Hey @preemoDez, great to see you diving deep again! 🚀

Based on the traceback you provided, it seems like the `RuntimeError(StackOverflow)` is being caused by a recursive call in the `_split` method of the `SentenceSplitter` class. If the text being split is too large or complex, it could lead to a stack overflow due to the depth of the recursive calls.

Here is the relevant code from the `SentenceSplitter` class:
```python
def _split(self, text: str, chunk_size: int) -> List[str]:
    """Break text into splits that are smaller than chunk size."""
    if len(self._tokenizer(text)) <= chunk_size:
        return [text]

    for split_fn in self._split_fns:
        splits = split_fn(text)
        if len(splits) > 1:
            break

    new_splits = []
    for split in splits:
        split_len = len(self._tokenizer(split))
        if split_len <= chunk_size:
            new_splits.append(split)
        else:
            # recursively split
            new_splits.extend(self._split(split, chunk_size=chunk_size))
    return new_splits
```
To resolve this issue, you could refactor the `_split` method to use an iterative approach instead of a recursive one. This would prevent the stack overflow error by avoiding the deep call stack that comes with recursion. However, this might require a significant change in the logic of the method, and you would need to ensure that the new implementation still correctly splits the text as required.
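As a rough illustration, here is a standalone iterative version of the same splitting logic. This is a minimal sketch, not the LlamaIndex implementation; the function name, signature, and the handling of pieces that no split function can divide are assumptions:

```python
from typing import Callable, List


def split_iterative(
    text: str,
    chunk_size: int,
    tokenizer: Callable[[str], List[str]],
    split_fns: List[Callable[[str], List[str]]],
) -> List[str]:
    """Break text into splits smaller than chunk_size using an explicit stack,
    so very large inputs cannot overflow the call stack."""
    results: List[str] = []
    stack: List[str] = [text]  # pieces still to examine; top is processed next
    while stack:
        piece = stack.pop()
        if len(tokenizer(piece)) <= chunk_size:
            results.append(piece)
            continue
        # Find the first split function that actually divides the piece.
        splits = [piece]
        for split_fn in split_fns:
            splits = split_fn(piece)
            if len(splits) > 1:
                break
        if len(splits) <= 1:
            # No split function made progress; keep the oversized piece rather
            # than looping forever (the recursive version would recurse here).
            results.append(piece)
            continue
        # Push sub-pieces in reverse so they are popped in original order.
        stack.extend(reversed(splits))
    return results
```

Pushing the sub-pieces in reverse keeps the output in the same order as the recursive version, while the explicit stack bounds the call depth no matter how deeply the text has to be subdivided.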
As for the `SentenceSplitter` class, it is used to split a given text into smaller chunks or sentences. It takes in parameters such as `chunk_size` and `chunk_overlap` to determine the size of each chunk and the overlap between them. The `split_text` method is used to split a single text into chunks, while the `get_nodes_from_documents` method is used to get nodes from documents, where each node has a start and end character index.
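For reference, a typical usage sketch; the import paths assume a 0.9.x-era Llama Index install and the chunk sizes are purely illustrative:

```python
from llama_index import Document
from llama_index.node_parser import SentenceSplitter

# Chunk size and overlap are measured in tokens; these values are illustrative.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)

# Split a single string into chunks.
chunks = splitter.split_text("Some long document text ...")

# Build nodes (each carrying start/end character indices) from Document objects.
nodes = splitter.get_nodes_from_documents(
    [Document(text="Some long document text ...")]
)
```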
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
Bug Description
output-onlinefiletools.txt

I tried to split the above file using Llama Index Python. In #10554 I mentioned that this whole algorithm is $O(n^2)$; however, in this particular case it looks like the tokenizer causes the stack overflow:

```python
token_size = self._token_size(text)
```

A small issue is that `self._token_size(text)` is called twice: in the lower `if`, you can reuse the `token_size` calculated above, as sketched below.
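A minimal sketch of that reuse; the helper below is illustrative, not the actual `SentenceSplitter` code:

```python
from typing import Callable, List


def split_if_needed(
    text: str,
    chunk_size: int,
    token_size_fn: Callable[[str], int],
) -> List[str]:
    """Tokenize `text` once and reuse the count in every later check."""
    token_size = token_size_fn(text)  # single tokenizer call
    if token_size <= chunk_size:
        return [text]
    # Reuse `token_size` here instead of calling token_size_fn(text) again;
    # this matters when tokenization itself is the expensive step.
    n_parts = -(-token_size // chunk_size)          # ceil division
    step = -(-len(text) // n_parts)                 # characters per part
    return [text[i : i + step] for i in range(0, len(text), step)]
```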
Version
0.9.39
Steps to Reproduce
Relevant Logs/Tracebacks