run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ValueError: Got a larger chunk overlap (20) than chunk size (100), should be smaller. #14985

Closed: sby-a-izumi closed this issue 1 month ago

sby-a-izumi commented 1 month ago

Bug Description

from llama_index.core.node_parser import SentenceSplitter

SentenceSplitter(chunk_size=100, chunk_overlap=20, separator='\n')

    Splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=separator)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ValueError: Got a larger chunk overlap (20) than chunk size (100), should be smaller.

Version

0.10.43

Steps to Reproduce

Same as in the bug description.

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

@sby-a-izumi Hello there! I'm here to help you with any bugs, questions, or contributions you might have. Let's squash those bugs together!

The ValueError occurs because the SentenceSplitter class in llama_index.core.node_parser has a validation check that ensures the chunk_overlap is not greater than the chunk_size. The relevant code snippet is:

if chunk_overlap > chunk_size:
    raise ValueError(
        f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
        f"({chunk_size}), should be smaller."
    )

In your case, you are using chunk_size=100 and chunk_overlap=20, which should not trigger this error. However, the default value for chunk_overlap is DEFAULT_CHUNK_OVERLAP, which is set to 200. If you did not explicitly set chunk_overlap to 20, it would use the default value of 200, which is greater than the chunk_size of 100, thus causing the error.

To resolve this, ensure that you explicitly set chunk_overlap to 20 when initializing the SentenceSplitter:

splitter = SentenceSplitter(chunk_size=100, chunk_overlap=20, separator='\n')

This will prevent the ValueError from being raised [1][2].


logan-markewich commented 1 month ago

Not able to reproduce: https://colab.research.google.com/drive/1-3RMZ0Mx2Yz-kVO0RcMOJaTj53G-2LV3?usp=sharing

sby-a-izumi commented 1 month ago

This error often occurred when performing an operation that overwrote the same address. The solution was to cast the type so that the error no longer occurred.
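The thread does not say which types were involved. One plausible reading, and it is only an assumption, is that chunk_size and chunk_overlap reached the constructor as strings (for example, read from a config file or form field). Python compares strings character by character, so "20" > "100" is true, and the check quoted earlier in this thread would raise with exactly the reported message. The sketch below uses a hypothetical stand-in function, check_overlap, that copies that check; it does not call llama_index itself.

```python
def check_overlap(chunk_size, chunk_overlap):
    """Hypothetical stand-in that mirrors the validation quoted in this thread."""
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
            f"({chunk_size}), should be smaller."
        )
    return chunk_size, chunk_overlap


def reproduces_error(chunk_size, chunk_overlap):
    """Return True if the check raises ValueError for these arguments."""
    try:
        check_overlap(chunk_size, chunk_overlap)
    except ValueError:
        return True
    return False


# Assumed scenario: both values arrive as strings.
# "20" > "100" is True because "2" sorts after "1".
print(reproduces_error("100", "20"))  # True

# Casting to int restores numeric comparison, so the check passes.
print(reproduces_error(int("100"), int("20")))  # False
```

If this was the cause, casting both values with int() before passing them to SentenceSplitter would match the fix described above.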