Closed danielclymer closed 1 month ago
🤖
Hey @danielclymer, welcome back! Hope you're ready for another deep dive into the world of code with us. 😊
Thank you for the detailed report. It seems like you've identified a potential issue with the add_start_index option in the RecursiveCharacterTextSplitter class when splitting text by token count using the from_tiktoken_encoder method with a chunk_overlap greater than 0.
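As you've correctly pointed out, the issue seems to stem from the calculation of the offset in the create_documents method. The offset is calculated as a number of characters, but when splitting by token count, the chunk_overlap is specified as a token count. Because a token usually spans several characters, subtracting a token count from a character position can place the offset past the true start of the next chunk, which leads to incorrect start_index values (including the -1 values you observed).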
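The unit mismatch can be illustrated without LangChain or tiktoken. The following is a minimal sketch, not the library's code: it assumes one whitespace-separated word counts as one token, the helper names split_by_words and start_indices_with_mixed_units are hypothetical, and the use of str.find to locate each chunk from the offset is an assumption about how the offset is consumed. Only the offset formula itself is taken from the report.

```python
def split_by_words(text, chunk_size, chunk_overlap):
    """Toy token splitter: one whitespace-separated word counts as one token."""
    words = text.split(" ")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def start_indices_with_mixed_units(text, chunks, chunk_overlap):
    """Locate chunks using the offset formula quoted in the issue."""
    index = 0
    previous_chunk_len = 0
    starts = []
    for chunk in chunks:
        # previous_chunk_len is measured in characters,
        # chunk_overlap is measured in tokens (words here).
        offset = index + previous_chunk_len - chunk_overlap
        # Assumption: the chunk is searched for starting at this offset.
        index = text.find(chunk, max(0, offset))
        starts.append(index)
        previous_chunk_len = len(chunk)
    return starts


text = "alpha bravo charlie delta echo foxtrot golf hotel"
chunks = split_by_words(text, chunk_size=4, chunk_overlap=2)
starts = start_indices_with_mixed_units(text, chunks, chunk_overlap=2)
print(starts)  # [0, -1, 26]
```

The second chunk really starts at character 12, but the search begins at character 23, because only 2 (tokens) was subtracted where roughly 13 characters of overlap were needed. The lookup therefore fails and returns -1, while the third chunk happens to be found again, which matches the intermittent pattern described in the report.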
This issue appears to be similar to a previously resolved issue in the LangChain repository: [BUG] Inconsistent results with RecursiveCharacterTextSplitter's add_start_index=True. The solution to that issue was implemented in PR #16583 and merged into the main branch.
However, the previous issue was not exactly the same as the one you're experiencing. It concerned inconsistent results with the add_start_index option, but it did not specifically address splitting by token count with a chunk_overlap greater than 0.
Unfortunately, without the ability to create pull requests or issues, I'm unable to provide a direct solution to this problem. I would recommend checking whether the issue persists in the latest version of LangChain (v0.1.7). If it does, it might be worth bringing this to the attention of the LangChain developers directly.
In the meantime, as a workaround, you might consider adjusting your chunk_overlap value or splitting by character count instead of token count, if possible.
I hope this information is helpful and I appreciate your understanding. If you have any further questions or need additional clarification, please don't hesitate to ask.
Checked other resources
Example Code
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_to_split = (
    'any text can be put here if I am splitting from_tiktoken_encoder and have a '
    'chunk_overlap greater than 0 it will not work. The start_index metadata will '
    'have intermittent -1 values in it.'
)

# from_tiktoken_encoder is a classmethod that installs its own token-based
# length function, so it is called on the class and length_function is not passed.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=10,
    add_start_index=True,  # required for start_index to appear in the metadata
)
split_texts = text_splitter.create_documents([text_to_split])
Error Message and Stack Trace (if applicable)
No response
Description
The error occurs when splitting with from_tiktoken_encoder rather than by character count, and when specifying a chunk_overlap greater than 0. It is caused by line 150 of text_splitter.py:
offset = index + previous_chunk_len - self._chunk_overlap
This does not produce the correct offset, because self._chunk_overlap is specified as a token count while that line calculates the offset as a number of characters.
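One possible way to sidestep the mixed units is to recompute the start indices after splitting, without using chunk_overlap at all. The sketch below is only an illustration under stated assumptions, not LangChain's fix: the helper name locate_chunks is hypothetical, each chunk is assumed to appear verbatim in the original text, and each chunk is assumed to begin strictly after the start of the previous one.

```python
def locate_chunks(text, chunks):
    """Return the character start index of each chunk, or -1 if not found."""
    starts = []
    search_from = 0
    for chunk in chunks:
        # Search from just after the previous chunk's start, so overlapping
        # chunks are still found and no overlap size needs to be known.
        index = text.find(chunk, search_from)
        starts.append(index)
        if index != -1:
            search_from = index + 1
    return starts


text = "alpha bravo charlie delta echo foxtrot golf hotel"
overlapping_chunks = [
    "alpha bravo charlie delta",
    "charlie delta echo foxtrot",
    "echo foxtrot golf hotel",
]
recomputed = locate_chunks(text, overlapping_chunks)
print(recomputed)  # [0, 12, 26]
```

With the example code from this report, the chunks would be [doc.page_content for doc in split_texts]. Note that if the text contains repeated passages, this search may match an earlier occurrence than the one the splitter actually produced, so the result should be treated as a best effort.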
System Info
aiohttp==3.9.3 aiosignal==1.3.1 anyio==3.5.0 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 asttokens==2.0.5 async-timeout==4.0.3 attrs==22.1.0 backcall==0.2.0 beautifulsoup4==4.11.1 black==22.6.0 bleach==4.1.0 blinker==1.4 boto3==1.24.28 botocore==1.27.96 certifi==2022.12.7 cffi==1.15.1 chardet==4.0.0 charset-normalizer==2.0.4 click==8.0.4 comm==0.1.2 contourpy==1.0.5 cryptography==39.0.1 cycler==0.11.0 Cython==0.29.32 databricks-sdk==0.1.6 dataclasses-json==0.6.4 dbus-python==1.2.18 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 distlib==0.3.7 distro==1.7.0 distro-info==1.1+ubuntu0.2 docopt==0.6.2 docstring-to-markdown==0.11 entrypoints==0.4 executing==0.8.3 facets-overview==1.1.1 fastjsonschema==2.19.0 filelock==3.12.4 fonttools==4.25.0 frozenlist==1.4.1 googleapis-common-protos==1.61.0 greenlet==3.0.3 grpcio==1.48.2 grpcio-status==1.48.1 h11==0.14.0 httpcore==1.0.3 httplib2==0.20.2 httpx==0.26.0 idna==3.4 importlib-metadata==4.6.4 ipykernel==6.25.0 ipython==8.14.0 ipython-genutils==0.2.0 ipywidgets==7.7.2 jedi==0.18.1 jeepney==0.7.1 Jinja2==3.1.2 jmespath==0.10.0 joblib==1.2.0 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.17.3 jupyter-client==7.3.4 jupyter-server==1.23.4 jupyter_core==5.2.0 jupyterlab-pygments==0.1.2 jupyterlab-widgets==1.0.0 keyring==23.5.0 kiwisolver==1.4.4 langchain==0.1.7 langchain-community==0.0.20 langchain-core==0.1.23 langsmith==0.0.87 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 lxml==4.9.1 MarkupSafe==2.1.1 marshmallow==3.20.2 matplotlib==3.7.0 matplotlib-inline==0.1.6 mccabe==0.7.0 mistune==0.8.4 more-itertools==8.10.0 multidict==6.0.5 mypy-extensions==0.4.3 nbclassic==0.5.2 nbclient==0.5.13 nbconvert==6.5.4 nbformat==5.7.0 nest-asyncio==1.5.6 nodeenv==1.8.0 notebook==6.5.2 notebook_shim==0.2.2 num2words==0.5.13 numpy==1.23.5 oauthlib==3.2.0 openai==1.12.0 packaging==23.2 pandas==1.5.3 pandocfilters==1.5.0 parso==0.8.3 pathspec==0.10.3 patsy==0.5.3 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.4.0 
platformdirs==2.5.2 plotly==5.9.0 pluggy==1.0.0 prometheus-client==0.14.1 prompt-toolkit==3.0.36 protobuf==4.24.0 psutil==5.9.0 psycopg2==2.9.3 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==8.0.0 pyarrow-hotfix==0.5 pycparser==2.21 pydantic==1.10.6 pyflakes==3.1.0 Pygments==2.11.2 PyGObject==3.42.1 PyJWT==2.3.0 pyodbc==4.0.32 pyparsing==3.0.9 pyright==1.1.294 pyrsistent==0.18.0 python-apt==2.4.0+ubuntu2 python-dateutil==2.8.2 python-lsp-jsonrpc==1.1.1 python-lsp-server==1.8.0 pytoolconfig==1.2.5 pytz==2022.7 PyYAML==6.0.1 pyzmq==23.2.0 regex==2023.12.25 requests==2.28.1 rope==1.7.0 s3transfer==0.6.2 scikit-learn==1.1.1 scipy==1.10.0 seaborn==0.12.2 SecretStorage==3.3.1 Send2Trash==1.8.0 six==1.16.0 sniffio==1.2.0 soupsieve==2.3.2.post1 SQLAlchemy==2.0.27 ssh-import-id==5.11 stack-data==0.2.0 statsmodels==0.13.5 tenacity==8.1.0 terminado==0.17.1 threadpoolctl==2.2.0 tiktoken==0.6.0 tinycss2==1.2.1 tokenize-rt==4.2.1 tomli==2.0.1 tornado==6.1 tqdm==4.66.2 traitlets==5.7.1 typing-inspect==0.9.0 typing_extensions==4.9.0 ujson==5.4.0 unattended-upgrades==0.1 urllib3==1.26.14 virtualenv==20.16.7 wadllib==1.3.6 wcwidth==0.2.5 webencodings==0.5.1 websocket-client==0.58.0 whatthepatch==1.0.2 widgetsnbextension==3.6.1 yapf==0.33.0 yarl==1.9.4 zipp==1.0.0