langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

langchain.text_splitter "add_start_index" option broken for create_documents() when splitting text by token count rather than character count #17642

Closed: danielclymer closed this issue 1 month ago

danielclymer commented 5 months ago

Example Code

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_to_split = (
    "any text can be put here if I am splitting from_tiktoken_encoder and "
    "have a chunk_overlap greater than 0 it will not work. The start_index "
    "metadata will have intermittent -1 values in it."
)

# from_tiktoken_encoder is a classmethod, so it is called on the class rather
# than on an instance. It installs its own token-based length_function, so no
# length_function argument is passed here; other constructor options ride
# along as keyword arguments. add_start_index=True is required to populate
# the start_index metadata this issue is about.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=20,
    chunk_overlap=10,
    add_start_index=True,
    is_separator_regex=False,
)

split_texts = text_splitter.create_documents([text_to_split])
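
Printing the metadata makes the symptom visible. This inspection loop is illustrative and assumes only that create_documents returns Document objects carrying a metadata dict:

for doc in split_texts:
    # Correct values are character offsets >= 0; the bug shows up as
    # intermittent -1 values, since str.find() returns -1 when the
    # miscomputed search offset has already skipped past the chunk.
    print(doc.metadata["start_index"], repr(doc.page_content))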

Error Message and Stack Trace (if applicable)

No response

Description

The error occurs when splitting via from_tiktoken_encoder (i.e. by token count) rather than by character count, and only when chunk_overlap is greater than 0. It is caused by line 150 of text_splitter.py:

offset = index + previous_chunk_len - self._chunk_overlap

That line cannot compute a correct offset: self._chunk_overlap is specified as a token count, while index and previous_chunk_len (and therefore offset) are character counts, so the subtraction mixes units.
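
To make the unit mismatch concrete, here is a minimal sketch (the cl100k_base encoding and the sample string are chosen purely for illustration): a 10-token overlap almost never spans exactly 10 characters, so subtracting a token count from a character index leaves the search offset too far forward.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sample = "The start_index metadata will have intermittent -1 values in it."
token_ids = enc.encode(sample)

overlap_tokens = 10
# Count how many characters the last 10 tokens actually cover:
overlap_chars = len(enc.decode(token_ids[-overlap_tokens:]))
print(f"{overlap_tokens} tokens span {overlap_chars} characters")
# English text averages roughly 4 characters per token, so subtracting 10
# (tokens) instead of ~40 (characters) leaves offset past the next chunk's
# real start; text.find(chunk, offset) then overshoots and returns -1.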

System Info

aiohttp==3.9.3 aiosignal==1.3.1 anyio==3.5.0 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 asttokens==2.0.5 async-timeout==4.0.3 attrs==22.1.0 backcall==0.2.0 beautifulsoup4==4.11.1 black==22.6.0 bleach==4.1.0 blinker==1.4 boto3==1.24.28 botocore==1.27.96 certifi==2022.12.7 cffi==1.15.1 chardet==4.0.0 charset-normalizer==2.0.4 click==8.0.4 comm==0.1.2 contourpy==1.0.5 cryptography==39.0.1 cycler==0.11.0 Cython==0.29.32 databricks-sdk==0.1.6 dataclasses-json==0.6.4 dbus-python==1.2.18 debugpy==1.6.7 decorator==5.1.1 defusedxml==0.7.1 distlib==0.3.7 distro==1.7.0 distro-info==1.1+ubuntu0.2 docopt==0.6.2 docstring-to-markdown==0.11 entrypoints==0.4 executing==0.8.3 facets-overview==1.1.1 fastjsonschema==2.19.0 filelock==3.12.4 fonttools==4.25.0 frozenlist==1.4.1 googleapis-common-protos==1.61.0 greenlet==3.0.3 grpcio==1.48.2 grpcio-status==1.48.1 h11==0.14.0 httpcore==1.0.3 httplib2==0.20.2 httpx==0.26.0 idna==3.4 importlib-metadata==4.6.4 ipykernel==6.25.0 ipython==8.14.0 ipython-genutils==0.2.0 ipywidgets==7.7.2 jedi==0.18.1 jeepney==0.7.1 Jinja2==3.1.2 jmespath==0.10.0 joblib==1.2.0 jsonpatch==1.33 jsonpointer==2.4 jsonschema==4.17.3 jupyter-client==7.3.4 jupyter-server==1.23.4 jupyter_core==5.2.0 jupyterlab-pygments==0.1.2 jupyterlab-widgets==1.0.0 keyring==23.5.0 kiwisolver==1.4.4 langchain==0.1.7 langchain-community==0.0.20 langchain-core==0.1.23 langsmith==0.0.87 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 lxml==4.9.1 MarkupSafe==2.1.1 marshmallow==3.20.2 matplotlib==3.7.0 matplotlib-inline==0.1.6 mccabe==0.7.0 mistune==0.8.4 more-itertools==8.10.0 multidict==6.0.5 mypy-extensions==0.4.3 nbclassic==0.5.2 nbclient==0.5.13 nbconvert==6.5.4 nbformat==5.7.0 nest-asyncio==1.5.6 nodeenv==1.8.0 notebook==6.5.2 notebook_shim==0.2.2 num2words==0.5.13 numpy==1.23.5 oauthlib==3.2.0 openai==1.12.0 packaging==23.2 pandas==1.5.3 pandocfilters==1.5.0 parso==0.8.3 pathspec==0.10.3 patsy==0.5.3 pexpect==4.8.0 pickleshare==0.7.5 Pillow==9.4.0 platformdirs==2.5.2 plotly==5.9.0 pluggy==1.0.0 prometheus-client==0.14.1 prompt-toolkit==3.0.36 protobuf==4.24.0 psutil==5.9.0 psycopg2==2.9.3 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==8.0.0 pyarrow-hotfix==0.5 pycparser==2.21 pydantic==1.10.6 pyflakes==3.1.0 Pygments==2.11.2 PyGObject==3.42.1 PyJWT==2.3.0 pyodbc==4.0.32 pyparsing==3.0.9 pyright==1.1.294 pyrsistent==0.18.0 python-apt==2.4.0+ubuntu2 python-dateutil==2.8.2 python-lsp-jsonrpc==1.1.1 python-lsp-server==1.8.0 pytoolconfig==1.2.5 pytz==2022.7 PyYAML==6.0.1 pyzmq==23.2.0 regex==2023.12.25 requests==2.28.1 rope==1.7.0 s3transfer==0.6.2 scikit-learn==1.1.1 scipy==1.10.0 seaborn==0.12.2 SecretStorage==3.3.1 Send2Trash==1.8.0 six==1.16.0 sniffio==1.2.0 soupsieve==2.3.2.post1 SQLAlchemy==2.0.27 ssh-import-id==5.11 stack-data==0.2.0 statsmodels==0.13.5 tenacity==8.1.0 terminado==0.17.1 threadpoolctl==2.2.0 tiktoken==0.6.0 tinycss2==1.2.1 tokenize-rt==4.2.1 tomli==2.0.1 tornado==6.1 tqdm==4.66.2 traitlets==5.7.1 typing-inspect==0.9.0 typing_extensions==4.9.0 ujson==5.4.0 unattended-upgrades==0.1 urllib3==1.26.14 virtualenv==20.16.7 wadllib==1.3.6 wcwidth==0.2.5 webencodings==0.5.1 websocket-client==0.58.0 whatthepatch==1.0.2 widgetsnbextension==3.6.1 yapf==0.33.0 yarl==1.9.4 zipp==1.0.0

dosubot[bot] commented 5 months ago

🤖

Hey @danielclymer, welcome back! Hope you're ready for another deep dive into the world of code with us. 😊

Thank you for the detailed report. It seems like you've identified a potential issue with the add_start_index option in the RecursiveCharacterTextSplitter class when splitting text by token count using the from_tiktoken_encoder method with a chunk_overlap greater than 0.

As you've correctly pointed out, the issue seems to stem from the calculation of the offset in the create_documents method. The offset is calculated as a number of characters, but when splitting by token count, the chunk_overlap is specified as a token count. This discrepancy can lead to incorrect start_index values.

This issue appears to be similar to a previously resolved issue in the LangChain repository: [BUG] Inconsistent results with RecursiveCharacterTextSplitter's add_start_index=True. The solution to that issue was implemented in PR #16583 and merged into the main branch.

However, it's important to note that the previous issue was not exactly the same as the one you're experiencing. The previous issue was related to inconsistent results with the add_start_index option, but it did not specifically address the problem of splitting by token count with a chunk_overlap greater than 0.

Unfortunately, without the ability to create pull requests or issues, I'm unable to provide a direct solution to this problem. However, I would recommend checking whether the issue persists in the latest version of LangChain (v0.1.7); if it does, it may be worth bringing this to the attention of the LangChain developers directly.

In the meantime, as a workaround, you might consider adjusting your chunk_overlap value or splitting by character count instead of token count, if possible.
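
Another option is to recompute start_index yourself after splitting, ignoring the library-computed offsets entirely. The helper below is a minimal sketch, not part of LangChain's API:

def recompute_start_indices(text, docs):
    # Scan forward through the source text so duplicate chunks (common
    # when chunk_overlap > 0) do not all map to the first occurrence.
    search_from = 0
    for doc in docs:
        idx = text.find(doc.page_content, search_from)
        doc.metadata["start_index"] = idx  # stays -1 only if truly absent
        if idx != -1:
            search_from = idx + 1
    return docs

split_texts = recompute_start_indices(text_to_split, split_texts)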

I hope this information is helpful and I appreciate your understanding. If you have any further questions or need additional clarification, please don't hesitate to ask.
