Closed preemoDez closed 5 months ago
🤖
Hey @preemoDez, great to see you diving deep again! 🚀

Based on the traceback you provided, it seems like the `RuntimeError(StackOverflow)` is being caused by a recursive call in the `_split` method of the `SentenceSplitter` class. If the text being split is too large or complex, it could lead to a stack overflow due to the depth of the recursive calls.

Here is the relevant code from the `SentenceSplitter` class:
```python
def _split(self, text: str, chunk_size: int) -> List[str]:
    """Break text into splits that are smaller than chunk size."""
    if len(self._tokenizer(text)) <= chunk_size:
        return [text]

    for split_fn in self._split_fns:
        splits = split_fn(text)
        if len(splits) > 1:
            break

    new_splits = []
    for split in splits:
        split_len = len(self._tokenizer(split))
        if split_len <= chunk_size:
            new_splits.append(split)
        else:
            # recursively split
            new_splits.extend(self._split(split, chunk_size=chunk_size))
    return new_splits
```
To resolve this issue, you could refactor the `_split` method to use an iterative approach instead of a recursive one. This would prevent the stack overflow error by avoiding the deep call stack that comes with recursion. However, this might require a significant change in the logic of the method, and you would need to ensure that the new implementation still correctly splits the text as required.
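As a rough illustration, here is a standalone iterative version of the same splitting logic. This is a minimal sketch, not the LlamaIndex implementation; the function name, signature, and the handling of pieces that no split function can divide are assumptions:

```python
from typing import Callable, List


def split_iterative(
    text: str,
    chunk_size: int,
    tokenizer: Callable[[str], List[str]],
    split_fns: List[Callable[[str], List[str]]],
) -> List[str]:
    """Break text into splits smaller than chunk_size using an explicit stack,
    so very large inputs cannot overflow the call stack."""
    results: List[str] = []
    stack: List[str] = [text]  # pieces still to examine; top is processed next
    while stack:
        piece = stack.pop()
        if len(tokenizer(piece)) <= chunk_size:
            results.append(piece)
            continue
        # Find the first split function that actually divides the piece.
        splits = [piece]
        for split_fn in split_fns:
            splits = split_fn(piece)
            if len(splits) > 1:
                break
        if len(splits) <= 1:
            # No split function made progress; keep the oversized piece rather
            # than looping forever (the recursive version would recurse here).
            results.append(piece)
            continue
        # Push sub-pieces in reverse so they are popped in original order.
        stack.extend(reversed(splits))
    return results
```

Pushing the sub-pieces in reverse keeps the output in the same order as the recursive version, while the explicit stack bounds the call depth no matter how deeply the text has to be subdivided.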
As for the `SentenceSplitter` class, it is used to split a given text into smaller chunks or sentences. It takes in parameters such as `chunk_size` and `chunk_overlap` to determine the size of each chunk and the overlap between them. The `split_text` method is used to split a single text into chunks, while the `get_nodes_from_documents` method is used to get nodes from documents, where each node has a start and end character index.
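For reference, a typical usage sketch; the import paths assume a 0.9.x-era Llama Index install and the chunk sizes are purely illustrative:

```python
from llama_index import Document
from llama_index.node_parser import SentenceSplitter

# Chunk size and overlap are measured in tokens; these values are illustrative.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)

# Split a single string into chunks.
chunks = splitter.split_text("Some long document text ...")

# Build nodes (each carrying start/end character indices) from Document objects.
nodes = splitter.get_nodes_from_documents(
    [Document(text="Some long document text ...")]
)
```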
I hope this helps! If you have any other questions or need further clarification, feel free to ask.
Bug Description
output-onlinefiletools.txt

I tried to split the above file using Llama Index Python. In #10554 I mentioned that this whole algorithm is $O(n^2)$; however, in this particular case it looks like the tokenizer causes the stack overflow:

```python
token_size = self._token_size(text)
```

A small issue is that `self._token_size(text)` is called twice: in the lower `if`, you can reuse the `token_size` calculated above, as sketched below.
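A minimal sketch of that reuse; the helper below is illustrative, not the actual `SentenceSplitter` code:

```python
from typing import Callable, List


def split_if_needed(
    text: str,
    chunk_size: int,
    token_size_fn: Callable[[str], int],
) -> List[str]:
    """Tokenize `text` once and reuse the count in every later check."""
    token_size = token_size_fn(text)  # single tokenizer call
    if token_size <= chunk_size:
        return [text]
    # Reuse `token_size` here instead of calling token_size_fn(text) again;
    # this matters when tokenization itself is the expensive step.
    n_parts = -(-token_size // chunk_size)          # ceil division
    step = -(-len(text) // n_parts)                 # characters per part
    return [text[i : i + step] for i in range(0, len(text), step)]
```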
Version
0.9.39
Steps to Reproduce
Relevant Logs/Tracebacks