umarbutler / semchunk

A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
MIT License

Problem with very large files #8

Open do-me opened 6 days ago

do-me commented 6 days ago

I am trying to chunk a huge document but it runs forever. Did I miss something in my code?

File here

import semchunk
import pandas as pd 

df = pd.read_parquet("free_trade_agreement.parquet")
chunk_size = 100 # A low chunk size is used here for demonstration purposes.

text = df.text.item()

# As you can see below, `semchunk.chunkerify` will accept the names of all OpenAI models, OpenAI
# `tiktoken` encodings and Hugging Face models (in that order of precedence), along with custom
# tokenizers that have an `encode()` method (such as `tiktoken`, `transformers` and `tokenizers`
# tokenizers) and finally any function that can take a text and return the number of tokens in it.
chunker = semchunk.chunkerify(lambda text: len(text.split()), chunk_size)

a = chunker(text)

Referencing https://github.com/benbrandt/text-splitter/issues/184 in semantic-text-splitter, where I can now chunk the same document in ~2s.

do-me commented 4 days ago

Update: It took 593 mins on my M3 and created 130324 chunks. 😄

umarbutler commented 4 days ago

Hey @do-me, thanks for creating this issue. I certainly haven't seen anything like this before! I can confirm that it also takes an awfully long time on my PC.

It is surprising that semantic-text-splitter is able to chunk the text in 2 seconds, as my benchmarks have semchunk running 90% faster than semantic-text-splitter. Those benchmarks are calculated over the entire Gutenberg corpus with a chunk size of 512.

My assumption is that some unique characteristics of this particular piece of text make it difficult to chunk quickly, and that semantic-text-splitter has implemented specific heuristics to handle it.

In particular, I note that:

  1. The text is extremely long, at roughly 86,608,271 characters and 13,399,471 words.
  2. The text has no newlines.
  3. The text does not contain many long sequences of whitespace that could be used as split points (a quick way to check these properties is shown after this list).
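For reference, these properties can be checked with plain Python on the `text` variable from your snippet:

# Quick diagnostics on the raw text, using only the standard library.
# `text` is the same variable loaded from the parquet file above.
newlines = text.count("\n")
print(f"characters: {len(text):,}")          # ~86.6 million
print(f"newlines:   {newlines:,}")           # 0, so there are no paragraph breaks to split on
print(f"words:      {len(text.split()):,}")  # ~13.4 million whitespace-delimited words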

Splitting the text using str.split() takes 1.04 seconds, and that is just a single call. semchunk makes repeated calls to the token counter as it merges sequences of tokens back together. We already use some heuristics to avoid unnecessary calls, but it is difficult to add heuristics that do not sacrifice certain user guarantees or reduce performance for other use cases.

At the risk of oversimplifying, the problem as I see it is this: because your text has no newlines and very few sequences of whitespace characters, semchunk ends up splitting at individual spaces (much like text.split()) and then has to merge all of those words back together until they hit the chunk size. Every time semchunk tries to merge words, it must count how many tokens the potential merge would contain to make sure it will not exceed the chunk size, and those repeated calls to the token counter end up taking a very long time (the toy sketch below illustrates the pattern). I have considered multiprocessing, but it is difficult to implement for single inputs (it is already available for multiple inputs).
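To make that concrete, here is a toy sketch of the merge-and-count pattern (not semchunk's actual code; it only illustrates why the token counter gets called so often when the only split points are single spaces):

# Toy sketch of greedy merging with repeated token counting. This is NOT
# semchunk's real implementation; it only shows the access pattern.
def toy_merge(words: list[str], chunk_size: int, count_tokens) -> list[str]:
    chunks = []
    current = ""
    for word in words:
        candidate = f"{current} {word}".strip()
        # Every candidate merge triggers a token count to guarantee the
        # final chunk never exceeds `chunk_size`.
        if count_tokens(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = word  # (oversized single words are ignored here for simplicity)
    if current:
        chunks.append(current)
    return chunks

With ~13.4 million words, even this naive loop calls `count_tokens` once per word, and each call re-tokenises a candidate that can be up to a full chunk long, which is why the total runtime blows up.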

I will investigate potential solutions, but I cannot make any promises on how long it might take, as I need to be careful not to reduce performance for more typical use cases.

For now, my tips to boost performance are:

  1. Don't use lambda functions as token counters.
  2. Provide max_token_chars as an argument to chunkerify, where max_token_chars is the maximum number of characters that can be expected to appear in a token. This can significantly speed up token counting, particularly for very long texts.
  3. Use multiprocessing when chunking multiple texts.
  4. If you can, avoid providing very long texts without any newlines. If you have to, break them up yourself into smaller but still large pieces to be chunked further (a sketch combining these tips follows this list).
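To make the tips concrete, here is a rough sketch combining tips 1, 2 and 4 (the `max_token_chars` value and the piece size are illustrative guesses, and the flattening at the end assumes that passing a list of texts returns one list of chunks per text):

import semchunk
import pandas as pd

df = pd.read_parquet("free_trade_agreement.parquet")
text = df.text.item()
chunk_size = 100

# Tip 1: a named function instead of a lambda as the token counter.
def count_words(text: str) -> int:
    return len(text.split())

# Tip 2: `max_token_chars` caps how many characters semchunk assumes a token
# can span; 50 is a generous guess for whitespace-delimited tokens.
chunker = semchunk.chunkerify(count_words, chunk_size, max_token_chars=50)

# Tip 4: pre-split the newline-free text into large pieces first. The
# 1,000,000-character piece size is arbitrary, and character slicing can cut
# a word in half at a boundary; a more careful pre-split would cut on whitespace.
piece_size = 1_000_000
pieces = [text[i : i + piece_size] for i in range(0, len(text), piece_size)]

# Tip 3: chunking a list of texts is also where semchunk's multiprocessing
# support applies (see the README for the relevant argument).
chunks = [chunk for piece_chunks in chunker(pieces) for chunk in piece_chunks]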
do-me commented 4 days ago

No pressure from my side, thanks very much for investigating and thanks for the hints!

Indeed, the missing newlines are what causes problems for many splitting algorithms, as everything sits on the same "splitting level". It's a flaw in the dataset I am working with that, unfortunately (for the moment), I cannot change.

I guess it would also be reasonable to say that those kinds of files are out of scope. In the end, I cannot think of any real-world documents where this might actually be required...