Open do-me opened 6 days ago
Update: It took 593 mins on my M3 and created 130324 chunks. 😄
Hey @do-me, Thanks for creating this issue. I certainly haven't seen anything like this before! I can confirm that it also takes an awfully long time on my PC.
It is surprising that `semantic-text-splitter` is able to chunk the text in 2 seconds, as my benchmarks have `semchunk` running 90% faster than `semantic-text-splitter`. Those benchmarks are calculated on the entire Gutenberg corpus with a chunk size of 512.
My assumption is that there are some unique characteristics of this particular piece of text that make it difficult to chunk quickly, and that `semantic-text-splitter` has implemented some specific heuristics to make it work.
In particular, I note that:
Splitting the text using `string.split()` takes 1.04 seconds, and that is just a single call. `semchunk` makes repeated calls to token counters as it merges sequences of tokens back together. We already use some heuristics to avoid unnecessary calls, but it is difficult to implement heuristics that do not result in the loss of certain user guarantees or otherwise reduce performance for other use cases.
At the risk of oversimplifying, the problem as I see it is that, because your text has no newlines and very few sequences of whitespace characters, what ends up happening is that `semchunk` needs to split at individual spaces (like doing `text.split()`) and then needs to merge all the words back together until they hit the chunk size. Whenever `semchunk` tries to merge words together, it has to count how many tokens the potential merge would contain to make sure it won't exceed the chunk size. These repeated calls to token counters end up taking a very long time. I have considered multiprocessing, but it is difficult to implement for single inputs (it is already available for multiple inputs).
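The split-then-merge behaviour described above can be sketched roughly as follows. This is an illustrative simplification, not `semchunk`'s actual implementation; `count_tokens` and `merge_words` are hypothetical names, and a real tokenizer-backed counter would be far slower than this stand-in:

```python
def count_tokens(text: str) -> int:
    # Stand-in token counter; a real one (e.g. backed by tiktoken)
    # is much more expensive per call.
    return len(text.split())

def merge_words(words: list[str], chunk_size: int) -> list[str]:
    """Greedily merge words back into chunks of at most chunk_size tokens."""
    chunks: list[str] = []
    current = ""
    for word in words:
        candidate = f"{current} {word}".strip()
        # One token-count call per attempted merge -- this is the hot loop.
        # The candidate string keeps growing, so each call also gets slower,
        # which is why a long single-line text becomes so expensive.
        if count_tokens(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

For a text split into n words, this makes on the order of n token-count calls over ever-longer strings, which is where the time goes when there are no newlines to split on first.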
I will investigate potential solutions but I cannot make any promises on how long it might take as I need to be careful to not reduce performance for more typical use cases.
For now, my tip to boost performance is to pass `max_token_chars` as an argument to `chunkerify`, where `max_token_chars` is the maximum number of characters that can be expected to appear in a token. This can significantly speed up token counting, particularly for very long texts.

No pressure from my side, thanks very much for investigating, and thanks for the hints!
Indeed, the missing newlines are what cause problems for many splitting algorithms, as everything sits on the same "splitting level". It's a flaw in the dataset I am working with that, unfortunately (for the moment), I cannot change.
I guess it would also be reasonable to say those kinds of files are out of scope. In the end, I cannot think of any real-world documents where this might actually be required...
I am trying to chunk a huge document but it runs forever. Did I miss something in my code?
File here
Referencing https://github.com/benbrandt/text-splitter/issues/184 in `semantic-text-splitter`, where I can now chunk the same document in ~2s.
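As a side note on the `max_token_chars` tip from the reply above: the reason capping the token length helps is that a token counter then only needs to see a bounded prefix of the text to decide whether it exceeds the chunk size. A rough sketch of the idea, where `exceeds_chunk_size` is a hypothetical helper for illustration, not `semchunk`'s API:

```python
def exceeds_chunk_size(text, count_tokens, chunk_size, max_token_chars):
    """Decide whether `text` holds more than `chunk_size` tokens.

    If no token ever spans more than `max_token_chars` characters, then
    counting tokens in the first `chunk_size * max_token_chars + 1`
    characters is enough to answer the question, no matter how long the
    full text is.
    """
    head = text[: chunk_size * max_token_chars + 1]
    return count_tokens(head) > chunk_size
```

With a slow token counter and a multi-megabyte single-line text, bounding how much of the string is tokenized on every merge attempt can save a large amount of work, which matches the "particularly for very long texts" caveat above.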