umarbutler / semchunk

A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
MIT License

📈 performance optimisation #3

Closed R0bk closed 4 months ago

R0bk commented 6 months ago

Hey @umarbutler,

You've made a really neat package here - performant and to the point.

Recently I've been working my way through some of the Australian regulatory landscape, and building on top of your codebase has been a big aid. The chunking was still one of the bottlenecks on my machine, so I took a pass at some further performance work.

I was able to find roughly 25% performance gains (larger chunk sizes saw bigger gains). Running a sweep across chunk sizes for:

  1. semchunk 0.2.3
  2. semchunkv2 (this fork)
  3. semantic_text_splitter

we get the results below (running on an M1 Max chip).

| Chunk Size | semchunk | semchunkv2 | Improvement v2 vs v1 (%) | semantic_text_splitter | Improvement v2 vs splitter (%) |
|---|---|---|---|---|---|
| 8 | 13.07s | 9.39s | 28.16% | 928.86s | 98.99% |
| 16 | 10.40s | 7.40s | 28.85% | 732.80s | 98.99% |
| 32 | 7.42s | 5.11s | 31.13% | 460.74s | 98.89% |
| 64 | 7.88s | 5.48s | 30.46% | 293.32s | 98.13% |
| 128 | 7.85s | 5.91s | 24.71% | 176.95s | 96.66% |
| 256 | 8.42s | 6.16s | 26.84% | 104.85s | 94.12% |
| 512 | 10.02s | 6.52s | 34.93% | 66.50s | 90.20% |
| 1024 | 13.82s | 7.22s | 47.76% | 46.81s | 84.58% |
[Screenshot: chart of the benchmark results above, 2024-04-03]

This also stretches the lead over semantic_text_splitter by another 10% or so. It's worth noting that semantic_text_splitter seems to perform even worse on my machine than in your benchmarks, though; I'm not sure whether that's due to recent changes in their library, x86 vs ARM, or just noise.

I believe there's still room for further non-trivial gains. Profiling suggested that the token_counter (using tiktoken as the example) took up about 80% of cycles, so if we can reduce the number of calls or the length of the average input, double-digit gains are probably possible.
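For context, the counter being profiled is essentially a thin wrapper around tiktoken, along the lines of the sketch below (the gpt-4 encoding and the semchunk.chunk call shown are illustrative assumptions rather than the exact benchmark setup):

```python
# Illustrative sketch only: a tiktoken-backed token counter of the kind being
# profiled here. Nearly every cycle ends up inside encode().
import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

def token_counter(text: str) -> int:
    # Each call crosses into the tokeniser, so fewer calls and shorter inputs
    # translate directly into faster chunking.
    return len(encoder.encode(text))

chunks = semchunk.chunk('The quick brown fox jumps over the lazy dog.',
                        chunk_size=4, token_counter=token_counter)
```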

Please feel free to make any comments or changes - and thanks for making the package!

umarbutler commented 5 months ago

Hey @R0bk, thank you very much for submitting this PR. I'll review it and make some changes, and hopefully we can get it merged soon :)

umarbutler commented 5 months ago

Hey @R0bk, thanks again for sharing this PR. Unfortunately, I don't think I'll be able to merge the changes at this moment in time. The count function does indeed speed chunking up considerably, particularly for large chunk sizes, as your benchmarks show; however, it can also lead to inaccurate token counts, which could have unexpected downstream effects for end users. The challenge is that, although the longest token in the GPT-4 tokeniser is 42 characters, the longest token may be much longer in other tokenisers. I checked the tokeniser of a custom private model I am working with, for instance, and found its longest token to be over 200 characters long.

For those users who do want an extra speed bump, it could be worth simply using a token counter that implements the heuristics you have developed. You could even use a wrapper function around arbitrary token counters. I will leave that decision to users, as I want to ensure semchunk stays robust across a wide variety of use cases.
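As a rough illustration of what such a wrapper might look like (the 42-character default is GPT-4-specific, and both the name and the shortcut are just a sketch of the heuristic, not anything shipped with semchunk):

```python
# Illustrative sketch only: wrap an arbitrary (expensive) token counter so a
# cheap character-based lower bound can short-circuit it whenever that bound
# is already decisive. The trade-off is exactly the one described above: when
# the shortcut fires, the returned count is a bound rather than an exact value.
from typing import Callable

def bounded_counter(count: Callable[[str], int],
                    chunk_size: int,
                    longest_token_chars: int = 42,  # GPT-4-specific assumption
                    ) -> Callable[[str], int]:
    def counter(text: str) -> int:
        # No token is longer than longest_token_chars characters, so the text
        # must contain at least this many tokens.
        lower_bound = -(-len(text) // longest_token_chars)  # ceiling division
        if lower_bound > chunk_size:
            return lower_bound  # definitely too big; skip the real tokeniser
        return count(text)
    return counter
```

A tokeniser with longer tokens, like the 200-character example above, simply needs a larger longest_token_chars, at the cost of the shortcut firing less often.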

One speed enhancement I would be able to merge is replacing recursive calls to semchunk.semchunk with a stack. I had a crack at that when I first wrote semchunk but couldn't get it to work. It is theoretically possible, however, and it would offer some performance gains.
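For reference, the shape of that transformation looks something like the generic sketch below; this is not semchunk's actual splitting logic, just an illustration of recursion swapped for an explicit work stack:

```python
# Generic illustration only (not semchunk's real algorithm): recursive
# splitting replaced with an explicit stack of pending pieces.
from typing import Callable

def chunk_with_stack(text: str,
                     fits: Callable[[str], bool],        # e.g. token count <= chunk_size
                     split: Callable[[str], list[str]],  # break a piece into smaller pieces
                     ) -> list[str]:
    chunks: list[str] = []
    stack = [text]  # pieces still waiting to be processed
    while stack:
        piece = stack.pop()
        if fits(piece):
            chunks.append(piece)
        else:
            # Push sub-pieces in reverse so they are popped in document order.
            stack.extend(reversed(split(piece)))
    return chunks
```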

Two other suggestions I can give for speeding up chunking are to utilise multiprocessing if you aren't already (I've successfully used it to chunk corpora that would otherwise take hours in a matter of minutes), and to use my other Python library, persist-cache, if you find yourself chunking the same content repeatedly (note: it is better to decorate your token counter with persist-cache rather than semchunk itself, since the token counter is what takes up the most time, and the cache will still save time when using different chunk sizes).
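A minimal sketch of how those two suggestions fit together (treat the decorator and the semchunk.chunk signature shown here as illustrative and check them against the current releases):

```python
# Illustrative sketch: cache token counts on disk with persist-cache and fan
# chunking out across CPU cores with multiprocessing.
import multiprocessing

import persist_cache
import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

@persist_cache.cache  # counts for previously seen text are read back from disk
def token_counter(text: str) -> int:
    return len(encoder.encode(text))

def chunk_document(text: str) -> list[str]:
    return semchunk.chunk(text, chunk_size=512, token_counter=token_counter)

if __name__ == '__main__':
    documents = ['first document...', 'second document...']  # your corpus here
    with multiprocessing.Pool() as pool:
        chunked = pool.map(chunk_document, documents)
```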

R0bk commented 5 months ago

Hey @umarbutler

Those are some good points. The first thing I actually went for was an adjustment to be stack based, but unfortunately, once I finished and verified that the results were equal, I found it had the same performance as the recursive version, if not slightly slower. The code was neater; however, the current code gains an advantage because it always splits first and then runs the tokeniser. Making it stack based (at least with my idea of what neat code looks like), I had to run the tokeniser first, which added an extra pass across the entire input string and hence slowed it down.

After profiling I saw that all the time is spent either in the regex (~20%) or in the actual library call to the tokeniser (~75%), so the benefit of going stack based seemed pretty limited.

Also, on the heuristic I added: I must admit it was a little hacky, even if performant, and you make a good point about keeping this project independent of any one tokeniser/vocab. Your post did inspire me to take another look, though, at what's possible while preserving tokeniser/vocab independence. I've been able to find some reasonably big gains just from adjusting the binary search to make a simple guess about where to look next.
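To make the idea concrete, here's a simplified illustration of that guessing strategy (not the code in the fork itself): find the largest prefix of splits that fits within chunk_size by probing where the observed token density predicts the boundary to be, rather than always bisecting.

```python
# Simplified illustration (not the fork's actual code): a fit-finding search
# whose next probe is guessed from the token density observed so far, instead
# of always probing the midpoint.
from typing import Callable

def largest_fitting_prefix(splits: list[str],
                           chunk_size: int,
                           count: Callable[[str], int],
                           ) -> int:
    """Return the largest k such that ''.join(splits[:k]) fits in chunk_size tokens."""
    lo, hi = 0, len(splits)  # splits[:lo] is known to fit; the answer lies in [lo, hi]
    guess = hi               # first probe: optimistically try everything
    while lo < hi:
        probe = min(max(guess, lo + 1), hi)  # keep the probe inside (lo, hi]
        tokens = count(''.join(splits[:probe]))
        if tokens <= chunk_size:
            lo = probe
        else:
            hi = probe - 1
        # Use the observed splits-per-token ratio to guess where the
        # chunk_size boundary falls for the next probe.
        guess = int(probe * chunk_size / max(tokens, 1))
    return lo
```

In the worst case the guesses degenerate into stepping one split at a time, so a more careful version would fall back to plain bisection when they converge slowly; even so, the happy path needs noticeably fewer tokeniser calls than midpoint probing.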

The difference is larger at bigger batch sizes but there is a real double digit impact even at 512. I've put some benchmark times below and adjusted the code to also check for correctness between this build and the original to ensure there are no gaps.

| batch size | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| semchunk | 13.90s | 11.14s | 8.30s | 8.30s | 8.76s | 9.49s | 11.08s | 15.88s | 25.25s | 37.01s | 56.41s |
| semchunkv2 | 13.10s | 11.49s | 9.42s | 8.47s | 7.63s | 7.24s | 6.74s | 6.62s | 6.67s | 6.03s | 5.64s |
[Screenshot: chart of the benchmark results above, 2024-04-22]

You can check out the new commit here. I've been running it through some pretty large corpora at 8192 and it's been performing well so far! (I've also been taking advantage of the diskcache library, but yours looks interesting - it certainly looks easier to manage. Do you know if it is faster too?)

umarbutler commented 4 months ago

Hey Rob, sorry it's taken me so long to get back to you! I just needed to make some time to review your PR. It all looks good to me :) I'll be merging it shortly; I might just see if there are other performance improvements I can add to semchunk before releasing a new version.

RE diskcache, I'm not too sure - I haven't benchmarked it, though I might when I have some spare time. persist-cache is unique in that it essentially uses the file system as a database, which means the biggest limitation on finding and saving keys is the OS and the disk. Serialisation and hashing are super fast thanks to a custom fusion of msgpack and pickle (with msgspec used for speedy msgpack serialisation) and xxhash. The only downside is that if you are storing millions of keys, it can take a really, really long time to delete the cache (since each key gets its own file on disk).

umarbutler commented 4 months ago

The changes have just been merged into the latest release of semchunk! Apologies again for my delay in getting to this.

I ran my profiler and it looks like most of the time is being spent in the token counter and tokeniser, followed by a bit of time in itertools.accumulate. I tried seeing whether I could create a faster implementation or take advantage of array.array, but no dice.

On further thought, I think your heuristics could be worked into semchunk by adding a make_chunker convenience function that turns a chunk size and a tiktoken or tokenizers tokenizer into a customised semchunk function. If the provided tokenizer is one whose vocabulary we can access, we can then cache the number of characters in the longest token in that vocabulary to ensure there are never any edge cases :)
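A very rough sketch of what that factory could look like; the name make_chunker, the tiktoken calls, and the character-length shortcut are all proposals rather than anything implemented yet:

```python
# Proposal sketch only: nothing here is implemented in semchunk yet.
import functools

import semchunk
import tiktoken

def make_chunker(chunk_size: int, encoding_name: str = 'cl100k_base'):
    encoder = tiktoken.get_encoding(encoding_name)

    # With access to the vocabulary, the length of the longest token can be
    # cached once. Byte length is used here as a safe upper bound on the
    # character length, so the shortcut below can never misfire.
    longest_token_chars = max(len(token) for token in encoder.token_byte_values())

    def token_counter(text: str) -> int:
        # Cheap lower bound first; only fall back to the tokeniser when the
        # bound is not already decisive.
        lower_bound = -(-len(text) // longest_token_chars)
        if lower_bound > chunk_size:
            return lower_bound
        return len(encoder.encode(text))

    return functools.partial(semchunk.chunk,
                             chunk_size=chunk_size,
                             token_counter=token_counter)
```

Usage would then be as simple as chunker = make_chunker(512) followed by chunker(text).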

That will be my next task for semchunk. If you have any other ideas for micro-optimisations, please let me know.

I use semchunk very often for my work so even tiny optimisations can end up saving a lot of time.