umarbutler / semchunk

A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
MIT License

📈 performance optimisation #3

Closed R0bk closed 4 months ago

R0bk commented 6 months ago

Hey @umarbutler,

You've made a really neat package here - performant and to the point.

Recently I've been working my way through some of the Australian regulatory landscape, and building on top of your codebase has been a big aid. The chunking was still one of the bottlenecks on my machine, so I took a pass at some further performance work.

I was able to find roughly 25% performance gains (larger chunk sizes saw bigger gains). Running a sweep across chunk sizes for:

  1. semchunk 0.2.3
  2. semchunkv2 (this fork)
  3. semantic_text_splitter

we get the results below (running on an M1 Max chip).

| Chunk Size | semchunk | semchunkv2 | Improvement v2 vs v1 (%) | semantic_text_splitter | Improvement v2 vs splitter (%) |
|---|---|---|---|---|---|
| 8 | 13.07s | 9.39s | 28.16% | 928.86s | 98.99% |
| 16 | 10.40s | 7.40s | 28.85% | 732.80s | 98.99% |
| 32 | 7.42s | 5.11s | 31.13% | 460.74s | 98.89% |
| 64 | 7.88s | 5.48s | 30.46% | 293.32s | 98.13% |
| 128 | 7.85s | 5.91s | 24.71% | 176.95s | 96.66% |
| 256 | 8.42s | 6.16s | 26.84% | 104.85s | 94.12% |
| 512 | 10.02s | 6.52s | 34.93% | 66.50s | 90.20% |
| 1024 | 13.82s | 7.22s | 47.76% | 46.81s | 84.58% |
[Screenshot: chart of the benchmark results above, 2024-04-03]

This also stretches the lead over semantic_text_splitter by another 10% or so. It's worth noting that semantic_text_splitter seems to perform even worse on my machine than in your benchmarks, though; I'm not sure whether that's due to recent changes in their library, x86 vs ARM, or just noise.

I believe there's still room for further non-trivial gains. Profiling suggested that the token_counter (using tiktoken as the example) took up about 80% of cycles, so if we can reduce the number of calls or the length of the average input, double-digit gains are probably possible.
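For context, the counter being profiled is essentially a thin wrapper around tiktoken, along the lines of the sketch below (the gpt-4 encoding and the semchunk.chunk call shown are illustrative assumptions rather than the exact benchmark setup):

```python
# Illustrative sketch only: a tiktoken-backed token counter of the kind being
# profiled here. Nearly every cycle ends up inside encode().
import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

def token_counter(text: str) -> int:
    # Each call crosses into the tokeniser, so fewer calls and shorter inputs
    # translate directly into faster chunking.
    return len(encoder.encode(text))

chunks = semchunk.chunk('The quick brown fox jumps over the lazy dog.',
                        chunk_size=4, token_counter=token_counter)
```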

Please feel free to make any comments or changes - and thanks for making the package!

umarbutler commented 5 months ago

Hey @R0bk, thank you very much for submitting this PR. I'll review it and make some changes, and hopefully we can get it merged soon :)

umarbutler commented 5 months ago

Hey @R0bk, thanks again for sharing this PR. Unfortunately, I don't think I'll be able to merge the changes at this moment in time. The count function does indeed speed chunking up considerably, particularly for large chunk sizes, as your benchmarks show; however, it can also lead to inaccurate token counts, which could have unexpected downstream effects for end users. The challenge is that, although the longest token in the GPT-4 tokeniser is 42 characters, the longest token may be much longer in other tokenisers. I checked the tokeniser of a custom private model I am working with, for instance, and found its longest token to be over 200 characters long.

For those users who do want an extra speed bump, it could be worth simply using a token counter that implements the heuristics you have developed. You could even use a wrapper function around arbitrary token counters. I will leave that decision to users, as I want to ensure semchunk stays robust across a wide variety of use cases.
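As a rough illustration of what such a wrapper might look like (the 42-character default is GPT-4-specific, and both the name and the shortcut are just a sketch of the heuristic, not anything shipped with semchunk):

```python
# Illustrative sketch only: wrap an arbitrary (expensive) token counter so a
# cheap character-based lower bound can short-circuit it whenever that bound
# is already decisive. The trade-off is exactly the one described above: when
# the shortcut fires, the returned count is a bound rather than an exact value.
from typing import Callable

def bounded_counter(count: Callable[[str], int],
                    chunk_size: int,
                    longest_token_chars: int = 42,  # GPT-4-specific assumption
                    ) -> Callable[[str], int]:
    def counter(text: str) -> int:
        # No token is longer than longest_token_chars characters, so the text
        # must contain at least this many tokens.
        lower_bound = -(-len(text) // longest_token_chars)  # ceiling division
        if lower_bound > chunk_size:
            return lower_bound  # definitely too big; skip the real tokeniser
        return count(text)
    return counter
```

A tokeniser with longer tokens, like the 200-character example above, simply needs a larger longest_token_chars, at the cost of the shortcut firing less often.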

One speed enhancement I would be able to merge is replacing recursive calls to semchunk.semchunk with a stack. I had a crack at that when I first wrote semchunk but couldn't get it to work. It is theoretically possible, however, and it would offer some performance gains.
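For reference, the shape of that transformation looks something like the generic sketch below; this is not semchunk's actual splitting logic, just an illustration of recursion swapped for an explicit work stack:

```python
# Generic illustration only (not semchunk's real algorithm): recursive
# splitting replaced with an explicit stack of pending pieces.
from typing import Callable

def chunk_with_stack(text: str,
                     fits: Callable[[str], bool],        # e.g. token count <= chunk_size
                     split: Callable[[str], list[str]],  # break a piece into smaller pieces
                     ) -> list[str]:
    chunks: list[str] = []
    stack = [text]  # pieces still waiting to be processed
    while stack:
        piece = stack.pop()
        if fits(piece):
            chunks.append(piece)
        else:
            # Push sub-pieces in reverse so they are popped in document order.
            stack.extend(reversed(split(piece)))
    return chunks
```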

Two other suggestions I can give for speeding up chunking are to utilise multiprocessing if you aren't already (I've successfully used it to chunk corpora that would otherwise take hours in a matter of minutes), and to use my other Python library, persist-cache, if you find yourself chunking the same content repeatedly (note: it is better to decorate your token counter with persist-cache rather than semchunk itself, since the token counter is what takes up the most time, and the cache will still save time when using different chunk sizes).
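A minimal sketch of how those two suggestions fit together (treat the decorator and the semchunk.chunk signature shown here as illustrative and check them against the current releases):

```python
# Illustrative sketch: cache token counts on disk with persist-cache and fan
# chunking out across CPU cores with multiprocessing.
import multiprocessing

import persist_cache
import semchunk
import tiktoken

encoder = tiktoken.encoding_for_model('gpt-4')

@persist_cache.cache  # counts for previously seen text are read back from disk
def token_counter(text: str) -> int:
    return len(encoder.encode(text))

def chunk_document(text: str) -> list[str]:
    return semchunk.chunk(text, chunk_size=512, token_counter=token_counter)

if __name__ == '__main__':
    documents = ['first document...', 'second document...']  # your corpus here
    with multiprocessing.Pool() as pool:
        chunked = pool.map(chunk_document, documents)
```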

R0bk commented 5 months ago

Hey @umarbutler

Those are some good points. The first thing I actually went for was an adjustment to be stack based, but unfortunately, once I finished and verified that the results were equal, I found it had the same performance as the recursive version, if not slightly slower. The code was neater; however, the current code gains an advantage because it always splits first and then runs the tokeniser. Making it stack based (at least with my idea of what neat code looks like), I had to run the tokeniser first, which added an extra pass across the entire input string and hence slowed it down.

After profiling I saw that all the time is spent either in the regex (~20%) or in the actual library call to the tokeniser (~75%), so the benefit of going stack based seemed pretty limited.

Also, on the heuristic I added: I must admit it was a little hacky, even if performant, and you make a good point about keeping this project independent of any one tokeniser/vocab. Your post did inspire me to take another look, though, at what's possible while preserving tokeniser/vocab independence. I've been able to find some reasonably big gains just from adjusting the binary search to make a simple guess about where to look next.
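To make the idea concrete, here's a simplified illustration of that guessing strategy (not the code in the fork itself): find the largest prefix of splits that fits within chunk_size by probing where the observed token density predicts the boundary to be, rather than always bisecting.

```python
# Simplified illustration (not the fork's actual code): a fit-finding search
# whose next probe is guessed from the token density observed so far, instead
# of always probing the midpoint.
from typing import Callable

def largest_fitting_prefix(splits: list[str],
                           chunk_size: int,
                           count: Callable[[str], int],
                           ) -> int:
    """Return the largest k such that ''.join(splits[:k]) fits in chunk_size tokens."""
    lo, hi = 0, len(splits)  # splits[:lo] is known to fit; the answer lies in [lo, hi]
    guess = hi               # first probe: optimistically try everything
    while lo < hi:
        probe = min(max(guess, lo + 1), hi)  # keep the probe inside (lo, hi]
        tokens = count(''.join(splits[:probe]))
        if tokens <= chunk_size:
            lo = probe
        else:
            hi = probe - 1
        # Use the observed splits-per-token ratio to guess where the
        # chunk_size boundary falls for the next probe.
        guess = int(probe * chunk_size / max(tokens, 1))
    return lo
```

In the worst case the guesses degenerate into stepping one split at a time, so a more careful version would fall back to plain bisection when they converge slowly; even so, the happy path needs noticeably fewer tokeniser calls than midpoint probing.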

The difference is larger at bigger batch sizes but there is a real double digit impact even at 512. I've put some benchmark times below and adjusted the code to also check for correctness between this build and the original to ensure there are no gaps.

| batch size | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| semchunk | 13.90s | 11.14s | 8.30s | 8.30s | 8.76s | 9.49s | 11.08s | 15.88s | 25.25s | 37.01s | 56.41s |
| semchunkv2 | 13.10s | 11.49s | 9.42s | 8.47s | 7.63s | 7.24s | 6.74s | 6.62s | 6.67s | 6.03s | 5.64s |
[Screenshot: chart of the benchmark results above, 2024-04-22]

You can check out the new commit here. I've been running it through some pretty large corpora at 8192 and it's been performing well so far! (I've also been taking advantage of the diskcache library, but yours looks interesting - it certainly looks easier to manage. Do you know if it is faster too?)

umarbutler commented 4 months ago

Hey Rob, sorry it's taken me so long to get back to you! I just needed to make some time to review your PR. It all looks good to me :) I'll be merging it shortly; I might just see if there are other performance improvements I can add to semchunk before releasing a new version.

RE diskcache, I'm not too sure - I haven't benchmarked it, though I might when I have some spare time. persist-cache is unique in that it essentially uses the file system as a database, which means the biggest limitation on finding and saving keys is the OS and the disk. Serialisation and hashing are super fast thanks to a custom fusion of msgpack and pickle (with msgspec used for speedy msgpack serialisation) and xxhash. The only downside is that if you are storing millions of keys, it can take a really, really long time to delete the cache (since each key gets its own file on disk).

umarbutler commented 4 months ago

The changes have just been merged into the latest release of semchunk! Apologies again for my delay in getting to this.

I ran my profiler and it looks like most of the time is being spent in the token counter and tokeniser, followed by a bit of time in itertools.accumulate. I tried seeing whether I could create a faster implementation or take advantage of array.array, but no dice.

On further thought, I think your heuristics could be worked into semchunk by adding a make_chunker convenience function that turns a chunk size and a tiktoken or tokenizers tokenizer into a customised semchunk function. If the provided tokenizer is one whose vocabulary we can access, we can then cache the number of characters in the longest token in that vocabulary to ensure there are never any edge cases :)
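A very rough sketch of what that factory could look like; the name make_chunker, the tiktoken calls, and the character-length shortcut are all proposals rather than anything implemented yet:

```python
# Proposal sketch only: nothing here is implemented in semchunk yet.
import functools

import semchunk
import tiktoken

def make_chunker(chunk_size: int, encoding_name: str = 'cl100k_base'):
    encoder = tiktoken.get_encoding(encoding_name)

    # With access to the vocabulary, the length of the longest token can be
    # cached once. Byte length is used here as a safe upper bound on the
    # character length, so the shortcut below can never misfire.
    longest_token_chars = max(len(token) for token in encoder.token_byte_values())

    def token_counter(text: str) -> int:
        # Cheap lower bound first; only fall back to the tokeniser when the
        # bound is not already decisive.
        lower_bound = -(-len(text) // longest_token_chars)
        if lower_bound > chunk_size:
            return lower_bound
        return len(encoder.encode(text))

    return functools.partial(semchunk.chunk,
                             chunk_size=chunk_size,
                             token_counter=token_counter)
```

Usage would then be as simple as chunker = make_chunker(512) followed by chunker(text).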

That will be my next task for semchunk. If you have any other ideas for micro-optimisations, please let me know.

I use semchunk very often for my work so even tiny optimisations can end up saving a lot of time.