Open l0rinc opened 10 months ago
This is great!
Thanks for keeping a simple-to-follow history. Most of the commits here are straightforward; I've separated them into different PRs (I've preserved authorship information, but let me know if you'd prefer to re-open them yourself).
I'm looking at 8f5dd7d and d24b67b now...
Thanks a lot for the thorough review, Shantanu. Let me know if you need any help in speeding up the process! :)
After merging you may want to update the benchmark results in the readme.
@hauntsaninja, I've rebased this PR, removing the merged commits and adjusting the results a bit based on your previous preferences. Hope it helps. Feel free to push on top of this PR or open a different one, whichever's easier.
Hi! Any updates on this PR? It'd be great to have this 🙏 @hauntsaninja
Continuing the optimizations started in https://github.com/openai/tiktoken/pull/237 and https://github.com/openai/tiktoken/pull/234, migrated from https://github.com/knuddelsgmbh/jtokkit/pull/75, https://github.com/knuddelsgmbh/jtokkit/pull/76, https://github.com/knuddelsgmbh/jtokkit/pull/77.
This commit is mainly meant to address the issue of really big tokens spiraling out of control, see: https://github.com/openai/tiktoken/issues/195
The original byte pair merge algorithm slows down superlinearly for longer character sequences - e.g. a 20_000-character word (~2500 tokens) can take several seconds to tokenize.
Or on https://platform.openai.com/tokenizer:
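For intuition, the slow path boils down to something like the following simplified sketch (illustrative only, not tiktoken's actual implementation - the function name and signature are mine): every merge rescans all remaining parts for the lowest-rank adjacent pair, so a piece that needs ~n merges performs ~n scans over ~n parts.

```rust
use std::collections::HashMap;

// Simplified sketch of the quadratic behaviour (not tiktoken's exact code):
// each merge step does a full linear scan for the lowest-rank adjacent pair,
// so a piece needing ~n merges costs O(n^2) work overall - which is why
// 20k-character pieces take seconds.
fn byte_pair_merge_naive(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<(usize, usize)> {
    // Start with one part per byte, stored as (start, end) offsets into `piece`.
    let mut parts: Vec<(usize, usize)> = (0..piece.len()).map(|i| (i, i + 1)).collect();
    while parts.len() > 1 {
        // O(n) scan for the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, u32)> = None; // (index, rank)
        for i in 0..parts.len() - 1 {
            if let Some(&rank) = ranks.get(&piece[parts[i].0..parts[i + 1].1]) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            Some((i, _)) => {
                parts[i].1 = parts[i + 1].1; // grow the left part over the right one
                parts.remove(i + 1);         // O(n) shift on top of the scan
            }
            None => break, // nothing mergeable remains
        }
    }
    parts
}
```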
The new algorithm scales so well that it could theoretically process the whole text in a single byte-pair-merge loop without any regex splitting. It would need a different token set to be optimal, though, since it produces slightly different results (mostly around whitespace) - and it also consumes a lot more memory and is slower than the current one:
The new algorithm does the minimum-rank search in logarithmic time and handles duplicates in constant time, but has a higher setup cost, so we're only using it for extreme cases (when the piece produced by the regex splitting is longer than 500 characters).
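Roughly, the large-piece strategy can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function name, the linked-list bookkeeping, and the choice of a `BTreeSet` keyed by `(rank, start)` are my assumptions, but they show how the minimum-rank lookup becomes O(log n) while each merge only touches a constant number of neighbouring pairs.

```rust
use std::collections::{BTreeSet, HashMap};

// Illustrative sketch of a logarithmic merge loop (my reconstruction, not the
// PR's code). Parts form a linked list over byte offsets, and every mergeable
// adjacent pair is kept in an ordered set keyed by (rank, start), so the
// cheapest merge is always the first element: O(log n) to find and remove,
// instead of the O(n) rescan of the simple loop.
fn byte_pair_merge_large(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<(usize, usize)> {
    let n = piece.len();
    // next[i] / prev[i]: start offset of the neighbouring part; >= n means none.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect();
    // Rank of the pair formed by the part starting at `i` and its right neighbour.
    let pair_rank = |i: usize, next: &[usize]| -> Option<u32> {
        let j = next[i];
        if j >= n {
            return None;
        }
        ranks.get(&piece[i..next[j]]).copied()
    };
    // Seed the queue with every mergeable adjacent byte pair.
    let mut queue: BTreeSet<(u32, usize)> = (0..n.saturating_sub(1))
        .filter_map(|i| ranks.get(&piece[i..i + 2]).map(|&r| (r, i)))
        .collect();
    while let Some((_, i)) = queue.pop_first() {
        let j = next[i];
        let k = next[j];
        // Remove the two neighbouring pairs that this merge invalidates.
        if prev[i] < n {
            if let Some(r) = pair_rank(prev[i], &next) {
                queue.remove(&(r, prev[i]));
            }
        }
        if let Some(r) = pair_rank(j, &next) {
            queue.remove(&(r, j));
        }
        // Merge the part at `j` into the part at `i`.
        next[i] = k;
        if k < n {
            prev[k] = i;
        }
        // Re-insert the (constant number of) affected pairs with fresh ranks.
        if prev[i] < n {
            if let Some(r) = pair_rank(prev[i], &next) {
                queue.insert((r, prev[i]));
            }
        }
        if let Some(r) = pair_rank(i, &next) {
            queue.insert((r, i));
        }
    }
    // Walk the linked list to emit the surviving (start, end) parts.
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        out.push((i, next[i]));
        i = next[i];
    }
    out
}
```

Below the 500-character threshold, the cost of building and maintaining the ordered set outweighs the savings, which is presumably why the simple linear-scan loop is kept for the common case.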
The benchmarking was done step-by-step in the Java clone and retested here in the way described in https://github.com/openai/tiktoken/pull/234.
110 multilingual books + some source code + some big-token files:
Before:
After:
From those, only the big-token files (the main target of this PR):
Before:
After:
i.e. **40x** faster for 20k-character words.

And if we combine this with the previous regex optimizations, we get the following for the 110 books + sources + big tokens case:
i.e. 50% faster on average after all optimizations.
I recommend reviewing commit-by-commit: