Open l0rinc opened 10 months ago
This is great!
Thanks for keeping a simple-to-follow history. Most of the commits here are straightforward; I've separated them into different PRs (I've preserved authorship information, but let me know if you'd prefer to re-open them yourself).
I'm looking at 8f5dd7d and d24b67b now...
Thanks a lot for the thorough review, Shantanu. Let me know if you need any help in speeding up the process! :)
After merging you may want to update the benchmark results in the readme.
@hauntsaninja, I've rebased this PR, removing the merged commits and adjusting the results a bit based on your previous preferences. Hope it helps. Feel free to push on top of this PR or open a different one, whichever's easier.
Hi! Any updates on this PR? It'd be great to have this 🙏 @hauntsaninja
Continuing the optimizations started in https://github.com/openai/tiktoken/pull/237 and https://github.com/openai/tiktoken/pull/234, migrated from https://github.com/knuddelsgmbh/jtokkit/pull/75, https://github.com/knuddelsgmbh/jtokkit/pull/76, https://github.com/knuddelsgmbh/jtokkit/pull/77.
This commit is mainly meant to address the issue of really big tokens spiraling out of control, see: https://github.com/openai/tiktoken/issues/195
The original byte pair merge algorithm slows down superlinearly for longer character sequences - e.g. a 20_000-character word (~2500 tokens) can take several seconds to tokenize.
Or on https://platform.openai.com/tokenizer:
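For intuition, the slow path boils down to something like the following simplified sketch (illustrative only, not tiktoken's actual implementation - the function name and signature are mine): every merge rescans all remaining parts for the lowest-rank adjacent pair, so a piece that needs ~n merges performs ~n scans over ~n parts.

```rust
use std::collections::HashMap;

// Simplified sketch of the quadratic behaviour (not tiktoken's exact code):
// each merge step does a full linear scan for the lowest-rank adjacent pair,
// so a piece needing ~n merges costs O(n^2) work overall - which is why
// 20k-character pieces take seconds.
fn byte_pair_merge_naive(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<(usize, usize)> {
    // Start with one part per byte, stored as (start, end) offsets into `piece`.
    let mut parts: Vec<(usize, usize)> = (0..piece.len()).map(|i| (i, i + 1)).collect();
    while parts.len() > 1 {
        // O(n) scan for the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, u32)> = None; // (index, rank)
        for i in 0..parts.len() - 1 {
            if let Some(&rank) = ranks.get(&piece[parts[i].0..parts[i + 1].1]) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            Some((i, _)) => {
                parts[i].1 = parts[i + 1].1; // grow the left part over the right one
                parts.remove(i + 1);         // O(n) shift on top of the scan
            }
            None => break, // nothing mergeable remains
        }
    }
    parts
}
```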
The new algorithm scales so well that it could theoretically process the whole text in a single byte-pair-merge loop without any regex splitting. It would need a different token set to be optimal, though, since it produces slightly different results (mostly around whitespace) - and it also consumes a lot more memory and is slower than the current one:
The new algorithm does the minimum-rank search in logarithmic time and handles duplicates in constant time, but has a higher setup cost, so we're only using it for extreme cases (when the piece produced by the regex splitting is longer than 500 characters).
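Roughly, the large-piece strategy can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: the function name, the linked-list bookkeeping, and the choice of a `BTreeSet` keyed by `(rank, start)` are my assumptions, but they show how the minimum-rank lookup becomes O(log n) while each merge only touches a constant number of neighbouring pairs.

```rust
use std::collections::{BTreeSet, HashMap};

// Illustrative sketch of a logarithmic merge loop (my reconstruction, not the
// PR's code). Parts form a linked list over byte offsets, and every mergeable
// adjacent pair is kept in an ordered set keyed by (rank, start), so the
// cheapest merge is always the first element: O(log n) to find and remove,
// instead of the O(n) rescan of the simple loop.
fn byte_pair_merge_large(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<(usize, usize)> {
    let n = piece.len();
    // next[i] / prev[i]: start offset of the neighbouring part; >= n means none.
    let mut next: Vec<usize> = (1..=n).collect();
    let mut prev: Vec<usize> = (0..n).map(|i| i.wrapping_sub(1)).collect();
    // Rank of the pair formed by the part starting at `i` and its right neighbour.
    let pair_rank = |i: usize, next: &[usize]| -> Option<u32> {
        let j = next[i];
        if j >= n {
            return None;
        }
        ranks.get(&piece[i..next[j]]).copied()
    };
    // Seed the queue with every mergeable adjacent byte pair.
    let mut queue: BTreeSet<(u32, usize)> = (0..n.saturating_sub(1))
        .filter_map(|i| ranks.get(&piece[i..i + 2]).map(|&r| (r, i)))
        .collect();
    while let Some((_, i)) = queue.pop_first() {
        let j = next[i];
        let k = next[j];
        // Remove the two neighbouring pairs that this merge invalidates.
        if prev[i] < n {
            if let Some(r) = pair_rank(prev[i], &next) {
                queue.remove(&(r, prev[i]));
            }
        }
        if let Some(r) = pair_rank(j, &next) {
            queue.remove(&(r, j));
        }
        // Merge the part at `j` into the part at `i`.
        next[i] = k;
        if k < n {
            prev[k] = i;
        }
        // Re-insert the (constant number of) affected pairs with fresh ranks.
        if prev[i] < n {
            if let Some(r) = pair_rank(prev[i], &next) {
                queue.insert((r, prev[i]));
            }
        }
        if let Some(r) = pair_rank(i, &next) {
            queue.insert((r, i));
        }
    }
    // Walk the linked list to emit the surviving (start, end) parts.
    let mut out = Vec::new();
    let mut i = 0;
    while i < n {
        out.push((i, next[i]));
        i = next[i];
    }
    out
}
```

Below the 500-character threshold, the cost of building and maintaining the ordered set outweighs the savings, which is presumably why the simple linear-scan loop is kept for the common case.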
The benchmarking was done step-by-step in the Java clone and retested here in the way described in https://github.com/openai/tiktoken/pull/234.
110 multilingual books + some source code + some big-token files:
Before:
After:
From those, only the big-token files (the main target of this PR):
Before:
After:
i.e. **40x** faster for 20k-character words.

And if we combine this with the previous regex optimizations, we get the following for the 110 books + sources + big tokens case:
i.e. 50% faster on average after all optimizations.
I recommend reviewing commit-by-commit: