openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Optimal byte_pair_encode(), 6% faster, 0.6% better COMPRESSION #316

Closed · Majdoddin closed this 5 days ago

Majdoddin commented 5 days ago

This PR reimplements byte_pair_encode() in src/lib.rs, improving the compression of large texts (that is, the number of tokens) by 0.6%. Runtime is the same or slightly better.

This matters because any improvement in compression translates directly into LLM throughput.

With a fixed vocabulary, there are often many ways to encode a text: for example, if "a", "aa", and "ab" are all tokens, then "aab" can be split as ["a", "ab"] or ["aa", "b"]. The current code is not optimal in the size of the tokenization (the number of tokens).

This PR finds a tokenization that is guaranteed to use the minimal number of tokens, via a dynamic programming algorithm; a sketch of the idea appears below. For an overview, see the 1st commit. The 2nd commit optimizes it for runtime.
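For illustration, here is a minimal sketch of such a dynamic program (not the PR's actual code; the function name min_token_encode, the ranks map, and the max_token_len bound are my assumptions). dp[i] holds the minimum number of tokens needed to encode the first i bytes, and backtrack[i] records where the last token of an optimal encoding starts:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: encode `piece` with the fewest tokens possible.
/// `ranks` maps token byte sequences to ids; `max_token_len` bounds the
/// length of any single token. Assumes every single byte is itself a
/// token, so a tokenization always exists.
fn min_token_encode(
    piece: &[u8],
    ranks: &HashMap<Vec<u8>, u32>,
    max_token_len: usize,
) -> Vec<u32> {
    let n = piece.len();
    // dp[i]: minimum number of tokens needed to encode piece[..i].
    let mut dp = vec![usize::MAX; n + 1];
    // backtrack[i]: start of the last token in an optimal encoding of piece[..i].
    let mut backtrack = vec![0usize; n + 1];
    dp[0] = 0;
    for i in 1..=n {
        for j in i.saturating_sub(max_token_len)..i {
            if dp[j] != usize::MAX
                && dp[j] + 1 < dp[i]
                && ranks.contains_key(&piece[j..i])
            {
                dp[i] = dp[j] + 1;
                backtrack[i] = j;
            }
        }
    }
    // Walk backtrack from the end to recover the token ids in order.
    let mut tokens = Vec::with_capacity(dp[n]);
    let mut i = n;
    while i > 0 {
        let j = backtrack[i];
        tokens.push(ranks[&piece[j..i]]);
        i = j;
    }
    tokens.reverse();
    tokens
}
```

Because no token is longer than max_token_len bytes, the inner loop is bounded, so the whole encoding runs in O(n · max_token_len) dictionary lookups.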

Tested on 100MB of Linux source code, run on a single CPU core: this PR encoded it with 48006634 tokens (in 20.07s) vs 48304002 tokens (in 20.21s) from the current code, a 0.6% improvement in compression.

Majdoddin commented 2 days ago

With the new commits, the code now runs 6% faster (as this PR is closed, the new commits do not appear here).

The dynamic programming algorithm in byte_pair_encode() has the property that no index of the dp and backtrack vectors is read before it is written. Exploiting this, I moved byte_pair_encode() into a struct BytePairEncode, with the vectors as its fields. This avoids reallocating the vectors on each call, hence the 6% runtime boost (19.04s, down from 20.21s). A sketch of the idea follows.
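As a rough illustration (hypothetical names and signatures, mirroring the idea rather than the PR's code), the scratch vectors live in the struct and are only grown, never re-initialized, because every index is written before it is read:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the buffer-reuse idea: the DP scratch space
/// lives in the struct, so repeated calls reuse the same allocations.
struct BytePairEncode {
    dp: Vec<usize>,
    backtrack: Vec<usize>,
}

impl BytePairEncode {
    fn new() -> Self {
        Self { dp: Vec::new(), backtrack: Vec::new() }
    }

    fn encode(
        &mut self,
        piece: &[u8],
        ranks: &HashMap<Vec<u8>, u32>,
        max_token_len: usize,
    ) -> Vec<u32> {
        let n = piece.len();
        // Grow the buffers if needed. Stale values from a previous call
        // are harmless: every index is written below before it is read.
        if self.dp.len() < n + 1 {
            self.dp.resize(n + 1, 0);
            self.backtrack.resize(n + 1, 0);
        }
        self.dp[0] = 0;
        for i in 1..=n {
            self.dp[i] = usize::MAX; // written before any read of this index
            for j in i.saturating_sub(max_token_len)..i {
                if self.dp[j] != usize::MAX
                    && self.dp[j] + 1 < self.dp[i]
                    && ranks.contains_key(&piece[j..i])
                {
                    self.dp[i] = self.dp[j] + 1;
                    self.backtrack[i] = j;
                }
            }
        }
        // Recover the token ids by walking backtrack from the end.
        let mut tokens = Vec::with_capacity(self.dp[n]);
        let mut i = n;
        while i > 0 {
            let j = self.backtrack[i];
            tokens.push(ranks[&piece[j..i]]);
            i = j;
        }
        tokens.reverse();
        tokens
    }
}
```

One instance per thread would keep the buffers warm across many pieces with no synchronization, which is presumably where the measured speedup comes from.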

Thank you @hauntsaninja for your review. Your point is indeed valid, although GPT models seem to be able to handle different token distributions, such as code vs. English text. And you can certainly try this PR when training new OpenAI models. Please let me know if it has been useful.