openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Uses Regex instead of fancy-regex - 6x speedup #331

Open Majdoddin opened 3 months ago

Majdoddin commented 3 months ago

This PR realizes the wish expressed in the current code to use the faster Regex crate.

The text is split into pieces before tokenization, according to regular-expression patterns. This PR drops the lookahead part of the pattern (the part that catches whitespace) and handles the whitespace in code instead, with provably identical output. This makes it possible to use the linear-time Regex crate instead of fancy-regex, since Regex does not support lookahead, yielding a 14x speedup in pattern matching. As pattern matching currently accounts for about 90% of encoding runtime, total runtime is boosted about 6x.
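The trick can be sketched in a few lines of Python (a minimal illustration with a made-up, drastically simplified split pattern; the real o200k pattern is far larger, and the PR itself does this in Rust). The lookahead `\s+(?!\S)` stops a whitespace run one character early when non-whitespace follows, so that last space attaches to the next word; the lookahead-free version reproduces this by backing off one character in code:

```python
import re

# WITH_LOOKAHEAD mimics the current style of pattern: the lookahead
# `(?!\S)` keeps the final space of a run attached to the next word.
WITH_LOOKAHEAD = re.compile(r" ?[a-z]+|\s+(?!\S)|\s+")
# Lookahead-free pattern, usable with a linear-time regex engine.
NO_LOOKAHEAD = re.compile(r" ?[a-z]+|\s+")

def split_with_lookahead(text):
    return WITH_LOOKAHEAD.findall(text)

def split_without_lookahead(text):
    """Same splits, with the lookahead replaced by a scripted back-off."""
    pieces, pos = [], 0
    while pos < len(text):
        m = NO_LOOKAHEAD.match(text, pos)
        piece = m.group()
        # Scripted replacement for `(?!\S)`: if a whitespace run longer
        # than one char is followed by non-whitespace, give back its
        # last character so the next match picks it up.
        if (piece.isspace() and len(piece) > 1
                and m.end() < len(text) and not text[m.end()].isspace()):
            piece = piece[:-1]
        pieces.append(piece)
        pos += len(piece)
    return pieces
```

For example, both functions split `"hello   world\n"` into `["hello", "  ", " world", "\n"]`: the run of three spaces is cut after two so the third prefixes `" world"`, exactly what the lookahead achieved.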

Although fancy_regex delegates to Regex when the pattern has no special features, it is still about 10% slower in tests, so we use Regex directly. This improvement applies to pattern matching of ordinary text; catching the special tokens is still done with fancy_regex.
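The two-stage structure this implies can be sketched as follows (a hedged illustration, not the PR's code: the token names are made up, and tiktoken's real implementation does this in Rust). Special tokens need only a plain alternation of escaped literals, so they can be found first, and the main split pattern then runs only on the ordinary text between them:

```python
import re

# Hypothetical special tokens, for illustration only.
SPECIALS = ["<|endoftext|>", "<|fim_prefix|>"]
# A plain alternation of escaped literals: no lookahead required.
SPECIAL_RE = re.compile("|".join(re.escape(t) for t in SPECIALS))

def split_specials(text):
    """Return (ordinary_segment, special_token_or_None) pairs.

    Each ordinary segment would then be split with the main pattern
    and BPE-encoded; each special token maps straight to its id.
    """
    out, pos = [], 0
    for m in SPECIAL_RE.finditer(text):
        out.append((text[pos:m.start()], m.group()))
        pos = m.end()
    out.append((text[pos:], None))
    return out
```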

Tests — for encoding `o200k_base` (used by model GPT-4o):

| Text | Number of tokens | Current runtime | PR runtime |
|---|---|---|---|
| wikitext-103 (100 MB) | 22,138,325 | 18.94 s | 4.94 s |
| Linux code (100 MB) | 36,119,543 | 30.28 s | 4.59 s |
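The ~6x total figure is consistent with Amdahl's law applied to the two numbers stated above (the 90% pattern-matching share and the 14x pattern-matching speedup); the Linux-code row (30.28 s / 4.59 s ≈ 6.6x) lands close to it:

```python
# Sanity-check the claimed total speedup from the stated inputs.
pattern_share = 0.90    # fraction of encoding runtime spent in pattern matching
pattern_speedup = 14.0  # measured speedup of pattern matching alone

total_speedup = 1 / ((1 - pattern_share) + pattern_share / pattern_speedup)
print(round(total_speedup, 1))  # about 6.1, consistent with the ~6x claim
```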

tmm1 commented 2 weeks ago

Thanks for your work on this!

I noticed this code block which sounds like it would need to change along with these regexes?

https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/src/lib.rs#L405-L409

Majdoddin commented 6 days ago

@tmm1 I've implemented it for encode_ordinary(). That part is for unstable encoding. By the way, I'd really appreciate it if you submitted a review for this PR.