openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Add possessive quantifiers to avoid catastrophic backtracking #258

Open paplorinc opened 4 months ago

paplorinc commented 4 months ago

Fixes the crash in https://github.com/openai/tiktoken/issues/245 by using possessive quantifiers to prevent the regex engine from backtracking catastrophically.


Interestingly, these possessive quantifiers also make the encoding a lot faster again in fancy-regex.
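
For intuition, here is a minimal fancy-regex sketch (with the fancy-regex crate as a dependency). The pattern is illustrative, not the actual tokenizer regex: a greedy whitespace run followed by a lookahead gives back one character per failed retry, so a long run of plain spaces trips the default backtrack limit (the same error as in #245), while the possessive form gives nothing back and fails each attempt in a single step.

```rust
use fancy_regex::Regex;

fn run(label: &str, pattern: &str, haystack: &str) {
    let re = Regex::new(pattern).unwrap();
    match re.find(haystack) {
        Ok(m) => println!("{label}: matched = {}", m.is_some()),
        Err(e) => println!("{label}: error = {e}"),
    }
}

fn main() {
    // A long whitespace run with no newline, like the inputs in #245.
    let haystack = " ".repeat(5_000);

    // Greedy: `\s+` consumes the whole run, the lookahead fails, and the
    // engine retries every shorter match at every start position (~n^2
    // steps) -- this exceeds the default backtrack limit and errors out.
    run("greedy", r"\s+(?=[\r\n])", &haystack);

    // Possessive: `\s++` never gives characters back, so each attempt
    // fails in one step and the search completes with no match.
    run("possessive", r"\s++(?=[\r\n])", &haystack);
}
```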

Before this change (but with the large byte pair merge PR cherry-picked):

```
num_threads: 1, num_bytes: 98379553
tiktoken    11,946,036 bytes / s
tiktoken    11,961,343 bytes / s
tiktoken    11,995,846 bytes / s
tiktoken    11,951,263 bytes / s
tiktoken    11,983,405 bytes / s
```

Same, with these changes applied:

```
num_threads: 1, num_bytes: 98379553
tiktoken    14,511,827 bytes / s
tiktoken    14,638,134 bytes / s
tiktoken    14,644,029 bytes / s
tiktoken    14,729,030 bytes / s
tiktoken    14,666,903 bytes / s
```

Updating the regex libs makes it a tiny bit faster still:

```
num_threads: 1, num_bytes: 98379553
tiktoken    14,485,590 bytes / s
tiktoken    14,854,049 bytes / s
tiktoken    14,891,086 bytes / s
tiktoken    14,843,007 bytes / s
tiktoken    14,874,520 bytes / s
```

This is almost 2x faster than the baseline before any of these optimizations (i.e., before the byte pair merge PR as well).


I also opened an issue to increase fancy-regex's default backtrack limit, see https://github.com/fancy-regex/fancy-regex/issues/134, but with these changes it shouldn't be necessary here anymore.
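
For context, fancy-regex already lets callers raise the limit per regex via RegexBuilder; a sketch reusing the illustrative pattern from above:

```rust
use fancy_regex::RegexBuilder;

fn main() {
    // Raising the per-regex backtrack limit lets the greedy search finish
    // instead of erroring out.
    let re = RegexBuilder::new(r"\s+(?=[\r\n])")
        .backtrack_limit(100_000_000) // default: 1,000,000
        .build()
        .unwrap();
    let haystack = " ".repeat(5_000);
    // Completes with Ok(false) instead of Err(BacktrackLimitExceeded),
    // but still pays the quadratic cost that possessive quantifiers avoid.
    println!("matched = {:?}", re.find(&haystack).map(|m| m.is_some()));
}
```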