openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Add possessive quantifiers to avoid catastrophic backtracking #258

Closed: l0rinc closed this 1 month ago

l0rinc commented 9 months ago

Fixes the crash in https://github.com/openai/tiktoken/issues/245 by using possessive quantifiers to keep the regex engine from backtracking catastrophically.
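For anyone unfamiliar with possessive quantifiers: once they have matched, they never give characters back to the rest of the pattern, so the engine has nothing to backtrack into. Below is a minimal sketch of that semantic using fancy-regex (the crate tiktoken uses for these patterns), assuming a fancy-regex version that accepts possessive syntax, which this PR relies on; the pattern and inputs are made up for illustration and are not part of the tiktoken regex:

```rust
use fancy_regex::Regex;

fn main() -> Result<(), fancy_regex::Error> {
    // Greedy `\d+` gives the final digit back so the trailing `0` can still match.
    let greedy = Regex::new(r"\d+0")?;
    assert!(greedy.is_match("1230")?);

    // Possessive `\d++` keeps everything it matched (including the `0`),
    // so the match fails immediately instead of retrying shorter matches.
    let possessive = Regex::new(r"\d++0")?;
    assert!(!possessive.is_match("1230")?);

    // On pathological inputs the difference is not correctness but time:
    // the possessive form cannot explore an explosive backtracking space.
    Ok(())
}
```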


Interestingly, these possessive quantifiers also make the encoding in fancy-regex considerably faster on top of the earlier optimization.

Before this change (but with the large byte pair merge PR cherry-picked):

num_threads: 1, num_bytes: 98379553
tiktoken    11,946,036 bytes / s
tiktoken    11,961,343 bytes / s
tiktoken    11,995,846 bytes / s
tiktoken    11,951,263 bytes / s
tiktoken    11,983,405 bytes / s

Same, with these changes applied:

num_threads: 1, num_bytes: 98379553
tiktoken    14,511,827 bytes / s
tiktoken    14,638,134 bytes / s
tiktoken    14,644,029 bytes / s
tiktoken    14,729,030 bytes / s
tiktoken    14,666,903 bytes / s

Updating the regex libs makes it a tiny bit faster still:

num_threads: 1, num_bytes: 98379553
tiktoken    14,485,590 bytes / s
tiktoken    14,854,049 bytes / s
tiktoken    14,891,086 bytes / s
tiktoken    14,843,007 bytes / s
tiktoken    14,874,520 bytes / s

This is almost 2x faster than before any of the optimizations.


I've opened an issue about increasing the default backtrack limit (https://github.com/fancy-regex/fancy-regex/issues/134), but with this change it shouldn't be necessary here anymore.
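For context on the backtrack limit being discussed: fancy-regex aborts a match with an error once its backtracking step budget is exhausted, and the budget can be raised through RegexBuilder. A rough sketch of that configuration (the pattern, haystack, and the quoted default budget are my own illustration based on the fancy-regex docs, not code from this PR):

```rust
use fancy_regex::RegexBuilder;

fn main() -> Result<(), fancy_regex::Error> {
    // A lookahead forces the backtracking engine; a long whitespace run followed
    // by a non-space character (as in issue #245) then costs quadratic work.
    let pattern = r"\s+(?!\S)";
    let haystack = " ".repeat(100_000) + "end";

    // Raise the per-match backtracking budget above the library default
    // (1,000,000 steps, if I recall the fancy-regex docs correctly).
    let re = RegexBuilder::new(pattern)
        .backtrack_limit(10_000_000)
        .build()?;

    // Exceeding even the raised budget surfaces as an `Err` rather than looping
    // forever, so callers must handle a Result instead of a plain bool.
    match re.is_match(&haystack) {
        Ok(matched) => println!("matched: {matched}"),
        Err(e) => println!("regex error: {e}"),
    }
    Ok(())
}
```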

tmm1 commented 1 month ago

Was the backtrack limit reverted intentionally in 05e66e8? https://github.com/openai/tiktoken/commit/05e66e8db7ef220d3c0b1aafbee5af289345684b#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438

Was there a regression?

hauntsaninja commented 1 month ago

Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.

l0rinc commented 1 month ago

Thanks for checking @tmm1, @hauntsaninja.