Closed: l0rinc closed this 1 month ago
Was the backtrack limit reverted intentionally in 05e66e8? https://github.com/openai/tiktoken/commit/05e66e8db7ef220d3c0b1aafbee5af289345684b#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438
Was there a regression?
Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.
Thanks for checking @tmm1, @hauntsaninja.
Fixes the crash in https://github.com/openai/tiktoken/issues/245 by using possessive quantifiers to prevent the regex engine from backtracking catastrophically.
Interestingly, these possessives also make the encoding a lot faster in fancy-regex.

Before this change (but with the large byte pair merge PR cherry-picked):
Same, with these changes applied:
Updating the regex libraries makes it a tiny bit faster still:
This is almost 2x faster than before any of the optimizations.
Opened an issue about increasing the default backtrack limit (see https://github.com/fancy-regex/fancy-regex/issues/134), but it should no longer be necessary here.