Open paplorinc opened 4 months ago
Fixes the crash in https://github.com/openai/tiktoken/issues/245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.
Interestingly these possesives make the encoding a lot faster again in fancy-regex.
fancy-regex
Before this change (but with large byte pair merge PR cherry-picked):
num_threads: 1, num_bytes: 98379553 tiktoken 11,946,036 bytes / s tiktoken 11,961,343 bytes / s tiktoken 11,995,846 bytes / s tiktoken 11,951,263 bytes / s tiktoken 11,983,405 bytes / s
Same, with these changes applied:
num_threads: 1, num_bytes: 98379553 tiktoken 14,511,827 bytes / s tiktoken 14,638,134 bytes / s tiktoken 14,644,029 bytes / s tiktoken 14,729,030 bytes / s tiktoken 14,666,903 bytes / s
Updating the regex libs makes it a tiny bit faster still:
num_threads: 1, num_bytes: 98379553 tiktoken 14,485,590 bytes / s tiktoken 14,854,049 bytes / s tiktoken 14,891,086 bytes / s tiktoken 14,843,007 bytes / s tiktoken 14,874,520 bytes / s
This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: https://github.com/fancy-regex/fancy-regex/issues/134, but it shouldn't be necessary here anymore.
Fixes the crash in https://github.com/openai/tiktoken/issues/245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.
Interestingly these possesives make the encoding a lot faster again in
fancy-regex
.Before this change (but with large byte pair merge PR cherry-picked):
Same, with these changes applied:
Updating the regex libs makes it a tiny bit faster still:
This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: https://github.com/fancy-regex/fancy-regex/issues/134, but it shouldn't be necessary here anymore.