By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%.
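For illustration, a minimal sketch of this change, assuming contraction fragments like these (using the third-party `regex` module):

```python
import regex  # third-party module; also supports the possessive syntax below

# Before: seven alternatives, each repeating the leading apostrophe.
OLD = regex.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)")
# After: the apostrophe is factored out and the single letters collapse
# into one character class, so the engine checks the quote only once.
NEW = regex.compile(r"'(?i:[sdmt]|ll|ve|re)")

for text in ["don't", "WE'LL", "they've", "I'd"]:
    assert [m.group() for m in OLD.finditer(text)] == \
           [m.group() for m in NEW.finditer(text)]
```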
By using possessive quantifiers in `cl100k_base`'s word and punctuation groups, we avoid some backtracking.
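Sketched the same way, assuming fragments along these lines: because the character classes involved are disjoint, the engine can commit to what those quantifiers consumed instead of re-trying shorter runs on failure:

```python
import regex  # the stdlib `re` module has no possessive quantifiers

# Word group: the optional non-letter prefix can never satisfy \p{L}+,
# so giving it back on backtracking is pointless -- `?+` skips that work.
WORD_OLD = regex.compile(r"[^\r\n\p{L}\p{N}]?\p{L}+")
WORD_NEW = regex.compile(r"[^\r\n\p{L}\p{N}]?+\p{L}+")

# Punctuation group: `++` commits to the full punctuation run.
PUNCT_OLD = regex.compile(r" ?[^\s\p{L}\p{N}]+[\r\n]*")
PUNCT_NEW = regex.compile(r" ?[^\s\p{L}\p{N}]++[\r\n]*")

for s in ["#word", "...!?\r\n", " «quoted»"]:
    assert WORD_OLD.findall(s) == WORD_NEW.findall(s)
    assert PUNCT_OLD.findall(s) == PUNCT_NEW.findall(s)
```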
The last whitespace group can also be simplified to match a single newline explicitly, since the preceding whitespace quantifier already matches any extra newlines.
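A sketch of why this is safe, assuming the fragment shapes below: `\s` itself matches `\r` and `\n`, so both variants end up matching the longest whitespace run that ends in a newline character:

```python
import regex

WS_OLD = regex.compile(r"\s*[\r\n]+")
WS_NEW = regex.compile(r"\s*[\r\n]")

# The greedy \s* absorbs any extra newlines, then backtracks just enough
# to leave one (or, before, at least one) newline for the final class.
for s in ["\n", "  \n\n", "\t \r\n x", "a \n\n\n b"]:
    assert WS_OLD.findall(s) == WS_NEW.findall(s)
```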
Overall, the regex matches exactly the same character sequences as before, regardless of casing and for any Unicode input.
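As an end-to-end sanity check, this is the kind of comparison that can be run, assuming before/after patterns along these lines (a real check should use a much larger and nastier corpus):

```python
import regex

OLD = regex.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
NEW = regex.compile(
    r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
)

samples = ["Hello I'd've guessed 1234!", "mixed CASE '  \r\n\r\n tabs\tétoilé"]
for s in samples:
    assert OLD.findall(s) == NEW.findall(s)
```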
This is the first part of the optimizations I did for jtokkit, which reduced tokenization time from ~10.5 seconds to ~1.6 seconds in several big steps.
If this change is accepted, I'll continue migrating the changes I've made.
I've modified `benchmark.py` locally to measure the improvement:
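(The exact local diff isn't shown here; the measurement amounts to a per-encoding timing loop along these hypothetical lines, with the corpus list standing in for whatever the benchmark loads:)

```python
import time
import tiktoken

def time_encoding(encoding_name: str, documents: list[str]) -> float:
    """Seconds spent encoding `documents` with the named encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    enc.encode("warmup")  # force lazy vocabulary loading before timing
    start = time.perf_counter()
    for doc in documents:
        enc.encode_ordinary(doc)  # plain encoding, no special-token handling
    return time.perf_counter() - start

docs = ["Hello, world! I'd've guessed 1234..."] * 10_000  # stand-in corpus
print(f"cl100k_base: {time_encoding('cl100k_base', docs):.2f}s")
```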
Here the speedup is as follows:
Before:
After regex optimization:
The other 50k tokenizers are also sped up slightly, not just `cl100k_base`.