Open Majdoddin opened 3 months ago
@bigheemseafood
Thanks for your work on this!
I noticed this code block which sounds like it would need to change along with these regexes?
@tmm1 I've implemented it for `encode_ordinary()`. That part is for unstable encoding. By the way, I'd really appreciate it if you submitted a review for this PR.
This PR realizes the wish expressed in the current code to use the faster `Regex`.

Before tokenization, the text is split into pieces according to regular expression patterns. This PR drops the lookahead part of the pattern, the part that catches whitespace, and handles whitespace in code instead, with provably identical output. This makes it possible to use the linear-time `Regex` instead of `fancy-regex`, since `Regex` does not support lookahead, resulting in a 14x speedup of pattern matching. As pattern matching currently accounts for 90% of the encoding runtime, total encoding is about 6x faster.

Although `fancy_regex` delegates to `Regex` when the pattern has no special features, it is still some 10% slower in tests, so we use `Regex` directly. This improvement applies to pattern matching of ordinary text. Special tokens are still caught with `fancy_regex`.
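To illustrate the idea of replacing the whitespace lookahead with a small amount of code, here is a minimal Python sketch. It is not the PR's Rust implementation: the split pattern is a simplified, ASCII-only stand-in for the real one, and the names `split_with_lookahead` and `split_without_lookahead` are hypothetical. It assumes the lookahead branch has the form `\s+(?!\S)` followed by a plain `\s+` fallback, and it only covers that simplified pattern.

```python
import re

# Simplified, ASCII-only stand-in for the real split pattern (an assumption
# made for illustration; the actual tokenizer pattern is more involved).
_HEAD = r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+"

# Original style: the whitespace branch uses a negative lookahead, which a
# linear-time engine without look-around cannot express.
WITH_LOOKAHEAD = re.compile(_HEAD + r"|\s+(?!\S)|\s+")

# Lookahead-free style: a plain whitespace branch, captured in a named group
# so the code can tell when that branch produced the match.
NO_LOOKAHEAD = re.compile(_HEAD + r"|(?P<ws>\s+)")


def split_with_lookahead(text):
    """Reference splitter that relies on the lookahead pattern."""
    return [m.group() for m in WITH_LOOKAHEAD.finditer(text)]


def split_without_lookahead(text):
    """Splitter that emulates the lookahead by trimming whitespace runs."""
    pieces = []
    pos = 0
    while pos < len(text):
        m = NO_LOOKAHEAD.match(text, pos)
        if m is None:
            break  # not expected: the pattern covers every character class
        end = m.end()
        # A greedy whitespace run that stops before the end of the text is
        # necessarily followed by a non-whitespace character. If the run is
        # longer than one character, give the last whitespace character back
        # so the next match can start with it, as the lookahead would.
        if m.lastgroup == "ws" and end - pos > 1 and end < len(text):
            end -= 1
        pieces.append(text[pos:end])
        pos = end
    return pieces


if __name__ == "__main__":
    sample = "Hello   world,  it's\t\n 42  "
    print(split_with_lookahead(sample))
    print(split_without_lookahead(sample))
```

Both functions should return the same pieces for any input under this simplified pattern. The second one uses no look-around, so the same approach carries over to an engine that does not support it.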