hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

compile regex objects ahead of time for improved perf. #133

Closed: erip closed this 1 year ago

erip commented 2 years ago

Compiles regexes where appropriate to improve performance of common operations (subs, searches, matches, finditers). Timeit results for a microbenchmark are below. MT1 is the original without compilation and MT2 is the new version with compilation; the two exist side by side only for this comparison, since this PR replaces the original implementation.

In [1]: lines = [line.strip() for line in open('big.txt') if line.strip()][:1000]

In [2]: from sacremoses.tokenize import MosesTokenizer as MT1

In [3]: from sacremoses.tokenize2 import MosesTokenizer as MT2

In [4]: mt1, mt2 = MT1(lang='en'), MT2(lang='en')

In [5]: %timeit [mt1.tokenize(line) for line in lines]
714 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit [mt2.tokenize(line) for line in lines]
658 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
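For readers unfamiliar with the pattern, here is a minimal sketch of the kind of change described above. The rules and function names are hypothetical and only for illustration; they are not the actual patterns or code in sacremoses.

```python
import re

# Hypothetical substitution rules for illustration only; these are NOT
# the patterns sacremoses actually uses.
RAW_RULES = [
    (r"([,!?])", r" \1 "),  # pad some punctuation with spaces
    (r"\s+", " "),          # collapse runs of whitespace
]

# Compile once, at import time, instead of handing pattern strings to
# re.sub() on every call.
COMPILED_RULES = [(re.compile(pattern), repl) for pattern, repl in RAW_RULES]


def normalize_uncompiled(text):
    """Apply the rules using pattern strings on every call."""
    for pattern, repl in RAW_RULES:
        text = re.sub(pattern, repl, text)
    return text.strip()


def normalize_compiled(text):
    """Apply the same rules using pattern objects compiled ahead of time."""
    for compiled, repl in COMPILED_RULES:
        text = compiled.sub(repl, text)
    return text.strip()


print(normalize_compiled("Hello,  world!"))
```

Python's `re` module already keeps an internal cache of compiled patterns, so the saving from precompiling probably comes mostly from skipping the per-call cache lookup. That would fit the modest gain in the numbers above (714 ms to 658 ms, roughly 8%).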

erip commented 2 years ago

As a quick note: if I replace import re with import regex as re, the timeit microbenchmark takes 1.62 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each). Quite the penalty just for switching the regex engine!
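A rough way to try this kind of engine comparison in isolation is sketched below. It assumes the third-party `regex` package may or may not be installed and falls back to timing `re` alone. The pattern and sample text are hypothetical, so the timings will not match the tokenizer benchmark above.

```python
import re
import timeit

try:
    import regex  # third-party package (pip install regex); optional here
except ImportError:
    regex = None

# Hypothetical pattern and input, chosen only for illustration.
PATTERN = r"([,!?])"
SAMPLE = "Hello, world! How are you? " * 100


def time_engine(engine, number=200):
    """Time repeated substitutions using a pattern compiled by `engine`."""
    compiled = engine.compile(PATTERN)
    return timeit.timeit(lambda: compiled.sub(r" \1 ", SAMPLE), number=number)


engines = {"re": re}
if regex is not None:
    engines["regex"] = regex

timings = {name: time_engine(module) for name, module in engines.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Results from a single pattern like this say little about a full tokenizer, so treat them as a sanity check rather than a replacement for the end-to-end measurement.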