hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Compile regexp in detokenizer #143

Closed: jelmervdl closed this 11 months ago

jelmervdl commented 1 year ago

Together with #133 this replaces #140.

main: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     35.786 s ±  0.612 s    [User: 35.058 s, System: 0.475 s]
  Range (min … max):   34.669 s … 36.835 s    10 runs

this: cat big.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      8.581 s ±  0.119 s    [User: 8.181 s, System: 0.383 s]
  Range (min … max):    8.453 s …  8.789 s    10 runs
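For context, the speedup comes from compiling the substitution patterns once up front instead of passing pattern strings to `re.sub` on every call. Python's `re` module does cache compiled patterns internally, but the per-call cache lookup and argument handling still add up when many rules are applied to every line. A minimal sketch of the idea, using hypothetical substitution rules rather than the actual sacremoses ones:

```python
import re

# Hypothetical detokenization rules (illustrative only, not the real
# sacremoses rule set): each is a (pattern, replacement) pair.
RULES = [
    (r"\s+([.,;:!?])", r"\1"),  # remove space before punctuation
    (r"``\s*", '"'),            # opening quote
    (r"\s*''", '"'),            # closing quote
]

def detok_uncompiled(line):
    # Before: pattern strings are handed to re.sub on every call,
    # paying the module-level cache lookup each time.
    for pat, repl in RULES:
        line = re.sub(pat, repl, line)
    return line

# After: compile each pattern once, at import/init time.
COMPILED = [(re.compile(pat), repl) for pat, repl in RULES]

def detok_compiled(line):
    # The compiled pattern objects are reused directly.
    for pat, repl in COMPILED:
        line = pat.sub(repl, line)
    return line

line = "Hello , world ! This is `` fine '' ."
assert detok_uncompiled(line) == detok_compiled(line)
```

Both variants produce identical output; only the per-line overhead changes, which is why the benchmark above shows the same detokenization work finishing several times faster.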
ZJaume commented 1 year ago

I don't want to be picky, but does that big.txt contain tokenized sentences? Performance may differ if the input isn't tokenized.

jelmervdl commented 1 year ago

True, I assumed it wouldn't matter much for performance comparisons. I've now run the same thing on a tokenized version of big.txt. The difference is slightly smaller, but still large enough to justify this change, I'd say.

main: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):     34.814 s ±  0.724 s    [User: 34.226 s, System: 0.464 s]
  Range (min … max):   33.846 s … 36.157 s    10 runs

this: cat big.tok.txt | python -m sacremoses -l en detokenize > /dev/null
  Time (mean ± σ):      9.253 s ±  0.172 s    [User: 8.828 s, System: 0.381 s]
  Range (min … max):    9.060 s …  9.560 s    10 runs