hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License
486 stars, 59 forks

More robust testing for chaining sacremoses CLI #150

Open alvations opened 2 months ago

alvations commented 2 months ago

The CLI flags and the chaining of commands through a pipeline should be tested more robustly than with just the examples in README.md.

Not sure if this is still the case, but I found the following in https://aclanthology.org/2020.wmt-1.88.pdf:

During our early experiments we noticed several issues with our preprocessing pipeline which we fixed for the later experiments. In particular, we noticed that some sacremoses command line flags were broken, and the out-of-the-box inference tool from FairSeq did not fully replicate the preprocessing pipeline used for training (punctuation normalization and vocabulary-aware subword segmentation). The original pipeline (called v1) was used for our baseline models. The later experiments used the fixed implementations of sacremoses and FairSeq (denoted by v2).
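One way such a test could look is sketched below. It checks that a single chained invocation produces the same output as running each stage as its own process and piping the results together. The helper names (`run_cli`, `run_pipeline`, `assert_chain_matches_pipeline`) are hypothetical and not part of sacremoses, and the demo uses small stand-in Python commands so that it runs without sacremoses installed.

```python
import subprocess
import sys


def run_cli(argv, text):
    """Run a command with `text` on stdin and return its stdout."""
    result = subprocess.run(
        argv, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_pipeline(stages, text):
    """Feed `text` through each command in turn, like `a | b | c` in a shell."""
    for argv in stages:
        text = run_cli(argv, text)
    return text


def assert_chain_matches_pipeline(chained_argv, stages, text):
    """Check that one chained call equals the separate stages piped together."""
    chained = run_cli(chained_argv, text)
    piped = run_pipeline(stages, text)
    assert chained == piped, f"chained={chained!r} != piped={piped!r}"
    return chained


if __name__ == "__main__":
    # Stand-in commands (NOT sacremoses) so the sketch is runnable anywhere.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    leet = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().replace('A', '4'))"]
    both = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper().replace('A', '4'))"]
    print(assert_chain_matches_pipeline(both, [upper, leet], "banana\n"), end="")
```

For sacremoses itself, the stand-ins would be replaced by real invocations, for example a chained `["sacremoses", "-l", "en", "normalize", "tokenize"]` compared against the two single-command stages. That assumes the subcommand names and chaining shown in README.md and should be checked against the installed version. Parametrizing the input text and the flag combinations would then cover more than the README examples.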