More robust testing for chaining sacremoses CLI

The CLI flags and the chaining of commands through a pipeline should be tested more robustly than just running the examples in README.md.
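One way such a test could look, as a rough sketch rather than anything that exists in the repo: run the same steps once as a single chained invocation and once as separate invocations piped together, then compare the outputs. The helper names (`run_cli`, `run_stepwise`, `run_chained`, `chaining_is_consistent`) are hypothetical, and the sketch assumes the CLI reads stdin and writes stdout.

```python
import subprocess


def run_cli(argv, text):
    """Run one command, feed `text` on stdin, and return its stdout."""
    result = subprocess.run(
        argv, input=text, capture_output=True, text=True, check=True
    )
    return result.stdout


def run_stepwise(base, steps, text):
    """Run each step as its own invocation, piping output into the next."""
    for step in steps:
        text = run_cli(list(base) + list(step), text)
    return text


def run_chained(base, steps, text):
    """Run all steps in a single invocation: base + step1 + step2 + ..."""
    argv = list(base)
    for step in steps:
        argv.extend(step)
    return run_cli(argv, text)


def chaining_is_consistent(base, steps, text):
    """True if the chained call matches the step-by-step pipeline."""
    return run_chained(base, steps, text) == run_stepwise(base, steps, text)
```

Assuming the `sacremoses` executable is on PATH and accepts the chained form shown in README.md, a test might assert `chaining_is_consistent(["sacremoses", "-l", "en"], [["normalize"], ["tokenize"]], "Hello, world!\n")`, and repeat that over several flag combinations. A mismatch would not say which side is wrong, only that chaining changes the result.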

Not sure if this is still the case, but I found the following in https://aclanthology.org/2020.wmt-1.88.pdf:

During our early experiments we noticed several issues with our preprocessing pipeline which we fixed for the later experiments. In particular, we noticed that some sacremoses command line flags were broken, and the out-of-the-box inference tool from FairSeq did not fully replicate the preprocessing pipeline used for training (punctuation normalization and vocabulary-aware subword segmentation). The original pipeline (called v1) was used for our baseline models. The later experiments used the fixed implementations of sacremoses and FairSeq (denoted by v2).
