Closed j0hannes closed 2 years ago
Hi @j0hannes, sorry for the looong delay. I didn't receive an email notification (or it went to the spam folder) for this issue. Given that this package is intended as a wrapper for the original moses tokenizer (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer), I believe that we shouldn't attempt to fix this problem here, but in the Perl scripts instead.
I saw that you also posted an issue in sacremoses. Perhaps it makes more sense to fix this issue in sacremoses than here, since it is a Python reimplementation of the original Perl scripts.
If the original Perl scripts are ever updated to fix this issue, then it would make sense to update the copies of those scripts within this package to inherit the fix.
I just tried the detokenizer, and, while punctuation marks don't seem to pose any problem, it erroneously puts a white space inside contractions:
The result is "yesterday ’s reception".
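Until this is fixed upstream, one possible stopgap is to post-process the detokenizer output yourself. The snippet below is only a rough sketch: `rejoin_contractions` is a hypothetical helper, not part of this package, sacremoses, or the Perl scripts, and the list of suffixes is an assumption that covers common English contractions only.

```python
import re

# Hypothetical helper, not part of any Moses package.
# Matches a single space that sits between a word character and an
# apostrophe (straight or curly) followed by a common English
# contraction suffix. The suffix list is an assumption.
_CONTRACTION_GAP = re.compile(
    r"(?<=\w) (?=['’](?:s|t|d|m|ll|re|ve)\b)",
    re.IGNORECASE,
)


def rejoin_contractions(text: str) -> str:
    """Remove the stray space the detokenizer leaves inside contractions."""
    return _CONTRACTION_GAP.sub("", text)


print(rejoin_contractions("yesterday ’s reception"))  # yesterday’s reception
```

This can misfire on quoted single letters (for example a word followed by 'd' in quotes), so it is a workaround rather than a real fix.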