luismsgomes / mosestokenizer

GNU Lesser General Public License v2.1

Detokenizing English - no support for apostrophes? #1

Closed. j0hannes closed this issue 2 years ago

j0hannes commented 4 years ago

I just tried the detokenizer, and while punctuation marks seem not to pose any problem, it erroneously leaves a space inside contractions:

import mosestokenizer

tokens = 'yesterday ’s reception'.split(' ')
with mosestokenizer.MosesDetokenizer('en') as detokenize:
    print(detokenize(tokens))

The result is "yesterday ’s reception" rather than the expected "yesterday’s reception".
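
For now I can work around it by post-processing the detokenizer output myself. The snippet below is only an illustration (the helper name and regex are mine, not something mosestokenizer provides); it closes the gap before common apostrophe suffixes:

import re

def fix_apostrophes(text):
    # Illustration only: remove the space left before an apostrophe
    # (straight or curly) followed by a common contraction suffix.
    return re.sub(r"\s+(['’](?:s|re|ve|ll|d|t|m))\b", r"\1", text)

print(fix_apostrophes('yesterday ’s reception'))  # -> yesterday’s reception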

luismsgomes commented 2 years ago

Hi @j0hannes, sorry for the looong delay. I didn't receive an email notification for this issue (or it went to the spam folder). Given that this package is intended as a wrapper for the original Moses tokenizer scripts (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer), I believe we shouldn't attempt to fix this problem here, but in the Perl scripts instead.

I saw that you also posted an issue in sacremoses. Perhaps it makes more sense to fix this issue there than here, since sacremoses is a Python reimplementation of the original Perl scripts.
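
For comparison, the same check can be run directly against sacremoses (assuming a current sacremoses release; I haven't verified what it prints for this input):

from sacremoses import MosesDetokenizer

# Reproduce the example from the report above with the Python reimplementation.
detok = MosesDetokenizer(lang='en')
tokens = 'yesterday ’s reception'.split(' ')
print(detok.detokenize(tokens))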

If the original Perl scripts are ever updated to fix this issue, then it would make sense to update the copies of those scripts within this package to inherit the fix.