hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Escaped characters not converted back into special characters after tokenization #27

Closed: BLKSerene closed this issue 5 years ago

BLKSerene commented 5 years ago

Hi, I'm using SacreMoses 0.0.7. Special characters such as [ ] < > are escaped when text is tokenized with MosesTokenizer.tokenize, and they stay escaped in the results (the examples in the docs show the same behavior).

Is this the expected behavior, and is there a reason for it? (For example, I would expect &apos;s to be converted back to 's in the results.)

In addition, MosesTokenizer.penn_tokenize converts square brackets to -LSB- and -RSB-.

>>> import sacremoses
>>> text = 'English is a West Germanic language that was first spoken in early medieval England and eventually became a global lingua franca.[4][5]'
>>> moses_tokenizer = sacremoses.MosesTokenizer(lang = 'en')
>>> moses_tokenizer.tokenize(text)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '&#91;', '4', '&#93;', '&#91;', '5', '&#93;']
>>> moses_tokenizer.penn_tokenize(text)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '-LSB-', '4', '-RSB-', '-LSB-', '5', '-RSB-']
alvations commented 5 years ago

Both behaviors are expected and follow the original Moses tokenizer scripts: the default tokenizer escapes special XML characters, and the Penn tokenizer applies its own Penn-specific conversions. See https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer

As shown in https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L291, tokenize takes an escape argument, so to get unescaped output use something like:

>>> from sacremoses import MosesTokenizer
>>> text = 'English is a West Germanic language that was first spoken in early medieval England and eventually became a global lingua franca.[4][5]'
>>> mt = MosesTokenizer(lang = 'en')
>>> mt.tokenize(text, escape=False)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '[', '4', ']', '[', '5', ']']
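
If the tokens were already produced with escaping turned on, they can probably be mapped back afterwards. The following is only a minimal sketch: it assumes Python's standard html.unescape covers the entities seen in this thread (&#91;, &#93;, &apos; and similar), and unescape_tokens is a hypothetical helper, not part of sacremoses.

```python
import html

# Tail of the escaped output from the first comment in this thread.
escaped_tokens = ['franca', '.', '&#91;', '4', '&#93;', '&#91;', '5', '&#93;']


def unescape_tokens(tokens):
    """Hypothetical helper (not a sacremoses API): undo entity escaping per token."""
    # html.unescape resolves both named (&apos;) and numeric (&#91;) references.
    return [html.unescape(token) for token in tokens]


unescaped_tokens = unescape_tokens(escaped_tokens)
print(unescaped_tokens)
```

Passing escape=False at tokenization time, as above, avoids the extra pass and is likely the simpler choice when the escaped form is never needed.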
BLKSerene commented 5 years ago

Thanks, I got it.