Closed: BLKSerene closed this issue 5 years ago.
Both behaviors, escaping special XML characters in the default tokenizer and the Penn-specific conversions, are expected: they are inherited from the original Moses tokenizer scripts at https://github.com/moses-smt/mosesdecoder/tree/master/scripts/tokenizer
From https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L291, use something like:
>>> from sacremoses import MosesTokenizer
>>> text = 'English is a West Germanic language that was first spoken in early medieval England and eventually became a global lingua franca.[4][5]'
>>> mt = MosesTokenizer(lang='en')
>>> mt.tokenize(text, escape=False)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '[', '4', ']', '[', '5', ']']
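For comparison, a sketch of what the default escape=True gives for the same text (illustrative output based on the Moses escaping rules, where [ and ] are mapped to &#91; and &#93;; the exact entities may differ by sacremoses version):
>>> mt.tokenize(text)
['English', 'is', 'a', 'West', 'Germanic', 'language', 'that', 'was', 'first', 'spoken', 'in', 'early', 'medieval', 'England', 'and', 'eventually', 'became', 'a', 'global', 'lingua', 'franca', '.', '&#91;', '4', '&#93;', '&#91;', '5', '&#93;']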
Thanks, I got it.
Hi, I'm using SacreMoses 0.0.7. Special characters like [, ], < and > are escaped when the text is tokenized (using MosesTokenizer.tokenize) and are left escaped in the results (the examples in the doc show the same behavior). Is this the expected behavior, and is there any reason for it? (For example, I would expect &apos;s to be converted back to 's in the results.) To add, MosesTokenizer.penn_tokenize converts square brackets to -LSB- and -RSB-.
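For reference, a minimal sketch of the penn_tokenize behavior described above (the sentence is made up for illustration, and the output assumes penn_tokenize returns a token list with square brackets mapped to the Penn Treebank -LSB-/-RSB- tokens, as noted):
>>> from sacremoses import MosesTokenizer
>>> mt = MosesTokenizer(lang='en')
>>> mt.penn_tokenize('See [4] and [5] for details.')
['See', '-LSB-', '4', '-RSB-', 'and', '-LSB-', '5', '-RSB-', 'for', 'details', '.']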