hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Support of Moses tokenizer Perl scripts #44

Closed loretoparisi closed 5 years ago

loretoparisi commented 5 years ago

In my tokenization pipeline I run several Moses perl scripts like

from subprocess import check_output

def TokenLine(line, lang='en', lower_case=True, romanize=False):
    assert lower_case, 'lower case is needed by all the models'
    roman = lang if romanize else 'none'
    tok = check_output(
            REM_NON_PRINT_CHAR
            + '|' + NORM_PUNC + lang
            + '|' + DESCAPE
            + '|' + MOSES_TOKENIZER + lang,
            input=line,
            encoding='UTF-8',
            shell=True)
    return tok.strip()

where I have

MOSES_BDIR = os.path.join(TOOL_BASE_DIR, 'moses-tokenizer/tokenizer/')
MOSES_TOKENIZER = MOSES_BDIR + 'tokenizer.perl -q -no-escape -threads 20 -l '
NORM_PUNC = MOSES_BDIR + 'normalize-punctuation.perl -l '
DESCAPE = MOSES_BDIR + 'deescape-special-chars.perl'
REM_NON_PRINT_CHAR = MOSES_BDIR + 'remove-non-printing-char.perl'
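As an aside, the shell-pipe call above can also be written without shell=True by chaining subprocess.Popen objects, so the stage commands never pass through a shell. A minimal sketch, with hypothetical Python one-liners standing in for the Perl scripts so it runs anywhere:

```python
import sys
from subprocess import Popen, PIPE

# Hypothetical stand-ins for the Perl filters: each stage is a command that
# reads stdin and writes stdout, like remove-non-printing-char.perl or
# tokenizer.perl would in the real pipeline.
UPPER = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
STRIP = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().strip())"]

def run_pipeline(line, stages):
    # Chain the stages like a shell pipe: stdout of one feeds stdin of the next.
    procs = []
    for cmd in stages:
        stdin = procs[-1].stdout if procs else PIPE
        procs.append(Popen(cmd, stdin=stdin, stdout=PIPE, text=True))
    procs[0].stdin.write(line)
    procs[0].stdin.close()
    out = procs[-1].stdout.read()
    for p in procs:
        p.wait()
    return out

print(run_pipeline("  hello world  ", [UPPER, STRIP]))  # HELLO WORLD
```

This is fine for single lines; for bulk text you would feed the first stage incrementally to avoid pipe-buffer deadlocks.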

So I run normalize-punctuation.perl, remove-non-printing-char.perl and tokenizer.perl. Are all of these scripts supported by Sacremoses?

My understanding is that I should do something like

from sacremoses import MosesTokenizer

mtok = MosesTokenizer(lang='fr')
tokenized_docs = [mtok.tokenize(line) for line in text.splitlines()]

From the command-line options I can see --xml-unescape, which unescapes special XML characters and should match the deescape-special-chars.perl script. As for the non-printing characters handled by remove-non-printing-char.perl, my understanding is that they are handled by the API without any extra options, e.g.

>>> mtok.tokenize("This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf")
['This', ',', 'is', 'a', 'sentence', 'with', 'weird', '»', 'symbols', '…', 'appearing', 'everywhere', '¿']
>>> 
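For reference, what remove-non-printing-char.perl itself does is, roughly, substitute any character in the Unicode "Other" categories with a space (the script's s/\p{C}/ /g). That behavior can be sketched with the standard library:

```python
import unicodedata

def remove_non_printing_char(text):
    # Replace every character in a Unicode "Other" category (Cc, Cf, Co,
    # Cs, Cn) with a plain space, roughly what the Perl one-liner does.
    return "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )

print(remove_non_printing_char("abc\u200bdef"))  # zero-width space (Cf) becomes " "
```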

So what I did was use the signature of tokenize

def tokenize(self, text, aggressive_dash_splits=False, return_str=False, escape=True):

to use it like

mtok.tokenize(string, escape=True, return_str=True, aggressive_dash_splits=False)

because I want a string output (return_str=True) and no aggressive splitting on hyphens (aggressive_dash_splits=False). Regarding the escaping of HTML characters in the Perl version, I don't know whether escape=True in the Python version will match it, since here it handles XML escaping.
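For reference, the Moses escape step maps the XML-special and Moses-reserved characters to entities (& to &amp;amp;, then < > ' " | [ ]). A standard-library sketch of that mapping, to illustrate what escape=True produces on output tokens:

```python
from xml.sax.saxutils import escape

# Entities beyond & < > that Moses' tokenizer.perl also escapes.
MOSES_ENTITIES = {"|": "&#124;", "[": "&#91;", "]": "&#93;",
                  "'": "&apos;", '"': "&quot;"}

def moses_escape(token):
    # escape() replaces & < > first, then applies the extra entity table,
    # so the ampersands in the substitutions are not double-escaped.
    return escape(token, MOSES_ENTITIES)

print(moses_escape("AT&T [1] | 'x'"))  # AT&amp;T &#91;1&#93; &#124; &apos;x&apos;
```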

alvations commented 5 years ago

@loretoparisi currently, only these tools from Moses tokenizer scripts and Moses recaser scripts are supported:

A simple cheatsheet (we need to do more documentation!):

Moses                       Sacremoses (Py)                            Sacremoses (CLI)
tokenizer.perl -l en        MosesTokenizer(lang='en').tokenize         sacremoses tokenize -l en
detokenizer.perl -l en      MosesDetokenizer(lang='en').detokenize     sacremoses detokenize -l en
train-truecaser.perl        MosesTruecaser().train                     sacremoses train-truecase
truecase.perl               MosesTruecaser().truecase                  sacremoses truecase
detruecase.perl             MosesDetruecaser().detruecase              sacremoses detruecase
normalize-punctuation.perl  MosesPunctNormalizer(lang='en').normalize  sacremoses normalize -l en

loretoparisi commented 5 years ago

@alvations thank you very much for this comparison table!!! It would be worth adding it to the docs! Playing with the tool, I was finally able to make it work like Moses in Facebook LASER's biLSTM - see here https://github.com/facebookresearch/LASER/issues/55

alvations commented 5 years ago

Congrats!

P.S.: I've just done some additional feature upgrades, so do ensure pip install -U "sacremoses>=0.0.19" =)

aj7tesh commented 3 years ago

@alvations @loretoparisi Is the Moses tokenizer and detokenizer in sacremoses exactly the same as the tokenizer.perl and detokenizer.perl scripts for English?

loretoparisi commented 3 years ago

@aj7tesh in my case, I have used both according to the table above.

aj7tesh commented 3 years ago

@loretoparisi That means that, for English, the Sacremoses (Py) tokenizer and detokenizer behave like the Perl scripts. Have I got that right?

aj7tesh commented 3 years ago

And do you have any idea about these: https://pypi.org/project/mosestokenizer/#description and https://pypi.org/project/fast-mosestokenizer/ ? I actually want an exact replacement of tokenizer.perl and detokenizer.perl as a Python pip package, if you can help with that.

alvations commented 3 years ago

Yes @aj7tesh for English, they should be the same.

https://github.com/mingruimingrui/fast-mosestokenizer is a newer library, and people should use that one going forward.

aj7tesh commented 3 years ago

Great, so it looks like I can go ahead with either sacremoses or pip install mosestokenizer. And that apostrophe issue is no longer a difference in the latest sacremoses version, right? @alvations