hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Flag --protected from original Moses tokenizer #35

Closed: noe closed this issue 5 years ago

noe commented 5 years ago

The original Moses tokenizer supports the --protected flag. Its effect is to accept a file containing a list of regular expressions whose matches should be protected from tokenization.

Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.
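
The mask, tokenize, unmask flow described above can be sketched in plain Python. This is a minimal illustration only, not the actual Moses or sacremoses implementation: `toy_tokenize`, `tokenize_protected` and the `PROTECTEDTOKEN` placeholder are hypothetical names, and the toy tokenizer merely splits off punctuation.

```python
import re

def toy_tokenize(text):
    # Stand-in tokenizer (an assumption for this sketch): runs of word
    # characters are tokens, any other non-space character is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_protected(text, patterns):
    # 1. Mask: collect non-overlapping matches (earliest first, longest
    #    first on ties) and swap each for an alphanumeric placeholder.
    spans = sorted(
        {m.span() for p in patterns for m in re.finditer(p, text) if m.group()},
        key=lambda s: (s[0], -s[1]),
    )
    protected, pieces, pos = [], [], 0
    for start, end in spans:
        if start < pos:
            continue  # overlaps a span that is already protected
        pieces.append(text[pos:start])
        pieces.append(" PROTECTEDTOKEN%d " % len(protected))
        protected.append(text[start:end])
        pos = end
    pieces.append(text[pos:])
    # 2. Tokenize the masked text; each placeholder survives as one token.
    tokens = toy_tokenize("".join(pieces))
    # 3. Unmask: put the original strings back.
    restore = {"PROTECTEDTOKEN%d" % i: s for i, s in enumerate(protected)}
    return [restore.get(tok, tok) for tok in tokens]

url_pattern = r"https?://[^\s,]+"  # deliberately simple, for illustration
print(tokenize_protected("see http://example.com/a, ok", [url_pattern]))
# prints: ['see', 'http://example.com/a', ',', 'ok']
```

Because masking does not depend on the surrounding words, the same sketch also keeps a URL intact when it is the only thing in the text. It assumes the input never already contains a `PROTECTEDTOKEN` string.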

Is this functionality on the sacremoses roadmap?

alvations commented 5 years ago

Hmmm, looks easy to implement but tricky to test.

Do you have an example protected_patterns_file, plus some text containing those patterns, that could be used for testing? If you do, I could easily code it up and write the test =)

noe commented 5 years ago

Sorry, I don't have an example of such files. My intention was to avoid URLs being tokenized and I was planning to use a regex like this one:

    import re
    regex = re.compile(
        r'(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
        r'(?::\d+)?'  # optional port
        r'(?:/\w+)*'
        r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

Then I found out that Moses tokenizer supported it and checked sacremoses for it because it's what we use.

P.S.: the regex above is used in our code but I don't know where I took it from; my past self wrote as a comment that it's loosely based on Django's URL validators but my present self can't see an evident connection with it.
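
Before using a pattern like this for protection, it may help to check exactly which substring it matches. The following is a small sketch using only the standard `re` module; `matched_part` is a hypothetical helper name and the example URLs are made up for illustration.

```python
import re

# The pattern from the comment above, unchanged.
URL_REGEX = re.compile(
    r'(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
    r'(?::\d+)?'  # optional port
    r'(?:/\w+)*'
    r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)

def matched_part(text):
    # Return the substring the pattern would protect, or None if no match.
    m = URL_REGEX.search(text)
    return m.group() if m else None

print(matched_part("docs at http://example.com/path/page.html today"))
# prints: http://example.com/path/page.html

print(matched_part("http://example.com:8080/api/"))
# prints: http://example.com:8080/api/

# \w does not include "-", so the match stops at the first hyphen in a path.
print(matched_part("https://example.com/how-to"))
# prints: https://example.com/how
```

The match does not depend on text before or after the URL, but path segments containing characters outside `\w` (such as hyphens) are only partially covered by this particular pattern.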

alvations commented 5 years ago

@noe Django's URL validators are a little too heavy to incorporate here.

In #46 I've tried incorporating the protected_patterns feature, based on https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/basic-protected-patterns, together with a unit test against the pattern you listed in your previous comment.


On CLI:

 $ pip install -U "sacremoses>=0.0.19"

 $ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns

 $ sacremoses tokenize -j 4 -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s
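
A patterns file such as Moses' basic-protected-patterns holds one regular expression per line. As a rough sketch, assuming that layout, such a file could be read into a Python list as follows; `load_protected_patterns` is a hypothetical helper, not part of sacremoses.

```python
import re

def load_protected_patterns(path):
    # Assumes one regular expression per line; blank lines are skipped.
    patterns = []
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            pattern = line.rstrip("\n")
            if not pattern.strip():
                continue
            re.compile(pattern)  # fail early if Python's re rejects the pattern
            patterns.append(pattern)
    return patterns
```

The resulting list has the same shape as the lists passed via `protected_patterns=` in this thread's Python example. Patterns written for Perl may not all be valid under Python's `re`, which is why the sketch compiles each one up front.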

In Python:

    from sacremoses import MosesTokenizer

    moses = MosesTokenizer()
    text = "this is a webpage https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl that kicks ass"
    expected_tokens = ['this', 'is', 'a', 'webpage',
                       'https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl',
                       'that', 'kicks', 'ass']
    assert moses.tokenize(text, protected_patterns=moses.BASIC_PROTECTED_PATTERNS) == expected_tokens

    # Testing against pattern from https://github.com/alvations/sacremoses/issues/35
    noe_patterns = [r'(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
        r'(?::\d+)?'  # optional port
        r'(?:/\w+)*'
        r'(?:(?:\.[a-z]+)|/?)']
    assert moses.tokenize(text, protected_patterns=noe_patterns) == expected_tokens

alvations commented 5 years ago

Added feature. Thanks again @noe ! c.f. #46

ganeshvictory commented 1 year ago

Hi @alvations, the code snippet above only handles the case where there is text before and after the URL, and it only accepts URLs in a specific format. Is there any way to handle other cases, for example when the text consists of nothing but a URL, or when there is no text before or after it, so that the URL is still not tokenized?

Thanks in advance!