Closed noe closed 5 years ago
Hmmm, looks easy to implement but tricky to test.
Do you have an example protected_patterns_file and some related text containing those patterns that could be used for testing? If you do, I could easily code it up and write the test =)
Sorry, I don't have an example of such files. My intention was to avoid URLs being tokenized and I was planning to use a regex like this one:
import re
regex = re.compile(
r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
r'(?::\d+)?' # optional port
r'(?:/\w+)*'
r'(?:(?:\.[a-z]+)|/?)', re.IGNORECASE)
Then I found out that Moses tokenizer supported it and checked sacremoses for it because it's what we use.
P.S.: the regex above is used in our code but I don't know where I took it from; my past self wrote as a comment that it's loosely based on Django's URL validators but my present self can't see an evident connection with it.
@noe Django's URL validators are a little too heavy to incorporate here.
I've tried incorporating the protected_patterns feature from https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/basic-protected-patterns and a unit test case against the pattern you listed in the previous comment, in #46
On CLI:
$ pip install -U "sacremoses>=0.0.19"
$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses tokenize -j 4 -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
In Python:
from sacremoses import MosesTokenizer
moses = MosesTokenizer()
text = "this is a webpage https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl that kicks ass"
expected_tokens = ['this', 'is', 'a', 'webpage',
'https://stackoverflow.com/questions/6181381/how-to-print-variables-in-perl',
'that', 'kicks', 'ass']
assert moses.tokenize(text, protected_patterns=moses.BASIC_PROTECTED_PATTERNS) == expected_tokens
# Testing against pattern from https://github.com/alvations/sacremoses/issues/35
noe_patterns = [r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?))'
r'(?::\d+)?' # optional port
r'(?:/\w+)*'
r'(?:(?:\.[a-z]+)|/?)']
assert moses.tokenize(text, protected_patterns=noe_patterns) == expected_tokens
Added feature. Thanks again @noe ! c.f. #46
Hi @alvations, the code snippet above only handles the case where there is text before and after the URL, and it only recognizes URLs in one specific format. Is there a way to handle other cases as well, for example a text that consists of nothing but a URL, or a URL with no text before or after it, so that the URL is still left untokenized?
Thanks in advance!
The original Moses tokenizer supports the --protected flag. Its effect is to accept a file with a list of regular expressions that should be protected from tokenization. Under the hood, the tokenizer masks each match of the regexes, then tokenizes, then unmasks.
Is this functionality in the roadmap of sacremoses?
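For reference, the mask / tokenize / unmask flow described above could be sketched roughly as follows. This is only an illustration of the idea and not how Moses or sacremoses actually implements it: `toy_tokenize`, `tokenize_protected` and the `THISISPROTECTED` placeholder format are hypothetical names chosen for the example, and the patterns are assumed to be plain regexes without inline flags.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_protected(text, patterns, tokenize=toy_tokenize):
    """Mask every regex match, tokenize the masked text, then unmask."""
    protected = []

    def mask(match):
        # Remember the matched span and replace it with a placeholder that
        # the tokenizer will keep as a single token.
        protected.append(match.group(0))
        return " THISISPROTECTED{:03d} ".format(len(protected) - 1)

    # One combined pattern, applied in a single pass, so that a later
    # pattern can never match inside an already inserted placeholder.
    combined = re.compile("|".join("(?:{})".format(p) for p in patterns))
    masked = combined.sub(mask, text)

    # Map each placeholder back to the original protected string.
    lookup = {
        "THISISPROTECTED{:03d}".format(i): span
        for i, span in enumerate(protected)
    }
    return [lookup.get(token, token) for token in tokenize(masked)]


url_pattern = r"https?://\S+"  # deliberately simple pattern for the demo
tokens = tokenize_protected("visit https://example.com/docs now!", [url_pattern])
print(tokens)
```

Without the masking step, the toy tokenizer would split the URL at every `:`, `/` and `.`; with it, the URL comes back as one token. Because the masking does not depend on surrounding words, this sketch treats a text that consists of a URL alone the same way.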