hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Web and basic protected patterns by default #138

Open samirsalman opened 1 year ago

samirsalman commented 1 year ago

By default, the library does not apply any protected patterns, such as WEB_PROTECTED_PATTERNS, which contains patterns for URLs and email addresses, among others.

# Example
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer()
tokenizer.tokenize("http://www.someurl.com")

# Expected output
["http://www.someurl.com"]

# sacremoses output
["http", ":", "/", "/", "www.someurl.com"]

I suggest using WEB_PROTECTED_PATTERNS and BASIC_PROTECTED_PATTERNS by default when the user does not specify protected patterns. This would let users avoid URL tokenization issues when calling tokenize with default arguments. Users could still pass different protected patterns, or disable protection entirely by setting the protected_patterns parameter to an empty list:

tokenizer.tokenize("http://www.someurl.com", protected_patterns=[])
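
To make the proposed semantics concrete, here is a minimal, standard-library-only sketch of the fallback logic. It is not sacremoses code: `DEFAULT_PROTECTED_PATTERNS`, `naive_tokenize` and `tokenize_with_defaults` are hypothetical names, the two regexes are simplified stand-ins for the library's pattern lists, and the toy tokenizer does not follow the Moses rules. The only point is the difference between `None` (use the defaults) and `[]` (protect nothing).

```python
import re

# Hypothetical, simplified stand-ins for the library's pattern lists.
DEFAULT_PROTECTED_PATTERNS = [
    r"(?:https?|ftp)://\S+",                  # URLs
    r"[\w\-.]+@(?:[\w\-]+\.)+[a-zA-Z]{2,}",   # email addresses
]


def naive_tokenize(text):
    """Toy tokenizer (NOT the Moses rules): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_with_defaults(text, protected_patterns=None):
    """Tokenize text while keeping spans that match a protected pattern intact.

    protected_patterns=None -> fall back to DEFAULT_PROTECTED_PATTERNS
    protected_patterns=[]   -> protect nothing
    """
    if protected_patterns is None:
        protected_patterns = DEFAULT_PROTECTED_PATTERNS

    protected = []

    def stash(match):
        # Remember the matched span and leave a placeholder in its place.
        protected.append(match.group(0))
        return " PROTECTED%d " % (len(protected) - 1)

    for pattern in protected_patterns:
        text = re.sub(pattern, stash, text)

    # Tokenize, then swap each placeholder back for the original span.
    # Caveat: input that already contains "PROTECTED<n>" would be mis-restored.
    tokens = []
    for tok in naive_tokenize(text):
        m = re.fullmatch(r"PROTECTED(\d+)", tok)
        if m and int(m.group(1)) < len(protected):
            tokens.append(protected[int(m.group(1))])
        else:
            tokens.append(tok)
    return tokens


url = "http://www.someurl.com"
with_defaults = tokenize_with_defaults(url)
without_protection = tokenize_with_defaults(url, protected_patterns=[])
```

With this sketch, `with_defaults` keeps the URL as a single token, while `without_protection` splits it on every punctuation mark. The toy tokenizer splits more aggressively than sacremoses does, so its unprotected output differs from the one reported above.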