hplt-project / sacremoses

Python port of Moses tokenizer, truecaser and normalizer
MIT License

Space deduplication #79

Open alvations opened 4 years ago

alvations commented 4 years ago

Moses default behavior is to use a regex to deduplicate spaces, https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507

But in Python, this can be done with " ".join(text.split()).

Would it be appropriate to change the space de-duplication behavior in sacremoses to use " ".join(text.split()) ?
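For concreteness, here is a minimal sketch (not sacremoses' actual code) of the two approaches side by side; on input with no leading or trailing whitespace they produce the same result:

```python
import re

text = "Hello   world \t again"

# Regex-based deduplication, as in the Moses-style tokenizer:
regex_result = re.sub(r"\s+", " ", text)

# Pure-Python alternative:
split_result = " ".join(text.split())

print(regex_result)  # 'Hello world again'
print(split_result)  # 'Hello world again'
```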

goodmami commented 4 years ago

The first issue is whether Python's str.split() uses the same definition of a space as \s in Python regular expressions. To test this, I took the list of unicode whitespace characters from Wikipedia and compared str.split() with re.sub():

import re

spaces = (
  '\t',      # tab
  '\n',      # newline
  '\v',      # vertical tab
  '\f',      # form feed
  '\r',      # carriage return
  ' ',       # space
  '\u0085',  # next line
  '\u00a0',  # non-breaking space
  '\u1680',  # ogham space mark
  '\u2000',  # en quad
  '\u2001',  # em quad
  '\u2002',  # en space
  '\u2003',  # em space
  '\u2004',  # three-per-em space
  '\u2005',  # four-per-em space
  '\u2006',  # six-per-em space
  '\u2007',  # figure space
  '\u2008',  # punctuation space
  '\u2009',  # thin space
  '\u200a',  # hair space
  '\u2028',  # line separator
  '\u2029',  # paragraph separator
  '\u202f',  # narrow no-break space
  '\u205f',  # medium mathematical space
  '\u3000',  # ideographic space
)

for sp in spaces:
    s = f'a{sp}b'
    assert ' '.join(s.split()) == re.sub(r'\s+', ' ', s)

I get no AssertionError when I run this with Python 3.6, so that's good news. Note that this is only the case if the regular expression is compiled in unicode mode, which is the default for Python 3. There may be some implementations or builds of Python that do not consider unicode spaces, but I am ignoring that possibility.
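To illustrate the unicode-mode caveat with a small (assumed, not from the thread) example: with the re.ASCII flag, \s stops matching non-ASCII whitespace such as the non-breaking space, while str.split() remains unicode-aware:

```python
import re

s = 'a\u00a0b'  # non-breaking space between the letters

# Default unicode mode: \s matches the non-breaking space.
assert re.sub(r'\s+', ' ', s) == 'a b'

# ASCII mode: \s is only [ \t\n\r\f\v], so nothing matches.
assert re.sub(r'\s+', ' ', s, flags=re.ASCII) == s

# str.split() splits on unicode whitespace regardless.
assert ' '.join(s.split()) == 'a b'
```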

The next question is whether ' '.join(s.split()) results in the same output as re.sub(r'\s+', ' ', s), and this is not always true, particularly when there are leading or trailing spaces:

>>> import re
>>> s = ' a   b '
>>> ' '.join(s.split())
'a b'
>>> re.sub(r'\s+', ' ', s)
' a b '

As long as you are already doing s.strip() before or after the space deduplication, you should be safe.
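That is, combining re.sub() with strip() recovers the exact str.split() behavior (a quick check, assuming the input above):

```python
import re

s = ' a   b '

# Once leading/trailing whitespace is stripped, the two approaches agree.
assert ' '.join(s.split()) == re.sub(r'\s+', ' ', s).strip() == 'a b'
```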

The final question (that I came up with) is whether doing so has any performance gains. I think in general a simple string operation is faster than invoking the regex engine, and timeit confirms it here:

$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' '" ".join(s.split())'
500000 loops, best of 5: 0.521 usec per loop
$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' 'r.sub(" ", s)'
100000 loops, best of 5: 3.73 usec per loop

So assuming you are currently removing leading or trailing spaces, deduplicating spaces with str.split() seems like a good idea.
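The same rough comparison can be reproduced with the timeit module directly, including the extra .strip() the regex path would need for identical output (numbers will vary by machine; this is a sketch, not a benchmark):

```python
import re
import timeit

r = re.compile(r"\s+")
s = "a " * 20

# Time the split/join idiom and the compiled-regex path with .strip().
t_split = timeit.timeit(lambda: " ".join(s.split()), number=100_000)
t_regex = timeit.timeit(lambda: r.sub(" ", s).strip(), number=100_000)

print(f"split/join: {t_split:.3f}s, regex: {t_regex:.3f}s")
```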

alvations commented 4 years ago

Thanks @goodmami for the analysis!!

Hmmm, it seems like treating characters such as em-space and ideographic space as whitespace might be problematic, because I think it affects other regexes in Moses. I'll have to take a closer look at the other Moses regexes that manipulate or use these "alternative" spaces.
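A quick illustration of that concern (a hypothetical input, not taken from Moses): str.split() collapses the "alternative" spaces into plain ASCII spaces, so any downstream regex that looks for those specific characters would no longer see them:

```python
s = 'a\u3000b\u2003c'  # ideographic space and em space

# Both characters are treated as separators and replaced by ASCII spaces.
print(' '.join(s.split()))  # 'a b c' -- the original space characters are gone
```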

goodmami commented 4 years ago

Yes, all I have shown is that the basic string operations are a suitable replacement for Python's re.sub() once leading and trailing spaces are accounted for; I make no claims either way about compatibility with Moses. If you're currently using re.sub() without trouble, then I don't expect any new issues to arise, but if it differs from Moses in some way, re.sub() would be the more flexible option for special-casing.