piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

preprocessing.strip_punctuation does not handle Unicode #2962

Open sciatro opened 4 years ago

sciatro commented 4 years ago

Problem description

RE_PUNCT in parsing/preprocessing.py, which is the substance of preprocessing.strip_punctuation, does not consider Unicode punctuation.

RE_PUNCT (=re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)) depends on the standard library string module's punctuation string, which is limited to ASCII punctuation.
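To see the limitation concretely, the pattern can be reconstructed from the definition above (a sketch; the real definition lives in gensim/parsing/preprocessing.py):

```python
import re
import string

# Reconstruction of RE_PUNCT from the definition above. string.punctuation
# is a fixed set of 32 ASCII characters, so the re.UNICODE flag has no
# effect on which characters the class matches.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)

print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print(RE_PUNCT.sub(' ', '"ascii" quotes go, “typographic” quotes stay'))
```

The ASCII double quotes and the comma are replaced, while the curly quotes pass through untouched.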

Steps/code/corpus to reproduce

>>> from gensim.parsing import preprocessing
>>> preprocessing.strip_punctuation('This is a “quoted string” which has the typographic quotes whereas "this one" does not.')
'This is a “quoted string” which has the typographic quotes whereas  this one  does not '

For the above input I think the correct output would be:

'This is a  quoted string  which has the typographic quotes whereas  this one  does not '

Possible solutions

In the above example, my particular choice of typographic quotes is unimportant, but it dodges the hard part of any solution: settling on a suitable definition of punctuation, given the number of possibilities in Unicode and the ambiguity around some of their uses.

I can think of three large classes of response:

  1. Document the limitation and leave the behaviour as-is.
  2. Use equivalency tables to transliterate the input, then apply the existing ASCII rules.
  3. Remove characters based on their Unicode categories.

I found this helpful in exploring possible answers to my particular use case:

import sys
import unicodedata

# Enumerate every code point whose Unicode general category starts with
# 'P' (punctuation) or 'S' (symbol).
to_look_at = [(i, chr(i)) for i in range(sys.maxunicode)
              if unicodedata.category(chr(i))[0] in ('P', 'S')]
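As a quick check against the example above, the typographic quotes fall squarely in the Unicode punctuation categories that string.punctuation misses:

```python
import string
import unicodedata

# U+201C / U+201D are categorised as Pi / Pf (initial / final quote
# punctuation), yet neither appears in the ASCII-only string.punctuation.
for ch in '“”':
    print(ch, unicodedata.category(ch), ch in string.punctuation)
```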

Thanks for all the hard work on this great library.

piskvorky commented 4 years ago

You are right. All these functions in gensim.preprocessing are fairly naive, they won't stand up to deep industry use. My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this. And then for the rest of your pipeline, it becomes garbage-in, garbage-out…

We're still deciding whether to axe gensim.preprocessing completely, so as not to mislead users into unrealistic expectations about its ability, or keep & improve it incrementally. CC @mpenkov .

sciatro commented 4 years ago

As a batteries-included set of first-pass utilities, I find them very useful when starting any project. It's really just this punctuation issue that comes up with any regularity for me.

piskvorky commented 4 years ago

OK good.

A PR to improve the preprocessing functionality (~better punctuation) is welcome. As long as you're aware the future of preprocessing is uncertain, feel free to use & improve it!

sciatro commented 4 years ago

In terms of improving, I guess the question is in what way.

Of the three conceptual directions I outlined under Possible solutions above:

  1. Documenting the limitation and otherwise leaving it alone may be best, given the uncertainty. The patch would just add the qualifier "ASCII" to the docstring for strip_punctuation, i.e. """Replace punctuation characters... becomes """Replace ASCII punctuation characters....

  2. Using equivalency tables to transliterate the input and then applying the ASCII rules works well as a simple first pass in my experience, but it requires either a new dependency or a commitment to maintaining the tables. Neither a new dependency nor a data-maintenance project seems in line with the ambivalence about the future of this functionality. If you're open to a new dependency, the patch is just to pass the input string through unidecode.unidecode.

  3. Character removal based on Unicode categories is easy enough to do (± special emoji handling 🤷‍♀️). Doing so is mostly an architectural question about whether you want to put the literal in the source or enumerate the instances of each category at runtime, i.e. do you want to put the to_look_at literal value under version control, or put for i in range(sys.maxunicode) if unicodedata.category(chr(i)) ... under version control? That architectural choice requires perspective on the maintainability of a big, important library (which I don't have).
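The third option could look roughly like the following. This is only a sketch, not gensim API: the function name unicode_strip_punctuation is mine, it takes the runtime-enumeration branch of the architectural question, and it folds symbols ('S') in with punctuation ('P'), which may or may not be desirable.

```python
import sys
import unicodedata

# Build the character set once, at import time. Scanning all ~1.1M code
# points takes a moment; that cost is part of the tradeoff versus
# committing a precomputed literal to version control.
_PUNCT_AND_SYMBOLS = {
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i))[0] in ('P', 'S')
}
_STRIP_TABLE = {ord(c): ' ' for c in _PUNCT_AND_SYMBOLS}

def unicode_strip_punctuation(s):
    """Replace Unicode punctuation (and symbol) characters with spaces."""
    return s.translate(_STRIP_TABLE)
```

Note one behavioural difference from RE_PUNCT: each character maps to one space, so runs of punctuation are not collapsed the way the `([...])+` regex collapses them.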

TL;DR:

piskvorky commented 4 years ago

Option 1) sounds good to me, as a first step. As always, a pull request with the fix (fixed documentation) is welcome :)

mpenkov commented 4 years ago

> My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this.

@piskvorky Perhaps we should make this obvious in the module docstring?

piskvorky commented 4 years ago

Maybe. We talk about it in the core tutorials.