Open sciatro opened 4 years ago
You are right. All these functions in gensim.preprocessing are fairly naive; they won't stand up to deep industry use. My recommendation for non-academic (= non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this. And then for the rest of your pipeline, it becomes garbage-in, garbage-out…
We're still deciding whether to axe gensim.preprocessing completely, so as not to mislead users into unrealistic expectations about its abilities, or keep & improve it incrementally. CC @mpenkov.
As a batteries-included set of first-pass utilities, I find them very useful when starting any project. It's really just this issue of punctuation that comes up with any regularity for me.
OK good.
A PR to improve the preprocessing functionality (~better punctuation) is welcome. As long as you're aware the future of preprocessing is uncertain, feel free to use & improve it!
In terms of improving: I guess the question is in what way.
Of the three conceptual directions I outlined under Possible solutions above:
Document the limitation and let it be may be best, given the uncertainty. The patch would just be to add the qualification "ASCII" to the docstring for `strip_punctuation`, i.e. `"""Replace punctuation characters...` becomes `"""Replace ASCII punctuation characters...`.
Using equivalency tables to mutate the input and then applying ASCII rules works well as a simple first pass in my experience, but it does require either a new dependency or a commitment to maintaining the tables. Neither a new dependency nor a data-maintenance project seems in line with the ambivalence about the future of this functionality. If you're open to a new dependency, the patch is just to pass the input string through `unidecode.unidecode`.
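As a stdlib-only sketch of the equivalency-table idea (the actual patch described above would call `unidecode.unidecode` instead of maintaining a table by hand), something like this folds common Unicode punctuation to ASCII look-alikes before applying the existing ASCII rule. The table here is illustrative and far from complete:

```python
import re
import string

# Hand-maintained equivalency table: a few common Unicode punctuation
# marks mapped to ASCII look-alikes. Illustrative only, not exhaustive.
EQUIV = str.maketrans({
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u2013': '-',   # en dash
    '\u2014': '-',   # em dash
})

# Same ASCII-only pattern as gensim's RE_PUNCT.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)

def strip_punctuation_equiv(s):
    """Fold Unicode punctuation to ASCII first, then apply the ASCII rule."""
    return RE_PUNCT.sub(' ', s.translate(EQUIV))

print(strip_punctuation_equiv('he said \u201chi\u201d'))  # -> 'he said  hi '
```

The trade-off is exactly the one described above: the table lives under version control and someone has to keep it complete, whereas `unidecode` outsources that maintenance to a dependency.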
Character removal based on Unicode categories is easy enough to do (± special emoji handling 🤷‍♀️). Doing so is mostly an architectural question about whether you want to put the literal in the source or enumerate the instances of each category at runtime, i.e. do you want to put the `to_look_at` literal value under version control, or put `[chr(i) for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')]` under version control? That architectural choice requires perspective on the maintainability of a big, important library (which I don't have).
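The runtime-enumeration variant of that choice might look like the sketch below. The function name is hypothetical, not gensim API; it builds the punctuation character class from Unicode general categories (those starting with `P`) instead of `string.punctuation`:

```python
import re
import sys
import unicodedata

# Enumerate every character whose Unicode general category is a
# punctuation category ('Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po').
# This runs once at import time rather than living as a literal in source.
PUNCT_CHARS = ''.join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

RE_UNICODE_PUNCT = re.compile('([%s])+' % re.escape(PUNCT_CHARS))

def strip_unicode_punctuation(s):
    """Replace runs of Unicode punctuation (category P*) with a space."""
    return RE_UNICODE_PUNCT.sub(' ', s)

print(strip_unicode_punctuation('he said \u201chi\u201d'))  # -> 'he said  hi '
```

Note that symbols like `+` or `$` fall under the `S*` categories, not `P*`, so this is stricter than `string.punctuation` in some spots even while being far broader overall; which definition is "correct" is exactly the ambiguity discussed below.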
TL;DR:
Option 1) sounds good to me, as a first step. As always, a pull request with the fix (fixed documentation) is welcome :)
My recommendation for non-academic (=non-toy) projects is always to roll your own preprocessing for your problem domain, because all NLP libraries (gensim included) are kinda generic and rubbish at this.
@piskvorky Perhaps we should make this obvious in the module docstring?
Maybe. We talk about it in the core tutorials.
Problem description
`RE_PUNCT` in `parsing/preprocessing.py`, which is the substance of `preprocessing.strip_punctuation`, does not consider Unicode punctuation. `RE_PUNCT` (= `re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)`) depends on the standard library `string` module's `punctuation` string, which is limited to ASCII punctuation.
Steps/code/corpus to reproduce
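A minimal reproduction, using the `RE_PUNCT` definition quoted above so it runs standalone; the sample sentence is my own illustration, not taken from the original report:

```python
import re
import string

# Same pattern as gensim's RE_PUNCT: string.punctuation is ASCII-only,
# so re.UNICODE does not widen what the character class matches.
RE_PUNCT = re.compile(r'([%s])+' % re.escape(string.punctuation), re.UNICODE)

def strip_punctuation(s):
    """Replace runs of (ASCII) punctuation with a single space."""
    return RE_PUNCT.sub(' ', s)

ascii_quotes = strip_punctuation('he said "hi"')            # straight quotes
curly_quotes = strip_punctuation('he said \u201chi\u201d')  # typographic quotes

print(ascii_quotes)  # he said  hi      (ASCII quotes stripped)
print(curly_quotes)  # he said “hi”     (Unicode quotes untouched)
```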
For the above input I think the correct output would be:
Possible solutions
In the above example my choice of typographic quotes was unimportant, but it dodges the hard part of a solution, which will be a suitable definition of punctuation, given the number of possibilities in Unicode and the ambiguity around some associated uses of those possibilities.
I can think of three large classes of response:
I found this helpful in exploring possible answers to my particular use case:
Thanks for all the hard work on this great library.