piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

`sklearn_api.D2VTransformer` input should match other sklearn text preprocessing #2403

Open aspanghe opened 5 years ago

aspanghe commented 5 years ago

Hi all,

Thanks for the great tool, I've been using gensim for years and I love it.

One thing that would be a huge convenience, and would take literally two seconds to implement, would be to extend sklearn_api.D2VTransformer to handle doc-inputs that are plain strings.

Currently, the module expects each doc to be a list of strings. I understand that users might want to chunk strings themselves for various purposes (e.g. clustering noun phrases), but this precludes a valuable use case for this model.

Consider the common supervised model-comparison:

p_1 = Pipeline([('cv', CountVectorizer()), ('clf', clf)])
p_2 = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
p_3 = Pipeline([('d2v', D2VTransformer()), ('clf', clf)])

for p in [p_1, p_2, p_3]:
    pass  # compare accuracies...

Right now, this will fail, because cv and tfidf do fit_transform on a list of strings, while d2v does fit_transform on a list of lists of strings.
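To make the shape mismatch concrete, here's a tiny illustration (the data is made up):

```python
# sklearn's CountVectorizer/TfidfVectorizer consume raw strings,
# while D2VTransformer consumes lists of tokens.
docs = ["the cat sat", "dogs bark"]      # what the sklearn vectorizers expect
tokenized = [d.split() for d in docs]    # what D2VTransformer expects
print(tokenized)  # → [['the', 'cat', 'sat'], ['dogs', 'bark']]
```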

This would entail a very simple change to these lines of code:

https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/sklearn_api/d2vmodel.py#L198-L199

to first check whether each doc is a raw string that needs splitting into tokens, and otherwise fall back to the list-wrapping behavior that's currently there.

I'm happy to make the change, but please let me know if there's a compelling reason to keep the current behavior.

Alex

piskvorky commented 5 years ago

Hi @aspanghe , doc2vec (as well as word2vec, fasttext, etc.) expects a list of words for a single document.

Then a corpus (X in sklearn-speak) is a sequence of such lists of words.

What exactly is the inconsistency?

If you're suggesting split() as a built-in hardwired tokenization step, I'm -1 on that (that's not a good tokenization).

aspanghe commented 5 years ago

Hi @piskvorky thank you for the response, and thanks for the great package -- like I said, years of use here.

I assume that the purpose of these SKLearn wrapper methods is to make gensim libraries -- word2vec, doc2vec, LDA -- share the same interface as SKLearn packages. If true, then this is great: a user can easily compare document-processing using Gensim packages with document-processing using SKLearn packages, and any other text-processing.

(Without digressing too much, I think what sklearn has done in terms of standardizing and abstracting model interfaces, to allow for a model-agnostic comparison framework, is its single most important contribution.)

In the example I gave -- excuse the lack of imports -- I'm comparing text-classification pipelines that start with an embedding layer (SKLearn: {TFIDF, CountVec}, Gensim: {Doc2Vec}) and finish with some classifier. (Feeding document vectors into supervised models has been the most common use-case I've seen for Doc2Vec, but you can imagine a similar comparison pipeline for any task that starts with a vector-embedding of documents.)

The inconsistency thus is not within Gensim; it's between Gensim and SKLearn. Gensim expects a corpus to be a list of lists of strings, while SKLearn expects a list of strings. Thus, a side-by-side comparison requires some awkwardness on the user side. Since Gensim is the one interfacing with SKLearn, I think it's on Gensim to match SKLearn's inputs, and not the other way around.

Again, correct me if I'm wrong, but I see no other higher purpose for these SKLearn wrapper methods. Is there one?

I'm tokenization-method agnostic, so feel free to suggest another!

Alex

piskvorky commented 5 years ago

You're completely right regarding the purpose of these wrappers (and agreed on sklearn's lego-like awesomeness).

I guess I'm not clear on what the "desired behaviour" is. If the input is a "list of strings" -- what does each string represent?

If the answer is "full document", then that cannot be an input to Gensim. Gensim models don't do tokenization, they expect already-tokenized input.

So any preprocessing / tokenization steps must happen inside your sklearn pipeline, prior to using D2V. Something like p_4 = Pipeline([('doc2tokens', Tokenizer()), ('d2v', D2VTransformer()), ('clf', clf)]), where Tokenizer is something that splits your documents (strings) into tokens (lists of strings), using whatever preprocessing logic is appropriate for your application.
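For concreteness, a minimal Tokenizer step might look like the following sketch (the class name and default regex are illustrative; sklearn pipelines only require fit/transform via duck typing, so no sklearn base classes are strictly needed):

```python
import re

class Tokenizer:
    """Sketch of a stateless, sklearn-compatible tokenization step.
    The default token pattern mirrors CountVectorizer's, but any
    application-specific preprocessing could go here instead."""

    def __init__(self, token_pattern=r"(?u)\b\w\w+\b"):
        self.token_pattern = token_pattern

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        pattern = re.compile(self.token_pattern)
        # Split each document (string) into a list of lowercased tokens.
        return [pattern.findall(doc.lower()) for doc in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

print(Tokenizer().transform(["Hello World, again"]))
# → [['hello', 'world', 'again']]
```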

Does that make sense?

aspanghe commented 5 years ago

Yes, you're correct, the answer is "full document".

The answer you suggest is indeed possible, but in my opinion it's not ideal. If the purpose of these wrappers is to match SKLearn's interface, then I'd advocate making these as functionally similar to SKLearn's feature-extraction steps as possible. For instance, CountVectorizer has a number of pass-ins:

CountVectorizer(input='content', lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', ...)

These parameters determine how strings are tokenized. One can argue that, taken together, all of these pass-ins make SKLearn feature extractors overloaded. But I think expecting a user to write their own tokenizer class is overkill, and it will limit the number of users willing to go through the effort of using your wonderful libraries in SKLearn pipelines. Especially if they're just trying to match SKLearn defaults, I think most people would expect similar pass-ins and similar defaults.

SKLearn sets the default tokenizer for CountVec here:

https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/feature_extraction/text.py#L261-L266

Where token_pattern is a regex with the default: token_pattern=r"(?u)\b\w\w+\b"

Again it's your call, I'm just saying, going back to the point of these wrappers, I think most users would expect to be able to compare D2V directly with SKLearn Feature extractors like CountVec, TfidfVec or LDAVec without doing extra boilerplate.
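For reference, that default pattern keeps runs of two or more word characters, dropping punctuation and one-character tokens. Applying it by hand:

```python
import re

# scikit-learn's documented default token_pattern for CountVectorizer
token_pattern = r"(?u)\b\w\w+\b"
print(re.findall(token_pattern, "Don't count 1-letter tokens, e.g. a or I."))
# → ['Don', 'count', 'letter', 'tokens', 'or']
```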

aspanghe commented 5 years ago

By the way, I'd be more than happy to do the implementation, if you agree with me. I think it'd really help make these wrappers more useful.

piskvorky commented 5 years ago

I guess we could have the wrapper accept a tokenizer parameter: a callable (code injection) that transforms a unicode string into a list of unicode strings. But I think a separate pipeline step is both cleaner and more flexible.

@mpenkov thoughts?

mpenkov commented 5 years ago

I agree with @piskvorky, I think this should be a separate pipeline step.

Yes, it places an additional burden on the shoulders of the user: they have to write their own tokenizer class. However, in my opinion, this burden "goes with the territory" of integrating multiple libraries (in this case, gensim and sklearn).

I think we should author a recipe for integrating gensim in an sklearn pipeline and include it in our documentation. An example of the tokenizer class should go there.

piskvorky commented 5 years ago

Yes, an example sklearn Pipeline step for tokenization would go a long way. Especially if we make it easy to discover for users.

I'd like to encourage users to adapt that blueprint to match their own preprocessing needs, instead of relying on some hard-wired parameters. This follows my experience that good preprocessing is critical for ML success (and that common libs like NLTK, spacy, gensim.utils.simple_preprocess etc. don't do a great job outside of easy, well-behaved domains like "news articles").

aspanghe commented 5 years ago

This makes sense. Yeah, the black magic of NLP in general is that almost all papers talk about their wonderful method but leave out the details of their preprocessing, which often has more impact.

I'd love to see a more robust discussion of preprocessing in common libraries you mention. Maybe this is the right way to start.

Let me know if I can be of any help!

Alex

mpenkov commented 5 years ago

Alex, are you able to write a tutorial (or even some sample code) on how to achieve what you are trying to do with gensim? A PR would be awesome.

aspanghe commented 5 years ago

Yes indeed, I can write a tutorial.

Is your preferred way of uploading tutorials via ipython notebook? That would be my preference, but I can match the style of your website.

Alex


piskvorky commented 5 years ago

I'd prefer an example, or at least an in-your-face link to an example we could re-use across the *2vec wrappers, directly in https://radimrehurek.com/gensim/sklearn_api/d2vmodel.html. That is, in the module docstring.

Maybe that's a question for you @aspanghe : as a user, where would you look for / expect to see such information?

mpenkov commented 5 years ago

@aspanghe We are currently thinking about the best way to present our docs, because the current arrangement is due for an overhaul. So, there is no one best way for you to author the tutorial at the moment.

I think the most important thing here is content, not format. If you haven't seen it already, I recommend you check out this write-up (https://www.divio.com/blog/documentation/) to get into the right mode for writing a tutorial (and documentation in general). Come to think of it, we may be looking for more of a "how-to" guide (e.g. how to integrate gensim in a sklearn pipeline) as opposed to a tutorial here. This tutorial-vs-how-to point is totally open for discussion, but I think having a clear goal would help.

With the format, if you prefer an ipython notebook, then it's fine by me. Once we work out what we want to do with our documentation, we may reformat your ipython notebook as something else.

aspanghe commented 5 years ago

I'll get a draft out today hopefully.

I agree -- I think as a user, an "in-your-face" example directly in the docs is the first thing I'd look for. Then, a how-to on using Gensim in a larger SKLearn comparison pipeline would probably help.

I also think a mini-tutorial is in order: your design decision here to make the tokenizing explicit necessitates a mini-tutorial on different tokenizing approaches, perhaps after the how-to. I don't think there are any good tutorials on tokenizing philosophies or strategies, and if you're asking the user to do it explicitly, you're elevating it to a level of importance where some deeper guidance would be helpful. I'm certainly no expert on tokenizing, given the genesis of this whole discussion, so maybe I'll throw a rough draft together and then solicit comments from you?

Thanks so much for your time here, guys, for allowing me to contribute, and for the documentation tips, @mpenkov.

Alex


alex2awesome commented 5 years ago

Here's a rough-draft notebook showing how to compare gensim sklearn wrappers alongside other sklearn methods... let me know if you'd like me to take it in a different direction:

https://github.com/alex2awesome/gensim-sklearn-tutorial/blob/master/notebooks/gensim-in-sklearn-pipelines.ipynb

I noticed the following three points that I think could be small fixes to make things easier:

alex2awesome commented 5 years ago

Oh, btw, this is my other GitHub account. @aspanghe is my school handle; this is my personal handle, so it's easier to put public files on here.