marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier

make_pipeline with gensim's doc2vec #616

Closed: tommydino93 closed this issue 3 years ago

tommydino93 commented 3 years ago

Hi All!

I am trying to use explainer.explain_instance with the doc2vec embeddings provided by gensim and a random forest classifier. I managed to reproduce this example with tfidf, but I can't manage to build an equivalent sklearn pipeline with Doc2Vec (see the function extract_lime_explanation_d2v below).
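For reference, the tfidf version I reproduced follows roughly this pattern (minimal sketch; train_texts, train_labels, and test_texts stand in for my actual data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# the fitted pipeline goes straight from raw strings to class probabilities
c = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=500))
c.fit(train_texts, train_labels)  # train_texts: list of raw strings

explainer = LimeTextExplainer(class_names=["stable", "unstable"])
exp = explainer.explain_instance(test_texts[0], c.predict_proba, num_features=6)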

Any help would be appreciated :)

Here's the (pseudo) code I have so far:

# ----------------------------------------- IMPORTS -----------------------------------------
import os
import multiprocessing
import pandas as pd
import numpy as np
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle  # shuffle() below must return the shuffled list, so sklearn's version is assumed
from lime.lime_text import LimeTextExplainer

# ----------------------------------------- DEFINE FUNCTIONS -----------------------------------------
def create_doc2vec_model(alg_type, vector_size, window, neg_words, min_count, sample, epochs):
    cores = multiprocessing.cpu_count()  # save number of available CPUs (threads)
    model_dbow = Doc2Vec(dm=alg_type,  # 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)
                         vector_size=vector_size,  # dimensionality of the feature vectors
                         window=window,  # max distance between the current and predicted word within a sentence
                         hs=0,  # disable hierarchical softmax so that negative sampling is used
                         negative=neg_words,  # how many "noise" words to draw for negative sampling
                         min_count=min_count,  # ignore all words with total frequency lower than this
                         sample=sample,  # threshold for randomly downsampling higher-frequency words
                         workers=cores,  # number of worker threads used to train the model
                         epochs=epochs)  # number of iterations (epochs) over the corpus
    return model_dbow

def create_and_train_doc2vec(doc2vec_type, vs, train_tagged):
    model_dbow = create_doc2vec_model(alg_type=doc2vec_type,  # choose whether to use PV-DM or PV-DBOW
                                      vector_size=vs,  # set dimensionality of feature vectors
                                      window=5,  # set max distance between the current and predicted word within a sentence
                                      neg_words=5,  # specify how many "noise" words to draw
                                      min_count=2,  # ignore all words with total frequency lower than this
                                      sample=0,  # threshold for configuring which higher-frequency words are randomly downsampled
                                      epochs=100)  # number of iterations (epochs) over the corpus

    # build vocabulary from the sequence of train documents
    model_dbow.build_vocab([x for x in train_tagged.values])

    # train Doc2Vec model
    model_dbow.train(shuffle([x for x in train_tagged.values]),
                     total_examples=len(train_tagged.values),  # count of documents
                     epochs=model_dbow.epochs)  # use number of epochs specified when creating the model

    return model_dbow

def vec_for_learning(model, tagged_docs):
    documents = tagged_docs.values  # type: np.ndarray
    doc_2_embedding_mapping = {}  # type: dict
    targets_list = []
    regressors_list = []

    # infer vector representation from trained model
    for doc in documents:
        targets_list.append(doc.tags[0])
        embedding_vector = model.infer_vector(doc.words, steps=20)  # type: np.ndarray (gensim >= 4.0 renamed steps= to epochs=)
        regressors_list.append(embedding_vector)
        doc_as_string = " ".join(doc.words)
        doc_2_embedding_mapping[doc_as_string] = embedding_vector

    targets_np = np.asarray(targets_list)
    regressors_np = np.asarray(regressors_list)
    return regressors_np, targets_np, doc_2_embedding_mapping

def pipeline_d2v(test_tagged, vectorizer, random_forest):
    x_test_embedded, _, _ = vec_for_learning(vectorizer, test_tagged)
    return random_forest.predict_proba(x_test_embedded)

def extract_lime_explanation_d2v(idx_doc_to_investigate, vectorizer, random_forest, x_test, test_tagged):
    class_names = ["stable", "unstable"]
    explainer = LimeTextExplainer(class_names=class_names)
    c = pipeline_d2v(test_tagged, vectorizer, random_forest)
    exp = explainer.explain_instance(x_test[idx_doc_to_investigate], c, num_features=6)
    explanation_list = exp.as_list()
    return explanation_list

# ----------------------------------------- BEGIN SCRIPT -----------------------------------------
# a bunch of stuff ...
model_d2v = create_and_train_doc2vec(doc2vec_types, vector_sizes, ext_train_tagged)
random_forest = RandomForestClassifier(n_estimators=501, max_features=max_features_random_forest)  # define classifier
random_forest.fit(x_ext_train_embedded, y_ext_train)  # train
explanation_list_fn = extract_lime_explanation_d2v(one_idx, model_d2v, random_forest, x_test, test_tagged)

where x_test is a list of documents and test_tagged is a pandas Series of gensim TaggedDocument objects

Thanks in advance!

marcotcr commented 3 years ago

The second argument to explain_instance is a prediction function that takes a list of strings as input and outputs an array of prediction probabilities. What you are passing in is the output of pipeline_d2v, not a function.
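That is, LIME needs something it can call on the perturbed strings it generates, with this interface (sketch; the constant probabilities are just a placeholder for your real embedding + classifier):

import numpy as np

def predict_fn(texts):
    # texts: list of raw strings generated by LIME
    # must return an array of shape (len(texts), n_classes) of probabilities
    return np.full((len(texts), 2), 0.5)  # placeholder: uniform probabilities for 2 classes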

tommydino93 commented 3 years ago

Hi Marco, Thanks for your reply! Would it work if I turn pipeline_d2v into a class and create a predict_proba method? Or do you think there is a faster workaround? Thanks again

marcotcr commented 3 years ago

You don't need a class; you can just create a function. See #172 and #200 for examples with other models.
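For instance (untested sketch; make_predict_fn is just a name for the wrapper, and document_as_string stands for the raw text of the instance to explain):

import numpy as np

def make_predict_fn(model_d2v, random_forest):
    def predict_fn(texts):
        # embed each perturbed string with the trained Doc2Vec model
        vectors = np.asarray([model_d2v.infer_vector(t.split(), steps=20) for t in texts])
        return random_forest.predict_proba(vectors)  # shape: (len(texts), n_classes)
    return predict_fn

exp = explainer.explain_instance(document_as_string, make_predict_fn(model_d2v, random_forest), num_features=6)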

tommydino93 commented 3 years ago

Hi Marco, Thanks again for your reply and for the examples. I followed #172 and modified pipeline_d2v into:

def pipeline_d2v(x_test_list_of_strings, y_test, model_d2v, random_forest):
    x_test = [text.split() for text in x_test_list_of_strings]
    test_data = pd.DataFrame({'reports': x_test, 'global_labels': y_test})
    test_tagged = test_data.apply(lambda r: TaggedDocument(words=r['reports'], tags=[r.global_labels]), axis=1)
    x_test_embedded = vec_for_learning_no_labels(model_d2v, test_tagged)  # variant of vec_for_learning that returns only the embeddings
    return random_forest.predict_proba(x_test_embedded)

def extract_lime_explanation_d2v(idx_doc_to_investigate, vectorizer, random_forest, x_test, y_test, out_dir, cnt_document, prediction, embedding, save=True):
    class_names = ["stable", "unstable"]
    explainer = LimeTextExplainer(class_names=class_names)
    x_test_list_of_strings = [' '.join(x) for x in x_test]
    c = pipeline_d2v(x_test_list_of_strings, y_test, vectorizer, random_forest)
    exp = explainer.explain_instance(x_test[idx_doc_to_investigate], c, num_features=6)

Now pipeline_d2v takes x_test_list_of_strings as input and outputs prediction probabilities (variable c), like this:

[screenshot of the predicted probability array omitted]

However, the line

exp = explainer.explain_instance(x_test[idx_doc_to_investigate], c, num_features=6)

still gives me the error:

File "/home/newuser/PycharmProjects/Medical_Reports/utils.py", line 1237, in extract_lime_explanation_d2v
    exp = explainer.explain_instance(x_test[idx_doc_to_investigate], c, num_features=6)
  File "/home/newuser/PycharmProjects/Medical_Reports/venv3/lib/python3.6/site-packages/lime/lime_text.py", line 411, in explain_instance
    mask_string=self.mask_string))
  File "/home/newuser/PycharmProjects/Medical_Reports/venv3/lib/python3.6/site-packages/lime/lime_text.py", line 114, in __init__
    self.as_list = [s for s in splitter.split(self.raw) if s]
TypeError: expected string or bytes-like object

What could be the problem? Maybe the additional inputs to pipeline_d2v? But I need those to embed the documents.

Thank you very much again for your time

marcotcr commented 3 years ago