vi3k6i5 / GuidedLDA

semi supervised guided topic model with custom guidedLDA
Mozilla Public License 2.0
499 stars 110 forks

Predict topics #14

Open berndartmueller opened 6 years ago

berndartmueller commented 6 years ago

Hello,

First off, thanks for this great library! I managed to get the training working, but right now I'm struggling to predict the best-matching topics for a given (single) document.

I already tried doc_topics = model.transform(Z), but how do I now get the probabilities for the (e.g. 7) topics?

Thanks!

YipingNUS commented 6 years ago

Hi @berndartmueller, the model.transform(X) method itself returns the probability distribution.

I added the following lines in the example code and it worked as expected:

print("\nPredicting topic for the first document")
doc_topic = model.transform(X[0,:])  # predict the topic distribution for the first document
print(doc_topic) 

out: [[3.97730781e-05 1.86927840e-01 2.05632359e-02 4.18205495e-03
  7.88287096e-01]]

As commented in the docstring of transform method,

        Returns
        -------
        doc_topic : array-like, shape (n_samples, n_topics)
            Point estimate of the document-topic distributions
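
To pick out the best-matching topics from that row, one option is to rank the topic indices by probability. The snippet below is only a minimal sketch: it reuses the numbers printed above as a plain Python list (the variable names `probs`, `ranked` and `best_topic` are made up for illustration), on the assumption that the row returned by `transform` has been converted with `doc_topic[0].tolist()`.

```python
# Probabilities copied from the output printed above. model.transform returns
# a NumPy array; doc_topic[0].tolist() would give a plain list like this one.
probs = [3.97730781e-05, 1.86927840e-01, 2.05632359e-02,
         4.18205495e-03, 7.88287096e-01]

# Topic indices sorted from most to least probable.
ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
best_topic = ranked[0]

for k in ranked:
    print(f"topic {k}: {probs[k]:.4f}")
```

Each row sums to roughly 1, so the values can be read directly as per-topic probabilities for that document.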

Praveenrajan27 commented 5 years ago

Hi @berndartmueller, to predict topics for new documents we could use the model.fit_transform(dtm) method. It worked when I used it to predict incoming documents based on the trained model.

ImSajeed commented 5 years ago

Hi @vi3k6i5, @berndartmueller, @Praveenrajan27, @YipingNUS, can someone help me understand whether the words in new input text for prediction must already exist in the dictionary?

I'm using the code below to convert gensim data to a document-term matrix:

import numpy as np
from gensim import matutils
from gensim.matutils import corpus2csc

def bow_iterator(docs, dictionary):
    # Yield one bag-of-words vector per tokenized document
    for doc in docs:
        yield dictionary.doc2bow(doc)

def get_term_matrix(msgs, dictionary):
    bow = bow_iterator(msgs, dictionary)
    # corpus2csc returns terms x documents; transpose to documents x terms
    X = np.transpose(matutils.corpus2csc(bow).astype(np.int64))
    return X

X = get_term_matrix(bigram_train, train_id2word)

For predicting:

X_test = get_term_matrix([['new','travles','comfort']], train_id2word)
y_pred = model.fit_transform(X_test)

When predicting on the test input, I get an error saying that x is not a positive value.

YipingNUS commented 5 years ago

@ImSajeed, yes you need to make sure you use the same vocab for training and prediction. In sklearn, that would correspond to fit_transform for training and transform for test/prediction.

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
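
To illustrate the "same vocab" point without depending on any particular library, here is a rough, standard-library-only sketch. The helpers `build_vocab` and `to_count_row` and the toy training documents are hypothetical, not part of GuidedLDA, gensim or sklearn. The vocabulary is built once from the training documents, and a new document is then mapped onto those same columns, with unseen words simply dropped.

```python
from collections import Counter

def build_vocab(docs):
    """Assign a column index to every distinct training token."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_count_row(doc, vocab):
    """Count tokens into a row as wide as the training vocabulary."""
    row = [0] * len(vocab)
    for tok, n in Counter(doc).items():
        idx = vocab.get(tok)
        if idx is not None:  # out-of-vocabulary words are ignored
            row[idx] = n
    return row

# Toy training data (made up for this example).
train_docs = [["room", "clean", "comfort"], ["travel", "comfort", "travel"]]
vocab = build_vocab(train_docs)

# "new" and "travles" were never seen in training, so only "comfort" counts.
new_row = to_count_row(["new", "travles", "comfort"], vocab)
print(new_row)
```

The row for the new document has exactly as many columns as the training matrix, which is what the trained model expects. With gensim, passing num_terms=len(dictionary) to corpus2csc should pin the width in the same way, though I have not checked that against the code above.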

btw, I found GuidedLDA is good for inferring topics, but it does a poor job in classification. The following repo works much better. The downside is that it's much harder to set up. It took me two weeks to refactor it so that it can predict for new documents (the original repo requires all documents to be indexed in Lucene up-front).

https://github.com/WHUIR/STM

ImSajeed commented 5 years ago

Hi @vi3k6i5, @YipingNUS, @Praveenrajan27, could you please help with the issue below?

I'm having trouble predicting the topics for new documents using y_pred = model.fit(X_test) or y_pred = model.fit_transform(X_test).

y_pred = model.fit(X_test) gives an irrelevant topic distribution.

y_pred = model.fit_transform(X_test) does not match the correct existing topics.

The same model predicts the right topics for the training documents using y_pred = model.fit_transform(X_test), but it does not work for new documents.

Please let me know the right way to predict topics for a new document.

Code below:

X_test = get_term_matrix([['blankets not','not clean']], train_id2word)
y_pred = model.fit_transform(X_test)

YipingNUS commented 5 years ago

@ImSajeed, below is my code that worked. You should use transform instead.


def predict_prob(text):
    """ return the probability vector for the input text to belong to each of the topics
    """
    text_vec = tf_vectorizer.transform([text])
    doc_topic = seeded_model.transform(text_vec)
    return doc_topic

ImSajeed commented 5 years ago

Hi @vi3k6i5 @YipingNUS

Could you please let me know the importance of the refresh parameter used in GuidedLDA?