gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
627 stars 120 forks

How to retrieve documents according to their topic #8

Closed dongqing7 closed 7 years ago

dongqing7 commented 7 years ago

Hi Greg, after successfully fitting the models, how should I retrieve all the documents according to their topic?

ryanjgallagher commented 7 years ago

There are a couple of different ways to get the documents for each topic. You could use the p_y_given_x or log_p_y_given_x attributes to rank which documents are most probable for each topic. You could also get a binary classification of each document in each topic from labels (which applies a softmax from p_y_given_x).

You can also use log_z to rank which documents are "explained" the most by each topic according to pointwise total correlation. If you're looking for something simple, though, labels or p_y_given_x will probably be enough. Note that CorEx is a discriminative model: it estimates the probability that a document belongs to a topic separately for each topic, so the probabilities for a single document don't have to add up to 1.
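
To make the ranking idea concrete, here is a minimal sketch. It assumes a documents-by-topics table of probabilities like the one p_y_given_x holds; the toy values, the helper names top_documents_per_topic and documents_in_topic, and the 0.5 cutoff are all made up for illustration and are not part of CorEx. It uses plain Python so it runs without the library installed.

```python
def top_documents_per_topic(doc_topic_probs, n_top=3):
    """For each topic, return the indices of the n_top most probable documents.

    doc_topic_probs is indexed as [document][topic]; this layout is an
    assumption about how the probabilities are stored.
    """
    n_docs = len(doc_topic_probs)
    n_topics = len(doc_topic_probs[0])
    ranked = []
    for t in range(n_topics):
        # Sort document indices by their probability for topic t, highest first.
        order = sorted(range(n_docs),
                       key=lambda d: doc_topic_probs[d][t],
                       reverse=True)
        ranked.append(order[:n_top])
    return ranked


def documents_in_topic(doc_topic_probs, topic, threshold=0.5):
    """Return indices of documents whose probability for `topic` exceeds threshold.

    The 0.5 default is an illustrative choice, not a documented CorEx setting.
    """
    return [d for d, row in enumerate(doc_topic_probs) if row[topic] > threshold]


# Toy stand-in for a fitted model's p_y_given_x (4 documents, 2 topics).
# Rows do not sum to 1: document 2 belongs to both topics, document 3 to neither.
probs = [
    [0.95, 0.10],
    [0.20, 0.85],
    [0.70, 0.60],
    [0.05, 0.02],
]

top = top_documents_per_topic(probs, n_top=2)
members = [documents_in_topic(probs, t) for t in range(2)]
print(top)
print(members)
```

With a fitted model you would presumably pass topic_model.p_y_given_x in place of probs (topic_model being whatever you named your model). The same ranking helper should work on log_z if you would rather order documents by pointwise total correlation.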

dongqing7 commented 7 years ago

That's fantastic! Thank you!