gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
626 stars 119 forks

Topic in document with 0.99 prob but not a single word overlaps between document and topic #33

Closed vladradishevsky closed 3 years ago

vladradishevsky commented 4 years ago

Hello

I have 200k documents and train a model with 100 topics. When I look at the terms, the topics seem good. But when I want to look at example documents for each topic, I run probs, _ = topic_model.transform(count_matrix, details=True), create a new column per topic, for example dataframe['topic=0'] = pd.Series(probs[:, 0]), and sort the dataframe by decreasing probability. Only about a third of the top documents are relevant to the topic; the rest are irrelevant. Moreover, for some documents not a single word overlaps between the document and the topic's terms, so there is no visible similarity between them.
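The ranking step described above can be sketched with toy data; the `probs` array here is an illustrative stand-in for the `(n_docs, n_topics)` output of `transform`, and the topic index is arbitrary:

```python
# Toy document-topic probabilities standing in for
# probs, _ = topic_model.transform(count_matrix, details=True)
probs = [
    [0.99, 0.01],
    [0.10, 0.80],
    [0.95, 0.05],
]

topic = 0
# Indices of documents sorted by decreasing probability for `topic`,
# equivalent to sorting the dataframe column by value.
ranked = sorted(range(len(probs)), key=lambda i: probs[i][topic], reverse=True)
top_docs = ranked[:2]  # inspect the highest-probability documents first
```

Checking the actual text of `top_docs` against the topic's terms is what reveals the mismatch reported in this issue.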

I also noticed that the last ~10 topics in the get_topics result have only a few words (3-8), those words look random, and their probability values (~0.2-0.3) are above average.

Could you advise how I can change the model, in particular how to recalculate the document-topic probability estimates? Thank you.

ryanjgallagher commented 4 years ago

@gregversteeg Could this have to do with certain words not appearing in a topic contributing to that topic having a higher probability for a document?

gregversteeg commented 4 years ago

Am I understanding correctly: the issue is that some documents have a high probability for a topic, but the top words for that topic don't appear in the document?

Ryan's point is a possibility, but I think a more likely culprit is document length. This model is fundamentally binary: it doesn't use counts, it just binarizes them, so any count greater than 0 becomes a 1. This can be an issue if your documents have very different lengths, since a long document gets many more "1"'s than a short one. You could check whether your problem cases involve the longer documents. Another check would be to keep only the first K words of each document, re-train, and see whether the results change much. If length is the issue, you could break the long documents up into sub-documents.