gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
626 stars 119 forks

Topic in document with 0.99 prob but not a single word overlaps between document and topic #33

Closed vladradishevsky closed 3 years ago

vladradishevsky commented 4 years ago

Hello

I have 200k documents and train a model with 100 topics. When I look at the terms, the topics seem good. But when I want to look at example documents for each topic, I run probs, _ = topic_model.transform(count_matrix, details=True), create a new column per topic, for example dataframe['topic=0'] = pd.Series(probs[:, 0]), and sort the dataframe by decreasing probability. Only about a third of the top documents are relevant to the topic; the rest are irrelevant. Moreover, for some documents not a single word overlaps between the document and the topic's terms, so there is no visible similarity between them.
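The ranking step described above can be sketched with toy data; the `probs` array here is an illustrative stand-in for the `(n_docs, n_topics)` output of `transform`, and the topic index is arbitrary:

```python
# Toy document-topic probabilities standing in for
# probs, _ = topic_model.transform(count_matrix, details=True)
probs = [
    [0.99, 0.01],
    [0.10, 0.80],
    [0.95, 0.05],
]

topic = 0
# Indices of documents sorted by decreasing probability for `topic`,
# equivalent to sorting the dataframe column by value.
ranked = sorted(range(len(probs)), key=lambda i: probs[i][topic], reverse=True)
top_docs = ranked[:2]  # inspect the highest-probability documents first
```

Checking the actual text of `top_docs` against the topic's terms is what reveals the mismatch reported in this issue.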

I also noticed that the last ~10 topics in the get_topics result have only a few words (3-8), those words look random, and their probability values (~0.2-0.3) are above average.

Could you advise how I can change the model, in particular how to recalculate the document-topic probability estimates? Thank you.

ryanjgallagher commented 4 years ago

@gregversteeg Could this have to do with certain words not appearing in a topic contributing to that topic having a higher probability for a document?

gregversteeg commented 4 years ago

Am I understanding correctly: the issue is that some documents have a high probability for a topic, but the top words for that topic don't appear in the document?

Ryan's point is a possibility, but I think a more likely culprit is document length. This model is fundamentally binary: it doesn't use counts, it just binarizes them, so any count greater than 0 becomes a 1. This can be an issue if your documents have very different lengths, since a long document gets many more "1"'s than a short one. You could check whether your problem cases involve the longer documents. Another check would be to keep only the first K words of each document, re-train, and see whether the results change much. If length is the issue, you could break the long documents up into sub-documents.