gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0
626 stars 119 forks

Some documents not belonging to any topics #46

Closed by RMZ3 3 years ago

RMZ3 commented 3 years ago

Hello,

I'm running into a problem where some documents are not assigned to any topic and some documents are assigned to all of the topics. When I check the probabilities using p_y_given_x, I get this (where n_hidden = 5):

[[9.99999e-01 1.00000e-06 1.00000e-06 1.00000e-06 9.99999e-01] [1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06] [1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06] ... [1.00000e-06 9.99999e-01 1.00000e-06 9.99999e-01 1.00000e-06] [9.99999e-01 9.99999e-01 1.00000e-06 9.99999e-01 9.99999e-01] [9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01]]

Any idea as to why this is happening?

ryanjgallagher commented 3 years ago

Hi there,

It's a bit hard to say just from this output. Have you tried looking at what the topic words are? Are they reasonable?

It may be that using only 5 topics is finding broad themes that are common to many of the documents. I'd try using more topics to see if that helps sift the documents more.

You may also want to look at the attribute log_p_y_given_x. If there's any numeric weirdness going on with exponentiation, then the log probabilities should be more stable.
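Before changing the model, it can help to quantify how many documents fall into no topics versus all of them. The following is a minimal sketch in plain Python, not part of the CorEx API: the helper names `topic_counts` and `summarize` are hypothetical, and the 0.5 cutoff is an assumption about how a probability is turned into a binary topic membership. The sample rows are the ones visible in the output above; the elided rows are not reconstructed.

```python
def topic_counts(p_y_given_x, threshold=0.5):
    """Return, for each document (row), how many topics exceed the threshold."""
    return [sum(p > threshold for p in row) for row in p_y_given_x]


def summarize(p_y_given_x, threshold=0.5):
    """Count documents that belong to no topics, all topics, or something in between."""
    counts = topic_counts(p_y_given_x, threshold)
    n_topics = len(p_y_given_x[0]) if p_y_given_x else 0
    return {
        "no_topics": sum(c == 0 for c in counts),
        "all_topics": sum(c == n_topics for c in counts),
        "some_topics": sum(0 < c < n_topics for c in counts),
    }


# Rows copied from the posted p_y_given_x output (n_hidden = 5).
rows = [
    [9.99999e-01, 1.0e-06, 1.0e-06, 1.0e-06, 9.99999e-01],
    [1.0e-06, 1.0e-06, 1.0e-06, 1.0e-06, 1.0e-06],
    [1.0e-06, 1.0e-06, 1.0e-06, 1.0e-06, 1.0e-06],
    [1.0e-06, 9.99999e-01, 1.0e-06, 9.99999e-01, 1.0e-06],
    [9.99999e-01, 9.99999e-01, 1.0e-06, 9.99999e-01, 9.99999e-01],
    [9.99999e-01, 9.99999e-01, 9.99999e-01, 9.99999e-01, 9.99999e-01],
]

summary = summarize(rows)
print(summary)
```

If the probabilities are held in a NumPy array, passing `array.tolist()` should work with the same functions. A large share of documents in the "no topics" or "all topics" buckets would suggest the topics are too broad or the document lengths too uneven.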

gregversteeg commented 3 years ago

Generally, I'd say this is a difference between LDA topic models and CorEx topic models. In LDA, every document must be in a topic, and this is represented as a probability distribution over topics for each document. In CorEx, it is possible for a document to belong to all topics or to none. For instance, you may find that a blank document, or a very short document, is in zero topics. In LDA, that document would get a mixture over the prior probability of topics. If this isn't desirable, one thing that may help is breaking up documents into pieces of roughly similar length. If some documents are very short, they will end up with no topics, and long ones will contain words from all topics. By breaking up documents into sub-documents of, say, a few hundred words, you may get better behavior.
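The suggestion to split documents into sub-documents of similar length could be sketched roughly as follows. This is an illustration in plain Python rather than anything provided by CorEx: the function names, the 300-word chunk size, and the rule of merging a short trailing piece into the previous chunk are all assumptions chosen for the example, and whitespace splitting stands in for whatever tokenizer is actually used.

```python
def split_into_chunks(text, chunk_size=300, min_size=50):
    """Split a document into chunks of about `chunk_size` whitespace-separated words.

    A trailing chunk shorter than `min_size` is merged into the previous chunk
    so that no very short sub-document is created.
    """
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]


def split_corpus(docs, chunk_size=300, min_size=50):
    """Chunk every document and remember which original document each chunk came from."""
    sub_docs, parent_ids = [], []
    for doc_id, text in enumerate(docs):
        for chunk in split_into_chunks(text, chunk_size, min_size):
            sub_docs.append(chunk)
            parent_ids.append(doc_id)
    return sub_docs, parent_ids


# A 620-word document and a 10-word document.
long_doc = " ".join(f"w{i}" for i in range(620))
short_doc = " ".join(f"s{i}" for i in range(10))

sub_docs, parent_ids = split_corpus([long_doc, short_doc])
chunk_lengths = [len(d.split()) for d in sub_docs]
print(chunk_lengths, parent_ids)
```

Keeping `parent_ids` makes it possible to map topic labels for sub-documents back to the original documents after fitting. Note that splitting only evens out long documents; documents that are short to begin with stay short and may still end up in no topics.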