Closed: RMZ3 closed this issue 3 years ago
Hi there,
It's a bit hard to say just from this output. Have you tried looking at what the topic words are? Are they reasonable?
It may be that with only 5 topics the model is finding broad themes that are common to many of the documents. I'd try using more topics to see if that helps separate the documents better.
You may also want to look at the log_p_y_given_x attribute. If there's any numeric weirdness going on with the exponentiation, the log probabilities should be more stable.
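To see concretely why the log probabilities can be more stable, here is a minimal NumPy sketch (illustrative only, not CorEx code) of how two distinct log-probabilities can collapse to the same value once exponentiated:

```python
import numpy as np

# Two documents whose topic log-probabilities are distinct, but whose
# probabilities sit closer to 1.0 than float64 can resolve (the spacing
# of doubles just below 1.0 is about 1.1e-16). Exponentiating rounds
# both to exactly 1.0, while the log values stay distinguishable.
log_p_a = -1e-17  # log P(topic | doc_a)
log_p_b = -1e-18  # log P(topic | doc_b), a genuinely different value

p_a = np.exp(log_p_a)
p_b = np.exp(log_p_b)

print(p_a == p_b)          # True: both rounded to 1.0 after exponentiation
print(log_p_a == log_p_b)  # False: the logs still separate the documents
```

This is the same kind of saturation you see in the p_y_given_x output, where everything is pinned at 1e-06 or 9.99999e-01; inspecting log_p_y_given_x preserves the differences.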
Generally, I'd say this is a difference between LDA topic models and CorEx topic models. In LDA, every document must belong to a topic, and this is represented as a probability distribution over topics for each document. In CorEx, a document can belong to all topics or to none. For instance, you may find that a blank document, or a very short one, is in zero topics; in LDA, that same document would get a mixture over the prior probability of topics.

If this isn't desirable, one thing that may help is breaking your documents up into relatively similar lengths. Very short documents will tend to end up with no topics, and very long ones will contain words from all topics. By breaking documents into sub-documents of, say, a few hundred words, you may get better behavior.
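The splitting suggestion above can be sketched as a small helper (a hypothetical function, not part of the CorEx API; the 300-word limit is just an example):

```python
def chunk_document(text, max_words=300):
    """Split a document into sub-documents of at most max_words words.

    Hypothetical helper for pre-processing before fitting the topic
    model, so all inputs have roughly comparable lengths.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

# A toy 650-word document splits into chunks of 300, 300, and 50 words.
doc = ("word " * 650).strip()
chunks = chunk_document(doc, max_words=300)
print([len(c.split()) for c in chunks])  # → [300, 300, 50]
```

You would then build the document-term matrix from the chunks rather than the full documents, and aggregate topic assignments per original document afterwards if needed.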
Hello,
I'm running into a problem where some documents are not being assigned to any topic while others are being assigned to all of the topics. When I check the probabilities using p_y_given_x, I get this (with n_hidden = 5):
[[9.99999e-01 1.00000e-06 1.00000e-06 1.00000e-06 9.99999e-01]
 [1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06]
 [1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06 1.00000e-06]
 ...
 [1.00000e-06 9.99999e-01 1.00000e-06 9.99999e-01 1.00000e-06]
 [9.99999e-01 9.99999e-01 1.00000e-06 9.99999e-01 9.99999e-01]
 [9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01 9.99999e-01]]
Any idea as to why this is happening?