gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Priority is always given to the first anchor from anchor words #49

Open ElizaLo opened 3 years ago

ElizaLo commented 3 years ago

I have a dataset of 10 thousand documents that definitely covers 16 topics, and I want to use anchor words to classify the documents into those 16 topics. For each topic I define a separate list of anchor words (some topics have more words, some fewer, but about 50 words per topic on average). I then check which anchor words are actually present in the texts and add each topic's list to the overall list of lists, anchors.
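The setup described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the helper name `filter_anchors`, the toy vocabulary, and the candidate words are all made up, and "present in the texts" is assumed to mean "present in the vocabulary the model is fitted on".

```python
# Hypothetical helper: keep only the anchor words that occur in the vocabulary,
# preserving one inner list per topic.
def filter_anchors(candidate_anchors, vocabulary):
    vocab = set(vocabulary)
    return [[w for w in topic_words if w in vocab]
            for topic_words in candidate_anchors]

# Toy vocabulary and candidate anchor words (invented for this example).
vocabulary = ["cake", "sugar", "wine", "beer", "oven"]
candidate_anchors = [
    ["cake", "sugar", "pudding"],   # desserts
    ["wine", "beer", "whiskey"],    # alcoholic drinks
]

# "pudding" and "whiskey" are dropped because they are not in the vocabulary.
anchors = filter_anchors(candidate_anchors, vocabulary)
print(anchors)
```

In corextopic, a list of lists like this is what gets passed as the `anchors` argument of `fit`, alongside `words`.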

But in the output, one topic always dominates my documents (90-95%), and it is always the topic whose words come first in the anchor words (I checked this by changing the order of the anchor words).

For example, I have a desserts topic and an alcoholic drinks topic. If I put the desserts anchor words first in the list of anchor words, then the desserts topic prevails in the output. If I put the alcoholic drinks anchor words first, then the alcoholic drinks topic prevails.

By "prevail" I mean that 90% or more of the documents are labeled with the first topic in the anchor words. The other topics also appear in the output, but much less often, and their assignments are also wrong.

Can you please tell me why this is happening and what I might be doing wrong?

Thank you in advance for your help and answer!

ryanjgallagher commented 3 years ago

CorEx is a bit different from LDA in the sense that the topic probabilities don't have to add up to 1. So it could be that 90% of your documents express the desserts topic, but also 90% of your documents express the drinks topic. What do those proportions look like each time you switch the order of the anchors? I think you could check by doing something like this, if you're not already:

import numpy as np
# topic_model.labels: binary document-by-topic matrix of the fitted model
n_docs = topic_model.labels.shape[0]
topic_proportions = np.sum(topic_model.labels, axis=0) / n_docs
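To see why these proportions need not add up to 1, here is a small self-contained sketch of the same computation. The `labels` matrix is invented and simply stands in for `topic_model.labels`; it is not output from a real model.

```python
import numpy as np

# Made-up boolean document-topic matrix: 5 documents (rows) x 2 topics (columns).
# A document can express both topics at once, or neither.
labels = np.array([
    [True,  True],
    [True,  True],
    [True,  False],
    [True,  True],
    [False, True],
])

n_docs = labels.shape[0]
# Fraction of documents expressing each topic (4 of 5 for both columns here).
topic_proportions = np.sum(labels, axis=0) / n_docs
print(topic_proportions)
# The per-topic fractions sum to more than 1 because topics overlap.
print(topic_proportions.sum())
```

If one topic really dominated at the expense of the others, the remaining columns would have proportions near 0 rather than comparably high values.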

Some other thoughts that might help