gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

[Question] Anchoring multiple times #48

Closed: pat266 closed this issue 3 years ago

pat266 commented 3 years ago

In the example from the README, there are three anchoring strategies. I'm interested in two of them: anchoring a single set of words to multiple topics, and anchoring different sets of words to multiple topics. I'm wondering whether I should combine two (or more) of the strategies to get a better result. For example, using the example from the README:

Anchor the specific list of words for every individual document

topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)

Anchor general words throughout all of the documents

topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)

Will fitting the model with two different anchor word lists improve the result in general (or change anything at all), or will it decrease the quality of the result?

Also, does repeating words in the anchor words list change how the model views them (for example, by increasing their strength)? In the second snippet, the words 'protest' and 'riot' are each repeated three times.

ryanjgallagher commented 3 years ago

Whether anchor words improve the quality of your results depends on your data and on what kind of results you are looking for. If you have two sets of anchor words that make sense to anchor, because you want topics around those words or have good reason to believe such topics should exist, then I would suggest trying it and comparing the results to those without the anchor words. If you're interested in the topics themselves, then you'll want to investigate which topics appear and don't appear with and without the anchor words. If you're using the topics as input features to some other model, then you'll want to see how anchoring affects the quantitative output.
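As a minimal sketch of that kind of comparison (not part of the library, and not from this thread), the helper below scores how much each topic from one run overlaps with each topic from another run. It assumes you have already reduced each run to plain lists of top words, for example from the output of `get_topics()`; the name `topic_overlap` and the sample word lists are hypothetical.

```python
def topic_overlap(topics_a, topics_b):
    """Jaccard overlap between every pair of topics from two model runs.

    Each argument is a list of topics; each topic is a list of top words.
    Entry [i][j] of the result compares topics_a[i] with topics_b[j].
    """
    matrix = []
    for a in topics_a:
        set_a = set(a)
        row = []
        for b in topics_b:
            set_b = set(b)
            union = set_a | set_b
            # Shared words divided by all distinct words; 0.0 if both are empty.
            row.append(len(set_a & set_b) / len(union) if union else 0.0)
        matrix.append(row)
    return matrix


# Hypothetical top words from an unanchored run and an anchored run.
unanchored = [["dog", "cat", "pet"], ["mountain", "trail", "hike"]]
anchored = [["bernese", "mountain", "dog"], ["mountain", "rocky", "colorado"]]

overlap = topic_overlap(unanchored, anchored)
for i, row in enumerate(overlap):
    print(i, [round(v, 2) for v in row])
```

A topic whose row is close to zero everywhere has no counterpart in the other run, which is a quick way to spot topics that appear or disappear when anchors are added.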

How much the results will change depends on how high you set the anchor_strength. The anchor strength is how much weight to assign to the anchor words relative to all the other words. So, for example, anchor_strength=2 means giving twice the weight to the anchor words compared to other words.

Yes, repeating the words in that example makes a difference. In that example protest is anchored to topics 1, 2, and 3, while riot is anchored to topics 4, 5, and 6. The model will find different topics for each of those. If you wanted to anchor multiple sets of words multiple times then you'd do something like

anchors=[['mountain', 'dog'], ['mountain', 'dog'], ['rocky', 'mountain'], ['rocky', 'mountain']]
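A small sketch of building such a list programmatically, so that each word set is repeated a fixed number of times. The helper name `repeat_anchors` is hypothetical; the result is only meant to be passed as the `anchors` argument of `fit`, and it assumes the model has at least as many topics as there are entries in the list.

```python
def repeat_anchors(word_sets, times):
    """Repeat each anchor word set `times` times, keeping copies adjacent.

    Each repeated entry occupies its own position in the list, so each
    copy is anchored to a different topic.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    anchors = []
    for word_set in word_sets:
        for _ in range(times):
            anchors.append(list(word_set))  # copy so entries are independent
    return anchors


anchors = repeat_anchors([["mountain", "dog"], ["rocky", "mountain"]], times=2)
print(anchors)
```

With `times=2` this reproduces the list shown above: two topics anchored on 'mountain' and 'dog', followed by two anchored on 'rocky' and 'mountain'.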

pat266 commented 3 years ago

@ryanjgallagher If I have 4 topics and I only have 3 anchor words lists, can I leave the 4th anchor words list empty, or will it mess up the algorithm?

For example:

anchors=[['mountain', 'dog'], [''], ['rocky', 'mountain'], ['rocky', 'mountain']]

ryanjgallagher commented 3 years ago

You should just put lists for the topics you want to anchor. You shouldn't put anything for topics you don't anchor. So it would be like

anchors=[['mountain', 'dog'], ['rocky', 'mountain'], ['rocky', 'mountain']]
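If the anchor list is assembled from user input or a config file, placeholder entries such as `['']` can slip in. A minimal sketch of filtering them out before calling `fit`; the helper `drop_empty_anchors` is hypothetical and not part of the library.

```python
def drop_empty_anchors(anchors):
    """Remove empty or blank anchor entries, returning a list of word lists.

    An entry may be a single word (a string) or a list of words.
    Blank strings are discarded, and entries left with no words are dropped.
    """
    cleaned = []
    for entry in anchors:
        words = [entry] if isinstance(entry, str) else list(entry)
        words = [w for w in words if w and w.strip()]
        if words:
            cleaned.append(words)
    return cleaned


raw = [["mountain", "dog"], [""], ["rocky", "mountain"], ["rocky", "mountain"]]
anchors = drop_empty_anchors(raw)
print(anchors)
```

Because anchors are matched to topics by position, removing an entry shifts the later entries forward, so the unanchored topics end up after the anchored ones.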

pat266 commented 3 years ago

@ryanjgallagher I guess a better way to ask the first question is: is there a way to assign a different anchor_strength to different sets of words for the same topic? For example, if I have 2 topics in total, can I do something like this?

topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=4)
topic_model.fit(X, words=words, anchors=[['protest'], ['riot']], anchor_strength=2)

Would this work?

ryanjgallagher commented 3 years ago

No, you currently can't set different anchor weights for different words. Pull request #40 proposes to add that feature, but unfortunately we haven't had the capacity to verify it yet.