gregversteeg / corex_topic

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Apache License 2.0

Metrics for Model Selection #51

Open mchabala opened 3 years ago

mchabala commented 3 years ago

Hi,

I'm testing some semi-supervised models, each with 20 topics created through lists of roughly 15 anchor words per topic. The documents in the corpus I'm working with vary widely in length (150 to 20,000+ words). I've broken the documents into smaller batches to help control for document length, and I'm looking for the batch size and anchor strength that produce the best model.

I know that total correlation is the measure CorEx maximizes when constructing the topic model, but in my experiments with anchor strength I've found that TC always increases linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned from .get_topics(), and I was wondering if there is a more quantitative way of selecting one model over another. I've looked into using other packages to measure the semantic similarity between the anchor words and the words retrieved by .get_topics(), but wanted to reach out to see if there are any other metrics out there for measuring model performance.
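For context, here's roughly what I'm doing (a minimal sketch; `X`, `words`, and `anchor_lists` stand in for my actual document-term matrix, vocabulary, and anchor word lists):

```python
# Minimal sketch of the setup: fit an anchored CorEx model and inspect the
# top words per topic. X (sparse doc-term matrix), words (vocabulary list),
# and anchor_lists (20 lists of anchor words) are placeholders.
import corextopic.corextopic as ct

model = ct.Corex(n_hidden=20, seed=42)
model.fit(X, words=words, anchors=anchor_lists, anchor_strength=3)

print(f"total correlation: {model.tc:.2f}")
for k in range(20):
    top_words = [w[0] for w in model.get_topics(topic=k, n_words=15)]
    print(f"topic {k}: {', '.join(top_words)}")
```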

Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.

ryanjgallagher commented 3 years ago

For determining the "best" model, I think it depends on whether the topic model is being used as an interpretative / organizing tool, or as a set of features in some downstream task. If it's for a downstream task (e.g. document classification), then I think using a metric related to that task is going to be more useful than total correlation, since that's the outcome you're interested in anyway.
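As a rough sketch of what I mean (not a prescribed recipe), you could use the model's binary document-topic labels as features and score the downstream task directly; `X`, `words`, `anchor_lists`, and `y_labels` are placeholders here:

```python
# Sketch of extrinsic evaluation: judge a candidate topic model by how well
# its document-topic assignments support the downstream classification task.
# X, words, anchor_lists, and y_labels (document class labels) are placeholders.
import corextopic.corextopic as ct
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = ct.Corex(n_hidden=20, seed=42)
model.fit(X, words=words, anchors=anchor_lists, anchor_strength=3)

doc_topic = model.labels  # binary (n_docs x n_topics) document-topic matrix

score = cross_val_score(LogisticRegression(max_iter=1000),
                        doc_topic, y_labels, cv=5, scoring="f1_macro").mean()
print(f"mean macro-F1 for this configuration: {score:.3f}")
```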

For tasks that require more intrinsic evaluation (e.g. interpreting topics, organizing documents into different clusters), I think TC (or any model evaluation metric) should be used as a rough guideline rather than a definitive arbiter of the "best" topic model, since there's no single best topic model, only ones that are useful for what you want to do. In that case, I would use TC to get a sense of how many topics are most useful for modeling the documents. I wouldn't rely on it as much for setting the anchor strength because, as you've noticed, TC increases as the anchor strength increases.
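For example, something like this sketch (`X` and `words` are placeholders, and the range of `n_hidden` values is arbitrary):

```python
# Sketch: use TC as a rough guide for the number of topics. Fit unanchored
# models over a range of n_hidden and look for where TC starts to level off.
# X and words are placeholders.
import corextopic.corextopic as ct

for n in (10, 20, 30, 40, 50):
    model = ct.Corex(n_hidden=n, seed=42)
    model.fit(X, words=words)
    # model.tc is the total TC explained; model.tcs gives per-topic contributions
    print(f"n_hidden={n:>3}  TC={model.tc:.2f}")
```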

Instead, I'd keep two things in mind for setting the anchor strength. The first is that the anchor strength is the amount of weight CorEx assigns to an anchor word relative to all other words. So if the anchor strength is 2, then CorEx gives that word twice the weight of any other word. The second is that the higher the anchor strength, the less room the topic model has to find topics, because you have already dictated what the topics should be. I would avoid setting the anchor strength in the thousands, or even the hundreds, because that's likely to force all of the anchor words to be the top topic words, and at that point you're probably better off with a keyword approach than a topic model. So I would check that the topics you're getting still have some flexibility to them after you set the anchor strength. If you're only seeing the anchor words as the top topic words, the anchor strength is probably set too aggressively.
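One way to make that check concrete is to sweep the anchor strength and count how many non-anchor words still show up in each topic's top words. A rough sketch, with `X`, `words`, and `anchor_lists` as placeholders:

```python
# Sketch: check how much flexibility the topics retain as anchor strength grows.
# If the top words are nothing but the anchors, the strength is probably too high.
# X, words, and anchor_lists are placeholders.
import corextopic.corextopic as ct

for strength in (1.5, 2, 3, 5, 10):
    model = ct.Corex(n_hidden=20, seed=42)
    model.fit(X, words=words, anchors=anchor_lists, anchor_strength=strength)
    for k, anchors_k in enumerate(anchor_lists):
        top_words = {w[0] for w in model.get_topics(topic=k, n_words=10)}
        n_other = len(top_words - set(anchors_k))
        print(f"anchor_strength={strength}, topic {k}: {n_other} non-anchor words in top 10")
```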

There are other quantitative measures of topic model evaluation, but they all assume different things about what it means to be a "good" model. Topic coherence is a classic metric at this point. Doing something with embeddings and semantic similarity could be useful too.
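If you want to try coherence, a sketch using gensim's CoherenceModel could look like this (`tokenized_docs` and an already-fitted `model` are assumed):

```python
# Sketch: NPMI coherence of CorEx topics via gensim. tokenized_docs (list of
# token lists for the corpus) and model (a fitted CorEx topic model) are assumed.
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

dictionary = Dictionary(tokenized_docs)
topics = [[w[0] for w in model.get_topics(topic=k, n_words=10)]
          for k in range(model.n_hidden)]
# keep only words gensim's dictionary has seen, to avoid lookup errors
topics = [[w for w in topic if w in dictionary.token2id] for topic in topics]

cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                    dictionary=dictionary, coherence="c_npmi")
print(f"mean NPMI coherence: {cm.get_coherence():.3f}")
```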

There's a max_iter parameter that controls how many iterations the model runs. In my experience, the models seem to converge before reaching the default (200 iterations), so I wouldn't worry too much about it.
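If you do want to double-check convergence, a sketch like this works; it assumes the fitted model keeps a per-iteration TC trace in tc_history, which is worth confirming for your installed version (`X`, `words`, and `anchor_lists` are placeholders again):

```python
# Sketch: raise max_iter if needed and sanity-check convergence.
# Assumes the fitted model exposes tc_history (per-iteration TC values);
# confirm this attribute exists in your installed version.
import corextopic.corextopic as ct

model = ct.Corex(n_hidden=20, max_iter=200, seed=42)
model.fit(X, words=words, anchors=anchor_lists, anchor_strength=3)

print(f"iterations recorded: {len(model.tc_history)}")
print("last few TC values:", [round(t, 3) for t in model.tc_history[-5:]])
```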