Closed: theobayard closed this issue 4 years ago.
I think the reference here helps a bit: https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html Paper reference: Termite (https://dl.acm.org/doi/10.1145/2254556.2254572)
I assume this will live in LDAModel itself, but it should be straightforward to compute directly using LDAModel.wordTopicCounts and LDAModel.topicWordCounts.
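Computing it from the counts might look something like the sketch below. This is only an assumption about the data layout (here, `wordTopicCounts` as a map from word to an array of per-topic assignment counts; the real LDAModel fields may differ), following the Termite definition saliency(w) = P(w) · distinctiveness(w), where distinctiveness is the KL divergence of P(T|w) from P(T):

```javascript
// Hedged sketch, not the actual LDAModel API. Assumes wordTopicCounts
// maps each word to an array of per-topic token-assignment counts,
// and that every topic has at least one assigned token.
function saliency(wordTopicCounts) {
  const words = Object.keys(wordTopicCounts);
  const numTopics = wordTopicCounts[words[0]].length;

  // Topic totals and the grand total over all tokens.
  const topicTotals = new Array(numTopics).fill(0);
  let total = 0;
  for (const w of words) {
    wordTopicCounts[w].forEach((c, t) => { topicTotals[t] += c; total += c; });
  }
  const pTopic = topicTotals.map(c => c / total); // P(T)

  const result = {};
  for (const w of words) {
    const counts = wordTopicCounts[w];
    const wordTotal = counts.reduce((a, b) => a + b, 0);
    const pWord = wordTotal / total; // P(w)

    // distinctiveness(w) = KL( P(T|w) || P(T) )
    let kl = 0;
    counts.forEach((c, t) => {
      if (c > 0) {
        const pTgivenW = c / wordTotal;
        kl += pTgivenW * Math.log(pTgivenW / pTopic[t]);
      }
    });

    result[w] = pWord * kl; // saliency(w) = P(w) * distinctiveness(w)
  }
  return result;
}
```

A word assigned to a single topic gets a large KL term, while a word spread evenly across topics gets a KL near zero, which is exactly the bias correction Serendip is after.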
I'm sure Xanda would encourage you to sink multiple days in KL divergence! I have some slides but they don't really make sense without narration.
Serendip uses a better metric, salience, to highlight words. I was going to implement it right away, but the paper started talking about Kullback–Leibler divergence and I got scared / decided I didn't want to sink my whole day into it :(

Right now, highlighting is based on the number of times a word is assigned to the selected topic divided by the total number of tokens assigned to that topic. Salience is better because it corrects for a bias in the current metric towards words that are associated with many topics. This change can be implemented by changing the value that getWordTopicValue in DocView returns.
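To make the proposed change concrete, here is a hedged sketch of the two values. The `model` fields (`wordTopicCounts`, `tokensPerTopic`) and the precomputed `distinctiveness` map are assumptions for illustration, not the actual DocView signature:

```javascript
// Current metric (as described above): the share of the selected
// topic's tokens that are this word.
function currentValue(word, topic, model) {
  return model.wordTopicCounts[word][topic] / model.tokensPerTopic[topic];
}

// One possible salience-style replacement: weight the same ratio by
// the word's distinctiveness, KL(P(T|w) || P(T)), so that words spread
// across many topics are down-weighted. `distinctiveness` is assumed
// to be precomputed once per model, keyed by word.
function salientValue(word, topic, model, distinctiveness) {
  return currentValue(word, topic, model) * distinctiveness[word];
}
```

Precomputing distinctiveness once per model keeps getWordTopicValue cheap, since it is called for every token in the document view.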