Closed: christofkaelin closed this issue 2 years ago
The description of get_top_topic_words() is correct: it returns the most probable words for each of the selected topics. You probably should have removed all those numeric terms at the preprocessing stage (it is highly recommended), as they are meaningless.
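For example, since you already build the document-term matrix with sklearn's CountVectorizer, purely numeric tokens can be excluded at the vectorization step. A minimal sketch (the token pattern and variable names are only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Keep only alphabetic tokens of length >= 2 (the (?u) flag also matches umlauts),
# so purely numeric terms never enter the vocabulary.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
X = vectorizer.fit_transform(lemmatized_texts)  # lemmatized_texts: your preprocessed reviews
vocabulary = vectorizer.get_feature_names_out()
```

Note that this pattern also drops mixed tokens such as model numbers; adjust it if you want to keep those.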
I can bind the tmplot.calc_terms_probs_ratio() function to the bitermplus package, but tmplot requires some refactoring first, so it will take some time.
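In the meantime, calc_terms_probs_ratio can be called from tmplot directly. A rough sketch, assuming the workflow from the tmplot README (the topic index and lambda value are placeholders):

```python
import tmplot as tmp

# Words-by-topics probability matrix extracted from the fitted BTM model
phi = tmp.get_phi(model)

# Rank the terms of one topic with a custom lambda; values < 1 down-weight
# terms that are frequent across the whole corpus
terms_probs = tmp.calc_terms_probs_ratio(phi, topic=0, lambda_=0.6)
tmp.plot_terms(terms_probs)
```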
I fitted a Biterm topic model based on my lemmas and sklearn's CountVectorizer. My dataset consists of German reviews of TVs and washing machines.
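For reference, a minimal sketch of that setup following the bitermplus README (the texts variable, topic count, and hyperparameters are placeholders, not the actual values from my project):

```python
import bitermplus as btm

# texts: list of lemmatized German review strings (placeholder)
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# T, alpha, beta and the iteration count are illustrative values only
model = btm.BTM(X, vocabulary, seed=12321, T=10, M=20, alpha=50 / 10, beta=0.01)
model.fit(biterms, iterations=200)

top_words = btm.get_top_topic_words(model, words_num=20)
```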
Unfortunately, get_top_topic_words yields unreasonable results (the top words are dominated by numeric terms). Thus, I used your tmplot package to see whether I could reconstruct them, and it turns out that I get similar results with lambda=1 inside tmp.report; using a lower value results in more reasonable words. Trying to apply this directly, I played around with tmplot's helper functions, which resulted in code along the lines of the sketch below.
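Roughly, the idea is the following (a sketch rather than my exact code; the lambda value and the marginal-probability approximation are simplifications):

```python
import numpy as np
import tmplot as tmp

# Words-by-topics probability matrix from the fitted BTM
phi = tmp.get_phi(model)

lambda_ = 0.6           # relevance weight; values < 1 penalize globally frequent terms
p_w = phi.mean(axis=1)  # crude marginal term probabilities (topics treated as equally likely)

# Relevance in the sense of Sievert & Shirley (2014), the weighting used by
# pyLDAvis-style tools:
#   r(w, t | lambda) = lambda * log(phi_wt) + (1 - lambda) * log(phi_wt / p_w)
eps = 1e-12  # guard against log(0)
relevance = lambda_ * np.log(phi + eps) + (1 - lambda_) * np.log((phi + eps).div(p_w + eps, axis=0))

# Ten most relevant terms per topic instead of the ten most probable ones
top_terms = {t: relevance[t].nlargest(10).index.tolist() for t in relevance.columns}
```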
The output I get does make sense in the context of my project and matches the output of tmplot (with lambda < 1).
As per the get_top_topic_words documentation, it returns the words with the highest probabilities in all selected topics. I am not sure what exactly I am missing: Am I missing some mathematical context? Is there any possibility of extending this method to use custom lambda values?