maximtrp / bitermplus

Biterm Topic Model (BTM): modeling topics in short texts
https://bitermplus.readthedocs.io/en/stable/
MIT License

get_top_topic_words yields unreasonable results #18

Closed christofkaelin closed 2 years ago

christofkaelin commented 2 years ago

I fitted a Biterm topic model based on my lemmas and sklearn's CountVectorizer. My dataset is about German reviews on TVs and washing machines.

Unfortunately, get_top_topic_words yields unreasonable results (screenshot attached).

To investigate, I used your tmplot package to see whether I could reproduce the output. It turns out that I get similar results with lambda=1 in tmp.report; using a lower lambda value yields more reasonable words.

Trying to apply this directly, I played around with tmplot's helper functions, which resulted in this code:

from tmplot._helpers import get_phi, calc_terms_probs_ratio
calc_terms_probs_ratio(get_phi(biterm_model), 0)['Terms'].to_list()[:20]

I get the following output (screenshot attached), which does make sense in the context of my project and matches the output of tmplot with lambda < 1.

According to the get_top_topic_words documentation, it returns the words with the highest probabilities in the selected topics. I am not sure what exactly I am missing: is there some mathematical context I have overlooked? Would it be possible to extend this method to accept custom lambda values?
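For context, the lambda here appears to correspond to the relevance weighting of Sievert & Shirley (2014), where relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). A minimal numpy sketch with toy numbers (not my actual data) shows why lambda=1 surfaces globally frequent terms while a lower lambda favours topic-specific ones:

```python
import numpy as np

# Toy phi: a (terms x topics) matrix of p(word | topic).
# Term 0 stands in for a globally frequent token, terms 1-2 are topic-specific.
phi = np.array([
    [0.40, 0.40],   # frequent in every topic
    [0.35, 0.05],   # specific to topic 0
    [0.25, 0.55],   # specific to topic 1
])
p_w = phi.mean(axis=1)  # marginal word probabilities (uniform topic prior assumed)

def relevance(phi, p_w, lam):
    # Sievert & Shirley (2014): lam * log(phi) + (1 - lam) * log(phi / p_w)
    return lam * np.log(phi) + (1 - lam) * np.log(phi / p_w[:, None])

# lambda = 1 ranks purely by p(word | topic): the common term tops topic 0
top_lam1 = np.argsort(relevance(phi, p_w, 1.0)[:, 0])[::-1]
# lambda < 1 penalizes globally frequent terms: the specific term tops topic 0
top_lam06 = np.argsort(relevance(phi, p_w, 0.6)[:, 0])[::-1]
```

With these numbers, term 0 ranks first for topic 0 at lambda=1, but term 1 takes its place at lambda=0.6.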

maximtrp commented 2 years ago

The description of get_top_topic_words() is correct: it returns the most probable words for each of the selected topics. You should probably have removed all those numeric terms at the preprocessing stage (which is highly recommended), as they are meaningless.

I can bind the tmplot.calc_terms_probs_ratio() function to the bitermplus package, but tmplot requires some refactoring first, so it will take time.
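In the meantime, a standalone relevance-based ranking is easy to sketch. The function below is illustrative only (its name and signature are not part of any package API); it takes a (terms x topics) phi matrix and a matching vocabulary, such as what the get_phi helper from the snippet above would yield:

```python
import numpy as np

def top_words_with_lambda(phi, vocab, lam=0.6, n=20):
    """Rank words per topic by relevance (Sievert & Shirley, 2014).

    phi   -- (terms x topics) array of p(word | topic)
    vocab -- sequence of terms aligned with phi's rows
    lam   -- relevance weight; 1.0 reproduces plain probability ranking,
             lower values favour topic-specific over globally frequent terms
    n     -- number of top words to return per topic
    """
    p_w = phi.mean(axis=1, keepdims=True)  # marginal word probabilities
    rel = lam * np.log(phi) + (1 - lam) * np.log(phi / p_w)
    order = np.argsort(rel, axis=0)[::-1][:n]  # top-n row indices per column
    return [[vocab[i] for i in order[:, k]] for k in range(phi.shape[1])]
```

Numeric tokens should still be dropped during preprocessing; lowering lambda only downweights them, it does not remove them.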