maximtrp / tmplot

Visualization of Topic Modeling Results
https://tmplot.readthedocs.org
MIT License

Incorrect implementation for `get_relevant_terms` in `tmplot` #10

Closed ed9w2in6 closed 9 months ago

ed9w2in6 commented 9 months ago

Thanks for your work on this nice package.

However, the current implementation of `get_relevant_terms` is incorrect.

The LDAvis paper reports that a value of around 0.6 is ideal for the weighting parameter (λ) of the relevance score. In tmplot, however, I have to tune it manually to somewhere between 0.99 and 0.9999 to get a good balance between high-frequency terms and high-lift (p(w|t)/p(w)) terms. From 0 to 0.9 there is almost no effect on the ranking. I do NOT observe this behaviour in pyLDAvis.

I think it is now clear that this is a scaling issue: in the original paper, the relevance score is defined with log probabilities, not with raw probabilities as in tmplot. This explains the strange behaviour of the responsive range being confined to roughly 1 - 1e-2 to 1 - 1e-4.
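To illustrate the scaling problem, here is a minimal sketch in plain Python. It is not tmplot's or pyLDAvis's actual code: the function names, the `terms` dictionary, and the probabilities are hypothetical toy values chosen for illustration. It compares the log-based relevance from the LDAvis paper, λ·log p(w|t) + (1 − λ)·log(p(w|t)/p(w)), with the same weighting applied to raw probabilities.

```python
import math

def relevance_log(p_wt, p_w, lam):
    """Relevance as defined in the LDAvis paper (log probabilities)."""
    return lam * math.log(p_wt) + (1.0 - lam) * math.log(p_wt / p_w)

def relevance_linear(p_wt, p_w, lam):
    """Same weighting on raw probabilities (the variant reported in this issue)."""
    return lam * p_wt + (1.0 - lam) * (p_wt / p_w)

def top_term(score, terms, lam):
    """Return the name of the term with the highest score for a given lambda."""
    return max(terms, key=lambda name: score(*terms[name], lam))

# Hypothetical toy data: name -> (p(w|t), p(w)).
# "frequent" has high p(w|t) but low lift (1.25);
# "high_lift" has low p(w|t) but high lift (20).
terms = {
    "frequent": (0.05, 0.04),
    "high_lift": (0.002, 0.0001),
}

for lam in (0.0, 0.3, 0.6, 0.9, 0.999):
    print(
        f"lambda={lam}: "
        f"log -> {top_term(relevance_log, terms, lam)}, "
        f"linear -> {top_term(relevance_linear, terms, lam)}"
    )
```

With these toy numbers, the log version changes its top term between λ = 0.3 and λ = 0.6, while the linear version only changes between λ = 0.9 and λ = 0.999. On the raw scale the lift term is orders of magnitude larger than p(w|t), so it dominates until λ is extremely close to 1.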

Here is a reference implementation from pyLDAvis. Unfortunately, their visualisation shows an incorrect definition in the footnote, even though their implementation is correct.

I have opened an issue for this: https://github.com/bmabey/pyLDAvis/issues/261

maximtrp commented 9 months ago

Hello! Thank you for reporting this. Indeed, I had not rechecked the formula. Please test the new version.