Thanks for your work on this nice package. However, the current implementation of `get_relevant_terms` is incorrect.
The LDAvis paper reports around 0.6 as the ideal value of the visualisation parameter λ for the relevance score. However, in tmplot I find that I have to manually tune λ to roughly 0.99–0.9999 to get a good balance between high-frequency terms and high-lift (p(w|t)/p(w)) terms. I observe that values from 0 to 0.9 have almost no effect on the ranking.
I do NOT observe this behaviour in pyLDAvis.
I think it is clear now that this is a scaling issue: in the original paper the relevance score is defined with log probabilities, whereas tmplot uses raw probabilities. This explains the strange behaviour of the responsive range being roughly 1 − 1e-2 to 1 − 1e-4.
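For reference, the definition in Sievert & Shirley (2014) is relevance(w, t | λ) = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). On the raw-probability scale the lift term p(w|t)/p(w) is typically orders of magnitude larger than p(w|t), so it dominates the sum unless λ is pushed very close to 1, which would explain the behaviour above.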
The pyLDAvis implementation can serve as a reference; unfortunately, its visualisation shows an incorrect definition in the footnote, even though the implementation itself is correct.
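Not the actual pyLDAvis or tmplot code, but a minimal sketch of what a log-scale relevance ranking could look like (the function name, signature, and `top_n` parameter are illustrative, not tmplot's API):

```python
import numpy as np

def relevant_terms(phi: np.ndarray, p_w: np.ndarray,
                   lambda_: float = 0.6, top_n: int = 30) -> np.ndarray:
    """Rank terms per topic by the LDAvis relevance score.

    phi : (n_topics, n_terms) array of p(w|t)
    p_w : (n_terms,) array of marginal term probabilities p(w)
    Both components are computed on the log scale, as in the paper.
    """
    log_phi = np.log(phi)               # log p(w|t)
    log_lift = np.log(phi / p_w)        # log(p(w|t) / p(w))
    relevance = lambda_ * log_phi + (1 - lambda_) * log_lift
    # Indices of the top_n most relevant terms for each topic.
    return np.argsort(-relevance, axis=1)[:, :top_n]
```

With this version, λ behaves as described in the paper: values around 0.6 already shift the ranking noticeably instead of only responding near 1.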
I have opened an issue for this: https://github.com/bmabey/pyLDAvis/issues/261