piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Average Topic Coherence over several values of topn #1281

Open tmylk opened 7 years ago

tmylk commented 7 years ago

"In contrast to the standard practice of using a fixed value of N (e.g. N = 5 or N = 10), our results suggest that calculating topic coherence over several different cardinalities and averaging results in a substantially more stable and robust evaluation"

See the paper "The Sensitivity of Topic Coherence Evaluation to Topic Cardinality" and the accompanying code by @jhlau.

macks22 commented 7 years ago

With the recent improvements to coherence evaluation introduced by #1349, this should be straightforward to implement. Start with the largest topn (e.g. 20 when using 20, 15, 10, and 5, as the paper did), then work down to the smallest. The set of relevant ids for the largest topn is guaranteed to be a superset of the sets for all smaller values, so the counts accumulated for it can be reused across the other calculations.
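To illustrate the reuse idea, here is a minimal standalone sketch. It is not gensim's implementation: the helper names (`accumulate_counts`, `umass_coherence`, `averaged_coherence`) are hypothetical, it uses a simple UMass-style document co-occurrence measure as a stand-in, and it assumes every topic word occurs in at least one document and every topn is at least 2.

```python
import math
from itertools import combinations


def accumulate_counts(docs, relevant_words):
    """Count document frequencies for relevant words and co-occurring pairs."""
    relevant = set(relevant_words)
    word_df, pair_df = {}, {}
    for doc in docs:
        present = sorted(relevant.intersection(doc))
        for word in present:
            word_df[word] = word_df.get(word, 0) + 1
        for pair in combinations(present, 2):  # pairs are in sorted order
            pair_df[pair] = pair_df.get(pair, 0) + 1
    return word_df, pair_df


def umass_coherence(topic_words, word_df, pair_df):
    """UMass-style coherence of one topic, using precomputed counts only."""
    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            co_count = pair_df.get(tuple(sorted((w_i, w_j))), 0)
            # Assumes w_j occurs in at least one document (word_df[w_j] >= 1).
            scores.append(math.log((co_count + 1.0) / word_df[w_j]))
    return sum(scores) / len(scores)


def averaged_coherence(topics, docs, topn_values=(20, 15, 10, 5)):
    """Average coherence over several topn values with one counting pass.

    `topics` is a list of word lists, each ranked from most to least probable.
    """
    largest = max(topn_values)
    # Words for the largest topn are a superset of those for smaller values,
    # so a single accumulation pass covers every cardinality.
    relevant = {word for topic in topics for word in topic[:largest]}
    word_df, pair_df = accumulate_counts(docs, relevant)

    per_topn = {}
    for topn in sorted(topn_values, reverse=True):
        topic_scores = [
            umass_coherence(topic[:topn], word_df, pair_df) for topic in topics
        ]
        per_topn[topn] = sum(topic_scores) / len(topic_scores)
    return sum(per_topn.values()) / len(per_topn), per_topn
```

The corpus is scanned only once, for the largest topn; each smaller cardinality just slices the ranked word lists and looks up the same counts.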

To make this an option for CoherenceModel, one could allow an iterable of values for topn. Alternatively, it might make sense to offer this only through the topic models' top_topics method, as part of combining those calculations with the CoherenceModel calculations (#1128).
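One way an "int or iterable" topn argument could be handled is sketched below. `normalize_topn` is a hypothetical helper, not part of gensim; it only shows how a single value and a collection of values might be brought into the same largest-first form.

```python
from collections.abc import Iterable


def normalize_topn(topn):
    """Return topn as a list of unique positive ints, largest first."""
    if isinstance(topn, int) and not isinstance(topn, bool):
        values = [topn]
    elif isinstance(topn, Iterable) and not isinstance(topn, (str, bytes)):
        values = list(topn)
    else:
        raise TypeError("topn must be an int or an iterable of ints")

    if not values:
        raise ValueError("topn must contain at least one value")
    for value in values:
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            raise ValueError("every topn value must be a positive int")

    # Largest first, so counts accumulated for values[0] cover all the rest.
    return sorted(set(values), reverse=True)
```

With this, existing callers passing a single integer would behave as before, while `normalize_topn([5, 10, 15, 20])` yields the descending order needed for count reuse.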