tmylk opened this issue 7 years ago
With the recent improvements to coherence evaluation introduced by #1349, it should be straightforward to implement this. Simply start with the largest `topn` (e.g. 20 when using 20, 15, 10, and 5, as the paper did), then work down to the smallest. The computation for the largest `topn` is guaranteed to use a set of relevant word ids that is a superset of those for all the smaller values, so the accumulated co-occurrence counts can be reused across the remaining calculations.
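A minimal sketch of the superset property this relies on, using plain NumPy (the `topic` array and vocabulary size here are purely illustrative):

```python
import numpy as np

# Stand-in for one topic's word weights over a 1000-word vocabulary.
rng = np.random.default_rng(0)
topic = rng.random(1000)

# Relevant ids for the largest cardinality (topn=20)...
top20 = set(np.argsort(topic)[::-1][:20])

# ...contain the relevant ids for every smaller cardinality, so counts
# accumulated once for topn=20 cover all of the smaller-topn runs.
for n in (15, 10, 5):
    assert set(np.argsort(topic)[::-1][:n]) <= top20
```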
To make this an option for `CoherenceModel`, one could allow an iterable of values for `topn`. Alternatively, it might make sense to make this an option only for the topic model's `top_topics` method, in conjunction with combining those calculations with the `CoherenceModel` calculations (#1128).
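As a rough sketch of the first option, a hypothetical `average_coherence` wrapper (not an existing gensim API) could look like the following; note the naive loop rebuilds the counts for each cardinality rather than reusing them as described above:

```python
from gensim.models import CoherenceModel

def average_coherence(model, texts, dictionary, topn_values=(20, 15, 10, 5),
                      coherence='c_v'):
    """Average coherence over several cardinalities (hypothetical helper).

    Builds one CoherenceModel per topn, largest first; an optimized
    version would instead share the counts accumulated for the largest
    topn across all of the smaller values.
    """
    scores = []
    for n in sorted(topn_values, reverse=True):
        cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                            coherence=coherence, topn=n)
        scores.append(cm.get_coherence())
    return sum(scores) / len(scores)
```

An iterable `topn` on `CoherenceModel` itself would let this averaging happen internally, where the accumulator reuse is actually possible.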
"In contrast to the standard practice of using a fixed value of N (e.g. N = 5 or N = 10), our results suggest that calculating topic coherence over several different cardinalities and averaging results in a substantially more stable and robust evaluation"
See paper "The Sensitivity of Topic Coherence Evaluation to Topic Cardinality" and code by @jhlau