mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Auto-correlation between samples (Binkley et al.) #201

Open jonaschn opened 3 years ago

jonaschn commented 3 years ago

I recently found this paper by Binkley et al.

A short extract from this paper follows:

If si is large enough, the observations are practically independent. However, too small a value risks unwanted correlation. To summarize the effect of b, n, and si: if any of these settings are too low, then the Gibbs sampler will produce inaccurate or inadequate information; if any of these settings are too high, then the only penalty is wasted computational effort. Unfortunately, as described in Section 6, support for extracting interval-separated observations is limited in existing LDA tools. For example, Mallet provides this capability but appears to suffer from a local maxima problem

with a footnote linking to http://www.cs.loyola.edu/~binkley/topic_models/additional-images/mallet-fixation/

Does this problem still exist?
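
One way to check this would be to measure the autocorrelation of some per-sample statistic at different sampling intervals. Below is a minimal sketch of that idea. It does not call Mallet: as an assumption for illustration, an AR(1) process stands in for the sequence of Gibbs states, and the names `ar1_chain`, `thin`, and `lag1_autocorrelation` are hypothetical helpers, not part of any library. The parameters `b`, `n`, and `si` follow my reading of the extract above (burn-in, number of observations, sampling interval).

```python
import random


def ar1_chain(length, phi=0.9, seed=0):
    """Simulate a correlated chain x[t] = phi * x[t-1] + noise.

    This is only a stand-in for successive Gibbs states, not an LDA sampler.
    """
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(length):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def thin(chain, b, n, si):
    """Discard b burn-in draws, then keep n observations spaced si apart."""
    kept = chain[b::si][:n]
    if len(kept) < n:
        raise ValueError("chain too short for the requested b, n, si")
    return kept


def lag1_autocorrelation(xs):
    """Sample autocorrelation between consecutive kept observations."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(len(xs) - 1))
    return cov / var


if __name__ == "__main__":
    chain = ar1_chain(60000)
    for si in (1, 10, 50):
        kept = thin(chain, b=1000, n=1000, si=si)
        print(f"si={si:3d}  lag-1 autocorrelation={lag1_autocorrelation(kept):+.3f}")
```

With a real sampler, the same measurement could be applied to a scalar summary saved at each kept iteration (for example, the model log-likelihood). If the autocorrelation stays high however large si is made, that would be consistent with the fixation behavior the paper describes.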

Reference: Binkley, D., Heinz, D., Lawrie, D., & Overfelt, J. (2014). Understanding LDA in source code analysis. 22nd International Conference on Program Comprehension, ICPC 2014 - Proceedings, 26–36. https://doi.org/10.1145/2597008.2597150