mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Fix misleading help text #188

Closed jonaschn closed 3 years ago

jonaschn commented 3 years ago

There is some confusion about the hyperparameter optimization of the Dirichlet priors. See https://stackoverflow.com/questions/52099379/mallet-hyperparameter-optimization

mimno commented 3 years ago

I agree that the message could be clearer, but this isn't correct. The Dirichlet over the doc-topic distributions can be seen as a distribution and a scale parameter. The normal hyperparameter optimization modifies all of them (# topics parameters), but this option only modifies the scale (1 parameter), keeping the distribution uniform.
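In other words (writing out the usual decomposition of a Dirichlet parameter vector into a base distribution and a concentration; "alphaSum" is Mallet's name for the total):

```latex
% A Dirichlet prior Dir(\alpha_1, \dots, \alpha_K) over doc-topic distributions
% can be factored into a base distribution m and a scale (concentration) s:
\[
  \alpha_k = s \, m_k,
  \qquad s = \sum_{k=1}^{K} \alpha_k \;(\text{``alphaSum''}),
  \qquad m_k = \frac{\alpha_k}{s} .
\]
% Full hyperparameter optimization re-estimates all K values \alpha_k, so m can
% become asymmetric. With --use-symmetric-alpha, m is held uniform (m_k = 1/K)
% and only the scale s is re-estimated.
```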

jonaschn commented 3 years ago

@mimno I edited my stackoverflow answer. I am not sure if you read my first (completely wrong) answer.

Does Mallet optimize the symmetric alpha prior (scale) if hyperparameter optimization is turned on with --optimize-interval and --use-symmetric-alpha? As you mentioned, the distribution stays the same (uniform), but at least in my experiments the scale also remains unchanged.

Toy example with 5 topics:

[parameters] --optimize-interval 100 --alpha 0.5 --use-symmetric-alpha

[before optimization] alphaSum = 0.5, which leads to alpha = [0.1, 0.1, 0.1, 0.1, 0.1], and beta = 0.01 (by default)

I assumed that with --optimize-interval and --use-symmetric-alpha only the symmetric beta prior is optimized (scale only).

[my expectation] alphaSum = 0.5, which leads to alpha = [0.1, 0.1, 0.1, 0.1, 0.1] (not optimized), and beta = 0.0101 (optimized)

If I understood you correctly, I am wrong and this is the reality:

[reality?] alphaSum = 0.55, which leads to alpha = [0.11, 0.11, 0.11, 0.11, 0.11] (scale optimized, but still symmetric), and beta = 0.0101 (optimized, always symmetric)

What exactly do you mean by the "concentration parameter"? I wrongly assumed that you meant alpha (as a symmetric distribution). This was confusing for me and for the Stack Overflow asker.

jonaschn commented 3 years ago

@mimno Could you please explain the following statement in a bit more detail (or provide a helpful reference)?

The Dirichlet over the doc-topic distributions [alpha] can be seen as a distribution and a scale parameter.

My understanding of alpha priors:

K := number of topics
α is a K-vector of positive values

Example: K = 10, alpha (CLI parameter) = 5.0 (default value). For a symmetric α prior this results in a vector with alpha_k = [alpha] / [num topics], i.e., [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5].

This initial prior will not be optimized further if one uses --optimize-interval = 0, regardless of --use-symmetric-alpha (false or true).

However, if --optimize-interval > 0 and --use-symmetric-alpha is set, the doc-topic distribution α still is not optimized, but the symmetric β is optimized further.

Only if --optimize-interval > 0 (without passing --use-symmetric-alpha) are both the doc-topic distribution α and β optimized further, resulting in an asymmetric alpha prior and an optimized beta prior.

Is the described behavior correct?

If this is correct, I don't understand how the concentration parameter of the prior over document-topic distributions α is optimized (as stated in the help text). What exactly do you mean by the concentration parameter in contrast to the scale?
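For reference, here is how I think these flags map onto the Java API, just a sketch to make sure we are talking about the same settings (the file name, topic count, and iteration count are placeholders):

```java
import java.io.File;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public class SymmetricAlphaExample {
    public static void main(String[] args) throws Exception {
        // Assumes "instances.mallet" was created beforehand with `mallet import-file` or `import-dir`.
        InstanceList instances = InstanceList.load(new File("instances.mallet"));

        int numTopics = 10;
        double alphaSum = 5.0;  // --alpha: total over all topics, i.e. alpha_k = 0.5 each
        double beta = 0.01;     // --beta

        ParallelTopicModel model = new ParallelTopicModel(numTopics, alphaSum, beta);
        model.addInstances(instances);

        model.setOptimizeInterval(100);  // --optimize-interval 100 (0 disables hyperparameter optimization)
        model.setSymmetricAlpha(true);   // --use-symmetric-alpha: keep the doc-topic prior symmetric across topics
        model.setNumIterations(1000);
        model.estimate();
    }
}
```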

jonaschn commented 3 years ago

@mimno I finally figured out what you mean by concentration parameter. This answer helped me to understand the difference between both parameterizations of Dirichlet distributions. These slides by Hanna Wallach provide an even better overview: https://people.cs.umass.edu/~wallach/talks/priors.pdf
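In case it helps other readers, the two parameterizations written out with the toy numbers from above (K = 5, alphaSum = 0.5):

```latex
% Standard parameterization:      Dir(\alpha_1, \dots, \alpha_K)
% Concentration + base measure:   Dir(s, m),  with  s = \sum_k \alpha_k  and  m_k = \alpha_k / s
\[
  \alpha = (0.1, 0.1, 0.1, 0.1, 0.1), \qquad
  s = \sum_k \alpha_k = 0.5, \qquad
  m = (0.2, 0.2, 0.2, 0.2, 0.2).
\]
% s is the concentration (Mallet's alphaSum); m is the uniform base measure.
```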

jonaschn commented 3 years ago

I edited my Stackoverflow answer (again) and proposed a more informative help text. What do you think?

jonaschn commented 3 years ago

@mimno I force-pushed my branch. Now you should be able to squash and merge this PR. For some reason, your comment disappeared, probably because it referenced a non-existent commit (overwritten by the force-push).

May I ask whether you could answer this question about the hyperparameter optimization technique used in Mallet?