Closed: jonaschn closed this 3 years ago
I agree that the message could be clearer, but this isn't correct. The Dirichlet over the doc-topic distributions can be seen as a distribution and a scale parameter. The normal hyperparameter optimization modifies all of them (# topics parameters), but this option only modifies the scale (1 parameter), keeping the distribution uniform.
@mimno I edited my stackoverflow answer. I am not sure if you read my first (completely wrong) answer.
Does Mallet optimize the symmetric alpha prior (scale) if hyperparameter optimization is turned on with `--optimize-interval` and `--use-symmetric-alpha`?
As you mentioned, the distribution stays the same (uniform), but at least in my experiments the scale also remains unchanged.
Toy Example with 5 topics:
[parameters] `--optimize-interval 100 --alpha 0.5 --use-symmetric-alpha`
[before optimization] alphaSum = 0.5 which leads to alpha = [0.1, 0.1, 0.1, 0.1, 0.1] and beta = 0.01 (by default)
I assumed that with `--optimize-interval` and `--use-symmetric-alpha` only the symmetric beta prior is optimized (scale only).
[my expectation] alphaSum = 0.5 which leads to alpha = [0.1, 0.1, 0.1, 0.1, 0.1] (not optimized) and beta = 0.0101 (optimized)
If I understood you correctly, I am wrong and this is the reality: [reality?] alphaSum = 0.55 which leads to alpha = [0.11, 0.11, 0.11, 0.11, 0.11] (scale optimized, but still symmetric) and beta = 0.0101 (optimized, always symmetric)
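To make the two readings concrete, here is a minimal numeric sketch of the toy example above. This is purely illustrative, not Mallet's actual optimizer; the value 0.55 is the hypothetical optimized scale from my example, not a number Mallet would necessarily produce:

```python
K = 5  # number of topics in the toy example

def symmetric_alpha(alpha_sum, k):
    """Expand the single scale parameter alphaSum into a uniform k-vector."""
    return [alpha_sum / k] * k

# Before optimization: alphaSum = 0.5
before = symmetric_alpha(0.5, K)    # [0.1, 0.1, 0.1, 0.1, 0.1]

# After a hypothetical optimization step with --use-symmetric-alpha:
# only the scale changes; the vector stays uniform.
after = symmetric_alpha(0.55, K)
```

The point of the sketch: under `--use-symmetric-alpha` there is only one free alpha parameter (the scale), never K of them.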
What exactly do you mean by the "concentration parameter"? I wrongly assumed you meant alpha (as a symmetric distribution). This was confusing for me and for the Stack Overflow questioner.
@mimno Could you please explain your following statement in a bit more detail (or provide any helpful reference)?
The Dirichlet over the doc-topic distributions [alpha] can be seen as a distribution and a scale parameter.
My understanding of alpha priors:
K := number of topics; α is a K-vector of positive values.
Example: K=10, alpha (cli parameter)=5.0 (default value)
For a symmetric α prior this results in a vector with alpha_k = alpha / K, i.e., [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
This initial prior will not be optimized further if one uses `--optimize-interval 0`, regardless of whether `--use-symmetric-alpha` is passed.
However, if `--optimize-interval > 0` and `--use-symmetric-alpha` is passed, the doc-topic distribution is still not optimized, but the symmetric β prior is optimized further.
Only if `--optimize-interval > 0` is used (without passing `--use-symmetric-alpha`) are both the doc-topic prior α and β optimized further, resulting in an asymmetric alpha prior and an optimized beta prior.
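Condensed into a sketch, this is my current reading of the three flag combinations. It encodes exactly the understanding I describe above (which is what I am asking you to confirm or correct), not anything derived from the Mallet source:

```python
def optimized_priors(optimize_interval, use_symmetric_alpha):
    """Which priors get optimized, per my reading of the three cases above."""
    if optimize_interval == 0:
        # Case 1: no hyperparameter optimization at all
        return {"alpha": "fixed symmetric", "beta": "fixed"}
    if use_symmetric_alpha:
        # Case 2 (my assumption): alpha untouched, only beta optimized
        return {"alpha": "fixed symmetric", "beta": "optimized"}
    # Case 3: full optimization -> asymmetric alpha, optimized beta
    return {"alpha": "asymmetric, optimized", "beta": "optimized"}
```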
Is the described behavior correct?
If this is correct, I don't understand how the concentration parameter of the prior over document-topic distributions α is optimized (as stated in the help text). What exactly do you mean by the concentration parameter in contrast to the scale?
@mimno I finally figured out what you mean by concentration parameter. This answer helped me to understand the difference between both parameterizations of Dirichlet distributions. These slides by Hanna Wallach provide an even better overview: https://people.cs.umass.edu/~wallach/talks/priors.pdf
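For anyone else who was confused by the terminology: the key idea from those slides is that a Dirichlet parameter vector alpha can equivalently be written as a single concentration parameter s times a base measure m that sums to 1. A small sketch of the decomposition (my own illustration, not Mallet code):

```python
# Decompose a Dirichlet parameter vector alpha into a concentration
# (scale) s and a base measure m, i.e. alpha = s * m with sum(m) == 1.

alpha = [0.11, 0.11, 0.11, 0.11, 0.11]   # symmetric example vector

s = sum(alpha)                  # concentration parameter (the "scale")
m = [a / s for a in alpha]      # base measure: a probability vector

# If alpha is symmetric, m is the uniform distribution [1/K]*K, so the
# whole prior is described by the single number s -- the one parameter
# that is left free under --use-symmetric-alpha.
```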
I edited my Stackoverflow answer (again) and proposed a more informative help text. What do you think?
@mimno I force-pushed my branch. Now you should be able to squash and merge this PR. For some reason your comment disappeared, probably because it referenced a commit that no longer exists (overwritten by the force-push).
May I ask whether you could answer this question about the hyperparameter optimization technique used in Mallet?
There is some confusion about the hyperparameter optimization of the Dirichlet priors. See https://stackoverflow.com/questions/52099379/mallet-hyperparameter-optimization