Open mripley1991 opened 4 years ago
@mripley1991 Based on https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.mllib.clustering.LDAModel, I believe the beta values are the "concentration parameters" that parameterize the Dirichlet word distribution for each topic, and they can be any positive real number. See https://en.wikipedia.org/wiki/Dirichlet_distribution, although on that wiki page the vector of real numbers determining the shape of the distribution is called alpha. The page also contains a helpful visualization of how different values influence the shape of the distribution.
Also, this is the same beta described in https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Model
Thank you for the information! That's really helpful.
Would it be feasible to add something like a `probabilities = TRUE` parameter to `ml_lda()` that would convert the betas from Dirichlet parameters to true probabilities? This would mean that users could generate outputs that line up with other LDA packages and can be more easily interpreted.
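A minimal sketch of that conversion, assuming the beta values really are per-topic Dirichlet parameters as suggested above: dividing each topic's values by their sum gives the expected word probabilities under that Dirichlet. The numbers and the `normalize_topic_weights` helper are hypothetical, and this is plain Python for illustration, not part of `ml_lda()` or sparklyr.

```python
def normalize_topic_weights(weights):
    """Convert one topic's positive weights into probabilities that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic weights must have a positive sum")
    return [w / total for w in weights]


# Hypothetical beta values for two topics over a three-word vocabulary.
beta = {
    "topic_1": [12.0, 3.0, 5.0],
    "topic_2": [0.5, 8.0, 1.5],
}

# Normalize each topic independently so every topic is a distribution over words.
probabilities = {
    topic: normalize_topic_weights(weights) for topic, weights in beta.items()
}

for topic, probs in probabilities.items():
    print(topic, [round(p, 3) for p in probs])
```

After this step every topic's values lie in [0,1] and sum to 1, which is the form most LDA guides describe.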
How should the "`beta`" output of `ml_lda()` be interpreted? Most guides for how to understand Latent Dirichlet Allocation indicate that this output should give the word distribution for each topic as a list of probabilities, but when I perform the algorithm I get outputs outside the range [0,1]. Is the "`beta`" a transformation of these probabilities? If so, could this be clarified in the documentation for the `ml_lda()` method?
Example code:
Output (first six rows):