
How to interpret the output of ml_lda()? #2556

Open mripley1991 opened 4 years ago

mripley1991 commented 4 years ago

How should the "beta" output of ml_lda() be interpreted? Most guides to Latent Dirichlet Allocation indicate that this output should give the word distribution for each topic as a list of probabilities, but when I run the algorithm I get values outside the range [0, 1].

Is the "beta" a transformation of these probabilities? If so, could this be clarified in the documentation for the ml_lda() method?

Example code:

# load relevant libraries and establish spark connection
library(dplyr)      
library(sparklyr)
library(magrittr)

sc <- spark_connect(master = "local")

# define document corpus and convert into spark
text_in <- data.frame(
  text = c(
    "i am stuck with using sparklyr",
    "please can you help with my coding problem",
    "what does beta mean how should I interpret it"
  )
)
text_spark <- copy_to(sc, text_in, overwrite = TRUE)

# apply LDA method for 2 topics
set.seed(1234)
lda_model <- text_spark %>%
  ml_lda(~text, k = 2)

# view outputs using tidy() and arrange by term
lda_model %>% 
  tidy() %>%
  arrange(term)

Output (first six rows):

# A tibble: 20 x 3
   topic term       beta
   <int> <chr>     <dbl>
 1     0 beta      0.895
 2     1 beta      1.09 
 3     0 coding    0.871
 4     1 coding    0.982
 5     0 help      1.14 
 6     1 help      1.09 
yitao-li commented 4 years ago

@mripley1991 Based on https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.mllib.clustering.LDAModel, I believe the beta values are "concentration parameters": they parameterize the Dirichlet distribution over words for each topic and can be any positive real number (see https://en.wikipedia.org/wiki/Dirichlet_distribution -- note that on the wiki page this vector of positive reals determining the shape of the distribution is called alpha; the page also contains a helpful visualization of how different values influence the shape of the distribution).

Also, it's the same beta as described in https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation#Model
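
If you want something that reads like a per-topic word probability, one rough workaround (just a sketch on top of the tidy() output above, not anything built into sparklyr) is to normalise the betas within each topic, since the mean of a Dirichlet(alpha) distribution is alpha_i / sum(alpha):

# not part of sparklyr: rescale the betas within each topic so they sum to 1;
# the resulting "prob" column is the mean of the per-topic Dirichlet and can be
# read as an approximate word probability for that topic
library(dplyr)

lda_model %>%
  tidy() %>%
  group_by(topic) %>%
  mutate(prob = beta / sum(beta)) %>%
  ungroup() %>%
  arrange(topic, desc(prob))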

mripley1991 commented 4 years ago

Thank you for the information! That's really helpful.

Would it be feasible to add something like a probabilities = TRUE parameter to ml_lda() that would convert the betas from Dirichlet parameters to true probabilities? That way, users could generate output that lines up with other LDA packages and is easier to interpret.
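
Purely to illustrate what I mean (this helper does not exist in sparklyr and the name is invented; it assumes dplyr is loaded and uses the lda_model from the example above), the conversion could be wrapped up so the betas line up with the probability-style output other LDA tooling reports:

# hypothetical helper (not an existing sparklyr function) that rescales the
# tidy() betas so each topic's weights sum to 1
lda_tidy_probabilities <- function(model) {
  model %>%
    tidy() %>%
    group_by(topic) %>%
    mutate(beta = beta / sum(beta)) %>%
    ungroup()
}

lda_tidy_probabilities(lda_model)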