koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

Compatibility with LDAvis #4

Closed by JBGruber 3 years ago

JBGruber commented 3 years ago

Hi @koheiw,

I was giving the package a go and really like how you implemented things so far.

I noticed a strange issue when trying to use LDAvis though:

library(quanteda)
library(seededlda)
library(LDAvis)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
dfmt <- dfm(corp, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

# unsupervised LDA
lda <- textmodel_lda(dfmt, 6)

# use LDAvis to explore topics
json <- createJSON(phi = lda$phi,
                   theta = lda$theta, 
                   doc.length = quanteda::ntoken(lda$x),
                   vocab = quanteda::featnames(lda$x), 
                   term.frequency = quanteda::featfreq(lda$x))
#> Error in createJSON(phi = lda$phi, theta = lda$theta, doc.length = quanteda::ntoken(lda$x), : Rows of phi don't all sum to 1.
serVis(json)

This error seems to occur every time, and the message is accurate:

rowSums(lda$phi)
#>   topic1   topic2   topic3   topic4   topic5   topic6 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000136

Is this intentional or a bug? My understanding is that each row of phi is a topic's probability distribution over terms, so every row should sum to 1. Yet the last topic's row always sums to slightly more than 1.
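As a possible stopgap, the check and a row-wise renormalization can be sketched in base R. The matrix below is a made-up stand-in for lda$phi (assuming phi is a topics-by-terms matrix, as the rowSums output above suggests), and the tolerance `tol` is an arbitrary choice. Rescaling only hides the symptom; it does not explain why the last row is off.

```r
# Toy stand-in for lda$phi: 3 topics x 4 terms. The values are invented;
# the last row is deliberately slightly above 1 to mimic the reported symptom.
phi <- matrix(
  c(0.10, 0.20, 0.30, 0.40,
    0.25, 0.25, 0.25, 0.25,
    0.40, 0.30, 0.20, 0.100136),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("topic", 1:3), c("a", "b", "c", "d"))
)

# Flag rows whose sum deviates from 1 by more than a small tolerance
# (tol is an assumed value, not one taken from LDAvis or seededlda).
tol <- 1e-8
bad <- abs(rowSums(phi) - 1) > tol

# Divide each row by its own sum. A matrix divided by a vector of length
# nrow(phi) recycles down the columns, so element [i, j] is divided by
# the sum of row i.
phi_fixed <- phi / rowSums(phi)

print(bad)
print(rowSums(phi_fixed))
```

With a real model, `phi_fixed` would be passed as the `phi` argument of createJSON in place of lda$phi.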

koheiw commented 3 years ago

Thanks @JBGruber. I can confirm that there was a bug. It would be great to make LDAvis work with this package!

koheiw commented 3 years ago

Please try the patched master.

JBGruber commented 3 years ago

Works great now, thanks! I was more worried that the rows of phi didn't sum to 1 than about LDAvis itself. Great that you solved both.