Open zhezhaozz opened 1 month ago
In the example below, dfmt2 can have words that are absent in dfmt1, but only those in dfmt1 are used. If you compare lda1$phi and lda2$phi, you can find that the word-topic distribution is updated as a result of Gibbs sampling on dfmt2.
lda1 <- textmodel_lda(dfmt1, k = 5)
lda2 <- textmodel_lda(dfmt2, model = lda1)
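One way to picture why phi can change is a toy collapsed Gibbs sampler in base R, in which word-topic counts from an earlier fit are carried over and then updated by the tokens of a new document. This is only a sketch of the general technique under assumed priors, not seededlda's internal code; every name and number here (n_kw, n_dk, alpha, beta, words) is made up for illustration.

```r
set.seed(1)

# Hypothetical word-topic counts carried over from an earlier fit (topics x words)
n_kw <- matrix(c(6, 3, 1,
                 1, 2, 7),
               nrow = 2, byrow = TRUE)
alpha <- 0.5  # assumed prior on document-topic proportions
beta <- 0.1   # assumed prior on topic-word proportions
k <- nrow(n_kw)
V <- ncol(n_kw)

# Word-topic distribution implied by the carried-over counts
phi_before <- (n_kw + beta) / (rowSums(n_kw) + V * beta)

# One new document as a vector of word indices (columns of n_kw)
words <- c(3, 3, 2, 3, 1, 3)

# Random initial topic for each token, then add the tokens to the counts
z <- sample.int(k, length(words), replace = TRUE)
n_dk <- tabulate(z, nbins = k)
for (i in seq_along(words)) n_kw[z[i], words[i]] <- n_kw[z[i], words[i]] + 1

for (iter in 1:200) {
  for (i in seq_along(words)) {
    w <- words[i]
    # Take token i out of both count tables
    n_dk[z[i]] <- n_dk[z[i]] - 1
    n_kw[z[i], w] <- n_kw[z[i], w] - 1
    # Conditional probability of each topic for this token
    p <- (n_dk + alpha) * (n_kw[, w] + beta) / (rowSums(n_kw) + V * beta)
    z[i] <- sample.int(k, 1, prob = p)
    # Put the token back under its newly drawn topic
    n_dk[z[i]] <- n_dk[z[i]] + 1
    n_kw[z[i], w] <- n_kw[z[i], w] + 1
  }
}

# The counts now include the new tokens, so the word-topic distribution has moved
phi_after <- (n_kw + beta) / (rowSums(n_kw) + V * beta)
theta <- (n_dk + alpha) / (length(words) + k * alpha)
print(round(phi_after - phi_before, 3))
print(round(theta, 3))
```

In this toy setup the new tokens are added to the same count table that produced phi_before, which is why phi_after differs from it.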
Thank you so much for your response.
If I want to use seeded LDA for document classification, when predicting labels for unseen documents, should I just use lda1$phi as it is? For example,
dfmt2 %*% lda1$phi
and then assign labels based on the topic distribution calculated for each document.
I think I'm confused about why updating phi is needed. Your help is much appreciated.
Since lda1$phi does not contain some of the words in dfmt2, it will simply error.
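To make the mismatch concrete, here is a small base-R sketch with plain matrices standing in for the dfm and for phi (topics in rows, words in columns). The objects phi and new_dtm and their values are made up for illustration. The normalised score at the end is only a rough stand-in for a topic distribution, not what textmodel_lda computes.

```r
# Toy word-topic matrix from a hypothetical fitted model (topics x words)
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("topic1", "topic2"),
                              c("apple", "banana", "cherry")))

# Toy counts for unseen documents; "durian" did not occur in the training data
new_dtm <- matrix(c(2, 0, 1, 3,
                    0, 1, 4, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"),
                                  c("apple", "banana", "cherry", "durian")))

# The direct product fails because the dimensions do not line up
direct <- tryCatch(new_dtm %*% phi, error = function(e) conditionMessage(e))
print(direct)

# Keep only the words both objects share, then multiply by the transpose of phi
shared <- intersect(colnames(new_dtm), colnames(phi))
scores <- new_dtm[, shared, drop = FALSE] %*% t(phi[, shared, drop = FALSE])

# Scale each row to sum to one; a crude score per topic, not a posterior
theta_rough <- scores / rowSums(scores)
print(round(theta_rough, 3))
```

With real quanteda objects, dfm_match() should serve the same purpose as the intersect() step, since it aligns the features of a dfm to a given feature vector.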
I tentatively added the add_terms argument in the dev-new-words branch. I cannot promise that I will merge the branch to master, but you can install it and try. Let me know how it works.
devtools::install_github("koheiw/seededlda", ref = "dev-new-words")
Please try update_model = TRUE.
require(quanteda)
require(seededlda)
options(seededlda_residual_name = "other")
toks <- tokens(data_corpus_moviereviews[1:500],
               remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords(), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")
dfmt_train <- dfm_trim(head(dfmt, 450))
dfmt_test <- dfm_trim(tail(dfmt, 50))
lda <- textmodel_lda(dfmt_train, k = 5)
lda2 <- textmodel_lda(dfmt_test, model = lda, update_model = TRUE)
#> Warning: k, alpha, beta and gamma values are overwritten by the fitted model
identical(colnames(lda$phi), colnames(lda2$phi)) # different feature sets
#> [1] FALSE
identical(docnames(dfmt_test), rownames(lda2$theta)) # trained on dfmt_test
#> [1] TRUE
Hi, thank you so much for developing this wonderful package. I have a conceptual question regarding how textmodel_lda predicts topics for unseen documents. I know this can be achieved by specifying the model argument in the function, but I wish to understand how it works in the background. Specifically,
Sorry if the above questions are too basic, and thank you for taking the time to read this post.
Zhe