koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

How does `textmodel_lda` predict topics for unseen documents? #82

Open zhezhaozz opened 1 month ago

zhezhaozz commented 1 month ago

Hi, thank you so much for developing this wonderful package. I have a conceptual question about how textmodel_lda predicts topics for unseen documents. I know this can be done by specifying the model argument in the function, but I would like to understand how it works in the background. Specifically,

  1. I saw that Gibbs sampling was running. Does this mean that the posterior word-topic distribution is being updated?
  2. How does the model handle unseen words?

Sorry if the above questions are too basic, and thank you for taking the time to read this post.

Zhe

koheiw commented 3 weeks ago

In the example below, dfmt2 can contain words that are absent from dfmt1, but only those in dfmt1 are used. If you compare lda1$phi and lda2$phi, you will find that the word-topic distribution is updated as a result of Gibbs sampling on dfmt2.

lda1 <- textmodel_lda(dfmt1, k = 5)         # fit a new model on dfmt1
lda2 <- textmodel_lda(dfmt2, model = lda1)  # predict topics for dfmt2 with the fitted model
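
To see the update concretely, you can inspect the difference between the two matrices directly. This is just an illustration using the objects fitted above; it works because both models share the same feature set in this example:

range(lda2$phi - lda1$phi)  # non-zero values show phi was updated by Gibbs sampling on dfmt2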

zhezhaozz commented 3 weeks ago

Thank you so much for your response.

If I want to use seeded LDA for document classification, should I just use lda1$phi as it is when predicting labels for unseen documents? For example,

dfmt2 %*% lda1$phi

and then assign labels based on the topic distribution calculated for each document.

I think I'm confused about why updating phi is needed. Your help is much appreciated.

koheiw commented 3 weeks ago

Since lda1$phi does not contain some of the words in dfmt2, that multiplication will simply fail with an error: the feature sets, and hence the matrix dimensions, do not match.
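
If you just want a rough projection onto the existing topics without refitting, one workaround is to align the test documents to the training features first, e.g. with quanteda::dfm_match(), and transpose phi so the dimensions conform. This is only a sketch, not something the package does for you (dfmt2_matched, doc_topic and labels are hypothetical names):

dfmt2_matched <- dfm_match(dfmt2, features = colnames(lda1$phi))  # keep training features only
doc_topic <- as.matrix(dfmt2_matched) %*% t(lda1$phi)             # documents x topics scores
labels <- colnames(doc_topic)[max.col(doc_topic)]                 # most likely topic per document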

I have tentatively added an add_terms argument in the dev-new-words branch. I cannot promise that I will merge the branch into master, but you can install it and try it out. Let me know how it works.

devtools::install_github("koheiw/seededlda", ref = "dev-new-words")

koheiw commented 3 weeks ago

Please try update_model = TRUE.

require(quanteda)
require(quanteda.textmodels)  # provides data_corpus_moviereviews
require(seededlda)

options(seededlda_residual_name = "other")

toks <- tokens(data_corpus_moviereviews[1:500],
               remove_punct = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords(), min_nchar = 2) %>%
    dfm_trim(max_docfreq = 0.1, docfreq_type = "prop")

dfmt_train <- dfm_trim(head(dfmt, 450))  # first 450 reviews for fitting
dfmt_test <- dfm_trim(tail(dfmt, 50))    # last 50 reviews for prediction

lda <- textmodel_lda(dfmt_train, k = 5)
lda2 <- textmodel_lda(dfmt_test, model = lda, update_model = TRUE)
#> Warning: k, alpha, beta and gamma values are overwritten by the fitted model
identical(colnames(lda$phi), colnames(lda2$phi)) # different feature sets
#> [1] FALSE

identical(docnames(dfmt_test), rownames(lda2$theta)) # trained on dfmt_test
#> [1] TRUE
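
For classification, you can then take the most likely topic for each test document, for example with topics(), which picks the highest-probability topic from lda2$theta:

topics(lda2)  # most likely topic for each document in dfmt_test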