koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

Add goodness of fit metrics #26

Open contefranz opened 2 years ago

contefranz commented 2 years ago

I would like to know whether there is any implementation of the standard goodness-of-fit metrics for your textmodel_lda class. For instance, here is a SO post which didn't get much traction. I am wondering whether, in the case of seeded LDA, the standard metrics still apply.

Could you give me some information about any upcoming implementation? Or could you suggest a direct method that applies to your object class? Thanks!

koheiw commented 2 years ago

Thank you for the post.

I did not think users of seeded LDA should worry about model fit because the number of topics is theoretically determined, but they might need a way to determine k for unseeded LDA. I still have to look into how to compute perplexity, but a divergence measure is straightforward, as below. According to this statistic, k should be around 10.

require(seededlda)
require(quanteda)
require(Matrix)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

for (k in seq(5, 50, 5)) {
    lda <- textmodel_lda(head(dfmt, 450), k)
    div <- proxyC::dist(lda$phi, method = "kullback")
    diag(div) <- NA
    fit <- mean(div, na.rm = TRUE)
    cat(k, fit, "\n")
}
5 3.456055 
10 3.487706 
15 3.453193 
20 3.413442 
25 3.312763 
30 3.221957 
35 3.166234 
40 3.076886 
45 2.98464 
50 2.935299

Deveaud et al. used the Jensen-Shannon divergence, but I am using Kullback-Leibler here because proxyC does not have that measure (I should probably add it).
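
For reference, here is a minimal sketch of the Jensen-Shannon version computed directly on phi; the js_divergence() helper is only illustrative and assumes that each row of phi is a proper probability distribution over the words.

js_divergence <- function(phi) {
    phi <- as.matrix(phi)
    # KL divergence with the 0 * log(0) = 0 convention
    kl <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))
    k <- nrow(phi)
    div <- matrix(NA_real_, k, k)
    for (i in seq_len(k)) {
        for (j in seq_len(k)) {
            if (i == j) next
            m <- (phi[i, ] + phi[j, ]) / 2          # mixture of the two topics
            div[i, j] <- (kl(phi[i, ], m) + kl(phi[j, ], m)) / 2
        }
    }
    mean(div, na.rm = TRUE)
}

js_divergence(lda$phi)  # e.g. on the last model fitted in the loop above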

I can make changes in the LDA functions to return the divergence measure if desired, but there is no guarantee that the topics are most meaningful when the statistic is highest.

contefranz commented 2 years ago

Thanks for your answer. Here are a few thoughts.

I did not think users of seeded LDA should worry about model fit because the number of topics is theoretically determined, but they might need a way to determine k for unseeded LDA.

I agree in principle. My line of thinking is as follows: even if it is true that seeded LDA returns a pre-specified number of topics, that specification is highly subjective, since I am the one deciding which keywords the model should care about. Suppose instead that I come up with an evident misspecification of the topics. How would I know that phi and theta contain estimates that are wrong by construction?

For instance, can we use the residual estimation as a robustness check of the efficacy of the model?

Deveaud et al. used the Jensen-Shannon divergence, but I am using Kullback-Leibler here because proxyC does not have that measure (I should probably add it).

That's absolutely fine. KL divergence works just fine.

I can make changes in the LDA functions to return the divergence measure if desired, but there is no guarantee that the topics are most meaningful when the statistic is highest.

Totally agree on this one, but again, how would I know how coherent (i.e., interpretable) my topics are?

I am well aware of the problems around optimal topic identification in LDA. As a matter of fact, we implemented a method to assess optimality that uses a simple chi-square test instead of adopting the perplexity index as a goodness-of-fit metric.

I guess the big question is whether we can compute any measure of likelihood from your model. If the answer is no, then the approach is basically not testable in terms of its explanatory power. If the answer is yes, then the question is how to compute it so as to arrive at a metric comparable to perplexity.
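
For instance, something along these lines is what I have in mind. This is only a rough sketch (perplexity_sketch() is my own name), assuming theta is the documents-by-topics matrix, phi is the topics-by-words matrix, and x$data is the dfm the model was fitted on.

perplexity_sketch <- function(x) {
    m <- as.matrix(x$data)                  # observed word counts per document (dense is fine for a small corpus)
    p <- x$theta %*% x$phi[, colnames(m)]   # predicted word probabilities per document
    ll <- sum(m * log(p))                   # in-sample log-likelihood
    exp(-ll / sum(m))                       # perplexity
}

perplexity_sketch(lda)  # e.g. on one of the models fitted above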

Sorry for being verbose but I think the concern is real and has to be addressed. Thoughts?

koheiw commented 2 years ago

Thank you for the link to your project. I will read the paper.

I understand that users are always unsure about their choice of seed words, so I am willing to offer some indicator. The divergence measure is easy to add via a new function divergence() or something.

It would be nice to offer the likelihood of the parameters (e.g. perplexity), but we can do something similar by re-training an existing model on new data and comparing the old and new models. If the old model fits well, the topic-word distributions should not change much when trained on the new data.

In this example, when k = 10, the KL divergence between topics within the old model is higher, while the divergence between corresponding topics in the old and new models is smaller. Both suggest that k = 10 is better than k = 20.

require(seededlda)
require(quanteda)
require(Matrix)

data("data_corpus_moviereviews", package = "quanteda.textmodels")
corp <- head(data_corpus_moviereviews, 500)
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en'), min_nchar = 2) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop")

divergence <- function(x) {
    div <- proxyC::dist(x$phi, method = "kullback")
    diag(div) <- NA
    mean(div, na.rm = TRUE)
}

lda10a <- textmodel_lda(head(dfmt, 450), 10)
lda20a <- textmodel_lda(head(dfmt, 450), 20)

divergence(lda10a)
#> [1] 3.538062
divergence(lda20a)
#> [1] 3.379323

lda10b <- textmodel_lda(tail(dfmt, 50), model = lda10a)
#> Warning: k, alpha and beta values are overwriten by the fitted model
lda20b <- textmodel_lda(tail(dfmt, 50), model = lda20a)
#> Warning: k, alpha and beta values are overwriten by the fitted model

mean(diag(proxyC::dist(lda10a$phi, lda10b$phi, method = "kullback", diag = TRUE)))
#> [1] 0.01251931
mean(diag(proxyC::dist(lda20a$phi, lda20b$phi, method = "kullback", diag = TRUE)))
#> [1] 0.01661152

What do you think?

contefranz commented 2 years ago

Thanks again for the support. I like the approach, though, as we agreed, it is not always the case that the better model has a higher goodness-of-fit metric. If you could add a function like divergence() to the package, that would of course be great.

At this point, my feeling is that, in the absence of a statistical test, we should include other metrics along the lines of topic coherence or similar. topicmodels does not seem to ship that, but I think it is one of the most reliable ways to assess the interpretability of the topics on top of their statistical soundness. I am sure you are familiar with the metric, but here is the paper that introduced it.
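
As far as I understand it, the measure from that paper (the UMass coherence) could be sketched on your objects roughly as follows; the name umass_coherence() is only illustrative, and it assumes the fitted dfm is stored in x$data as in your snippets.

umass_coherence <- function(x, n = 10) {
    top <- terms(x, n)                      # top n terms per topic
    apply(top, 2, function(words) {
        b <- as.matrix(dfm_weight(x$data[, words], scheme = "boolean"))  # document-term incidence (0/1)
        co <- crossprod(b)                  # co-document frequencies
        df <- colSums(b)                    # document frequencies
        score <- 0
        for (m in 2:length(words)) {
            for (l in seq_len(m - 1)) {
                # log((co-document frequency + 1) / document frequency of the conditioning term)
                score <- score + log((co[words[m], words[l]] + 1) / df[[words[l]]])
            }
        }
        score
    })
}

mean(umass_coherence(lda))  # average over topics, e.g. to compare across k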

The package topicdoc (which seems to have fallen behind in terms of development) does appear to implement a way to compute topic coherence on tm-class objects estimated with topicmodels. The function is called topic_coherence().

Thoughts?

koheiw commented 2 years ago

I will add divergence(x) to the package first. Then I will also add coherence(). Whether the two measures agree with each other on the optimal k is a question that we should study empirically.

contefranz commented 2 years ago

That sounds great, thank you very much!

Regarding the agreement between the two measures, well, that is an interesting question that demands an answer.

koheiw commented 2 years ago

This is my coherence function, but the statistic only gets lower as k gets higher...

coherence <- function(x, n = 10) {
    h <- apply(terms(x, n), 2, function(y) {
        d <- x$data[, y]                                                    # dfm restricted to the topic's top n terms
        e <- Matrix::Matrix(docfreq(d), nrow = nfeat(d), ncol = nfeat(d))   # terms' document frequencies (recycled column-wise)
        f <- fcm(d, count = "boolean") + 1                                  # co-document frequencies, smoothed by 1
        g <- Matrix::band(log(f / e), 1, ncol(f))                           # keep the band above the diagonal
        sum(g)
    })
    sum(h)
}

for (k in seq(5, 50, 5)) {
    lda <- textmodel_lda(head(dfmt, 450), k)
    coh <- coherence(lda)
    cat(k, coh, "\n")
}
5 -433.7141 
10 -895.6645 
15 -1329.028 
20 -1746.271 
25 -2126.106 
30 -2503.134 
35 -2987.627 
40 -3312.912 
45 -3686.489 
50 -3957.752

contefranz commented 2 years ago

That's weird. Topic coherence should increase with k not the other way around. What am I missing here?

koheiw commented 2 years ago

It could be my bad, but I don't know what is wrong in my code.

masa126 commented 2 years ago

Hi, I'm following your code above, but proxyC returns the error below:

k <- 5
lda <- textmodel_lda(head(dfmt, 450), k)
div <- proxyC::dist(lda$phi, method = "kullback")
Error in proxy(x, y, margin, method, p = p, smooth = smooth, drop0 = drop0,  :
  x must be a sparseMatrix

Are there any conditions for using proxyC::dist? Thanks,

koheiw commented 2 years ago

You need the latest proxyC. Fixed via 8190c46.
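
For example, to update and check the installed version (remotes::install_github() is shown only as one possible route):

# Make sure the installed proxyC includes the fix referenced above, then re-run the snippet.
packageVersion("proxyC")
install.packages("proxyC")                   # latest CRAN release
# remotes::install_github("koheiw/proxyC")   # or the development version, if needed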

masa126 commented 2 years ago

It works. Thank you.