koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

how to predict on new data? #9

Closed erdnaxel closed 3 years ago

erdnaxel commented 3 years ago

hello:

love the package!!

i’m wondering how to apply the model to new data?

koheiw commented 3 years ago

Hi @erdnaxel

In the original GibbsLDA++, topics of unseen documents are inferred in another round of Gibbs sampling. I haven't implemented this function, because I didn't think many people separate the fitting and prediction steps with LDA.

With the current version, you can still predict topics of unseen documents using the topic-word distribution (phi). Here, x should be a fitted LDA object, and newdata is a DFM (document-feature matrix).
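For readers unfamiliar with that approach, the following is a rough sketch of what inferring topics for one unseen document by Gibbs sampling can look like when the fitted topic-word matrix is held fixed. It is neither GibbsLDA++'s code nor part of seededlda; the function name `infer_theta`, the parameters `alpha` and `iter`, and the toy `phi` values are all assumptions made up for illustration.

```r
# Hypothetical sketch: infer the topic mixture of one unseen document by
# Gibbs sampling with the fitted topic-word matrix phi held fixed.
# infer_theta(), alpha and iter are illustrative names, not seededlda's API.
infer_theta <- function(words, phi, alpha = 0.1, iter = 200) {
    k <- nrow(phi)
    z <- sample.int(k, length(words), replace = TRUE)  # random initial topics
    n <- tabulate(z, nbins = k)                        # tokens per topic
    for (it in seq_len(iter)) {
        for (i in seq_along(words)) {
            n[z[i]] <- n[z[i]] - 1                     # remove token i
            p <- (n + alpha) * phi[, words[i]]         # unnormalised P(topic | rest)
            z[i] <- sample.int(k, 1, prob = p)         # resample its topic
            n[z[i]] <- n[z[i]] + 1                     # add it back
        }
    }
    (n + alpha) / sum(n + alpha)                       # topic shares from the final sample
}

# Toy phi: 2 topics x 3 words (made-up values; each row sums to 1)
phi <- matrix(c(0.6, 0.3, 0.1,
                0.1, 0.1, 0.8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("economy", "sports"),
                              c("tax", "budget", "goal")))

set.seed(1)
theta <- infer_theta(rep("goal", 6), phi)
names(theta) <- rownames(phi)
```

Because the result comes from a random sampler, it varies with the seed and the number of iterations, which is one reason a sampling-based prediction can disagree with a deterministic one.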

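library(quanteda)  # provides dfm_match()

# x: fitted LDA object; newdata: a DFM of unseen documents (optional)
predict <- function(x, newdata = NULL) {
    # use the new documents if supplied, otherwise the training data
    if (!is.null(newdata)) {
        data <- newdata
    } else {
        data <- x$data
    }
    # align the features of the DFM with the columns of phi
    data <- dfm_match(data, colnames(x$phi))
    # score each document against each topic and pick the best-scoring topic
    temp <- data %*% t(x$phi)
    result <- factor(max.col(temp), labels = rownames(x$phi),
                     levels = seq_len(nrow(x$phi)))
    # documents with no matching features get no topic
    result[rowSums(data) == 0] <- NA
    return(result)
}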

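Please be aware that the result of predict() can differ from that of topics(), because the two rely on different algorithms: topics() is based on the Gibbs sampling done during fitting, whereas this predict() only scores documents against phi.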
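To make the scoring rule in the function above concrete, here is a small base-R sketch of the same steps on a toy example, without quanteda. The topic names, words and counts are invented for illustration, and a plain matrix stands in for the DFM.

```r
# Toy phi: 2 topics x 3 words (made-up values; each row sums to 1)
phi <- matrix(c(0.6, 0.3, 0.1,
                0.1, 0.1, 0.8),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("economy", "sports"),
                              c("tax", "budget", "goal")))

# Toy document-feature counts for three unseen documents;
# doc3 contains none of the model's words
newdata <- matrix(c(2, 1, 0,
                    0, 0, 3,
                    0, 0, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2", "doc3"),
                                  c("tax", "budget", "goal")))

# Score of each document on each topic: word counts weighted by phi
scores <- newdata %*% t(phi)

# Pick the highest-scoring topic per document; ties go to the first topic
pred <- factor(max.col(scores, ties.method = "first"),
               levels = seq_len(nrow(phi)), labels = rownames(phi))

# Documents with no known words get no prediction
pred[rowSums(newdata) == 0] <- NA
```

Here doc1 scores 1.5 on economy and 0.3 on sports, doc2 scores 0.3 and 2.4, and doc3 is set to NA. Note that max.col() breaks ties at random unless ties.method is set, so repeated calls on tied scores may not agree.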

tomseinen commented 3 years ago

Came here for the same question as @erdnaxel. I think implementing the predict function will be much appreciated.

Great work!

erdnaxel commented 3 years ago

thank you, i really appreciate the response! i will try it out as soon as i can.

koheiw commented 3 years ago

Guys, I created predict() in the issue-9 branch. Please give it a try.

koheiw commented 3 years ago

I'm closing this since the branch has been merged. Please open a new issue if you run into problems.