koheiw / seededlda

LDA for semisupervised topic modeling
https://koheiw.github.io/seededlda/

using predict() #19

Closed erdnaxel closed 3 years ago

erdnaxel commented 3 years ago

I tried predict() and it seems to work, but then I can't seem to re-attach the predictions to the documents in the right order. I'm not super advanced and I'm probably missing something obvious.

```r
new_corpus <- corpus(new_data, text_field = "text")

dfmt <- dfm(new_corpus, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.70, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
dfmt <- dfm_subset(dfmt, ntoken(dfmt) > 1)

predictions <- predict(slda, newdata = dfmt, max_iter = 2000,
                       verbose = quanteda_options("verbose"))

new_data_from_dfmt <- docvars(dfmt)
new_data_from_dfmt$likely_topic <- predictions
```

The predictions are pretty much random, so I think the order has been lost in the shuffle... I don't know how I'm supposed to do it properly.
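[Editor's note: one common cause of "shuffled" predictions is positional re-attachment after `dfm_subset()` has dropped short documents, so the vectors no longer line up row for row. A minimal base-R sketch of name-based alignment, using toy stand-ins (`predictions`, `all_docs`, and the document names are invented for illustration; in the real code they would come from `predict()` and the original data frame):]

```r
# Toy stand-ins: predict() returns a named vector with one element per
# document that survived dfm_subset(); here "text1" was dropped upstream.
predictions <- c(text2 = "english", text3 = "french")
all_docs <- data.frame(doc_id = c("text1", "text2", "text3"))

# Align by document name, not by position: dropped documents get NA
# instead of silently shifting every later label by one row.
all_docs$likely_topic <- predictions[match(all_docs$doc_id, names(predictions))]
```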

(I've been using the model a lot and it's great, but I re-run it every time I want to apply it to new data.)

koheiw commented 3 years ago

It could be a limitation of the algorithm. I can investigate if you supply replicable code and data.

erdnaxel commented 3 years ago

Thanks for the response. I created a replicable example, but it ended up working! I then went back to the original problem and it worked there as well. 💯

In this code, I create a model to predict if a bit of text is in English or French. It works very well.

```r
library(gutenbergr)
text1 <- gutenberg_download(c(15))     ## English
text2 <- gutenberg_download(c(41211))  ## French
Encoding(text2$text) <- "latin1"
texts <- rbind(text1, text2)

library(tidyverse)
library(magrittr)
texts %<>%
  filter(text != "")

library(seededlda)
library(quanteda)
library(quanteda.textmodels)
quanteda_options(threads = 7)
library(topicmodels)

text_corpus <- corpus(texts, text_field = "text")

dfmt <- dfm(text_corpus, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.70, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")
dfmt <- dfm_subset(dfmt, ntoken(dfmt) > 1)

dict <- dictionary(list(
  english = c("the", "of", "many", "am", "i", "you", "we", "they", "will", "be"),
  french  = c("le", "la", "les", "suis", "je", "vous", "tu", "ils", "elles")
))

## create the model
set.seed(123456789)
slda <- textmodel_seededlda(dfmt, dict, max_iter = 5000,
                            alpha = 0.0001, beta = 0.0001,
                            residual = FALSE,
                            verbose = quanteda_options("verbose"))

## find the most likely topic for each text
text_topic <- seededlda::topics(slda)
texts_topic <- docvars(dfmt)
texts_topic$likely_topic <- text_topic

## results are pretty much perfect

## predict on new data
text3 <- gutenberg_download(c(35))     ## English
text4 <- gutenberg_download(c(51381))  ## French

new_data <- rbind(text3, text4) %>%
  filter(text != "")

## create a corpus for the new data
new_corpus <- corpus(new_data, text_field = "text")

dfmt <- dfm(new_corpus, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2) %>%
  dfm_trim(min_termfreq = 0.70, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

## make predictions
predictions <- predict(slda, newdata = dfmt, max_iter = 2000,
                       verbose = quanteda_options("verbose"))

## create the final table
new_data_from_dfmt <- docvars(dfmt)
new_data_from_dfmt$likely_topic <- predictions
new_data_with_pred <- cbind(new_data, new_data_from_dfmt)

## results are very accurate
```
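[Editor's note: "results are very accurate" can be quantified by cross-tabulating the predicted topic against the known source of each line. A minimal base-R sketch, using invented toy vectors in place of the real `likely_topic` column and the known book language:]

```r
# Toy stand-ins for the predicted topics and the true language per document;
# in the real example these would be new_data_with_pred$likely_topic and a
# label derived from gutenberg_id.
predicted <- c("english", "english", "french", "french", "english")
actual    <- c("english", "english", "french", "french", "french")

# Confusion table: rows = predicted topic, columns = true language.
confusion <- table(predicted, actual)

# Overall accuracy is the share of documents on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
```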