quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Add textmodel maxent? #14

Open JBGruber opened 5 years ago

JBGruber commented 5 years ago

First of all: thanks for this great package! Since RTextTools was recently removed from CRAN, I was trying to find a good solution for supervised machine learning (SML) on text data in R and was a bit frustrated by caret, which is not optimized for text. So I think quanteda.classifiers fills a real gap in the R ecosystem right now.

One thing that I have been missing so far, though, is an implementation of multinomial logistic regression, also known as maximum entropy. In my experience it often outperforms other algorithms (especially for multiclass classification). Furthermore, the implementation by Timothy Jurka (Jurka 2012) already works pretty well with the ‘quantedaverse’. Here is a quick example with the movies corpus:

# remotes::install_github("quanteda/quanteda.classifiers") 
# tensorflow::install_tensorflow()
# install.packages("https://cran.r-project.org/src/contrib/Archive/maxent/maxent_1.3.3.1.tar.gz", type = "source", repos = NULL)
library(quanteda)
library(quanteda.corpora)
library(maxent)
corp <- data_corpus_movies
set.seed(300)
train_size <- length(docnames(corp)) * 0.9
id_train <- sample(docnames(corp), size = train_size, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
  dfm(stem = TRUE) %>% 
  dfm_select(pattern = training_dfm, 
             selection = "keep")

# train model on sentiment
model <- maxent(training_dfm, docvars(training_dfm, "Sentiment"))

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(model, feature_matrix = test_dfm) %>% 
  tibble::as_tibble()

table(actual_class, predicted_class$labels) %>% 
  caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             
## actual_class neg pos
##          neg  84  13
##          pos  18  85
##                                           
##                Accuracy : 0.845           
##                  95% CI : (0.7873, 0.8922)
##     No Information Rate : 0.51            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6902          
##                                           
##  Mcnemar's Test P-Value : 0.4725          
##                                           
##             Sensitivity : 0.8235          
##             Specificity : 0.8673          
##          Pos Pred Value : 0.8660          
##          Neg Pred Value : 0.8252          
##              Prevalence : 0.5100          
##          Detection Rate : 0.4200          
##    Detection Prevalence : 0.4850          
##       Balanced Accuracy : 0.8454          
##                                           
##        'Positive' Class : neg             
## 
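
A note on reading this output: because `table()` was called with `actual_class` first, caret treats the rows (the actual labels) as predictions and the columns (the predicted labels) as the reference. Accuracy and Kappa are unaffected, but Sensitivity and Pos Pred Value end up swapped relative to the usual reading. A minimal base-R sketch that recomputes the headline numbers from the four counts above (the variable names are only for illustration):

```r
# Counts copied from the maxent confusion matrix above
# (rows = actual_class, columns = predicted labels).
cm <- matrix(c(84, 18, 13, 85), nrow = 2,
             dimnames = list(actual = c("neg", "pos"),
                             predicted = c("neg", "pos")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa: agreement corrected for the agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# What caret labels "Sensitivity" here: 84 / (84 + 18), i.e. column-wise.
sens_as_reported <- cm["neg", "neg"] / sum(cm[, "neg"])

# Recall for "neg" with the actual labels as reference: 84 / (84 + 13).
recall_neg <- cm["neg", "neg"] / sum(cm["neg", ])

round(c(accuracy = accuracy, kappa = cohen_kappa,
        sens_as_reported = sens_as_reported, recall_neg = recall_neg), 4)
```

Passing the predictions first, as the later `predict(...) %>% table(actual_class)` calls in this thread do, gives the conventional orientation.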

Compared to the algorithms already implemented, it holds up pretty well, although speed can probably be improved.

# benchmarking
res <- bench::mark(
  nb = textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment")),
  nnseq = quanteda.classifiers::textmodel_nnseq(training_dfm, docvars(training_dfm, "Sentiment")),
  svm = quanteda.classifiers::textmodel_svm(training_dfm, docvars(training_dfm, "Sentiment")),
  maxent = maxent(training_dfm, docvars(training_dfm, "Sentiment")),
  check = FALSE
)
res
## # A tibble: 4 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 nb         184.72ms 204.64ms    5.01     71.18MB   1.67  
## 2 nnseq         1.48m    1.48m    0.0112   453.9MB   0.0112
## 3 svm           2.71s    2.71s    0.369     1.91GB   1.11  
## 4 maxent        6.98s    6.98s    0.143     1.47GB   2.58

The problem is, of course, that maxent was recently removed from CRAN as well and I don’t believe Timothy Jurka will fix this in the future. There are other packages for Multinomial Logistic Regression such as nnet and mlogit but they don’t work well with text data, as far as I can see.

So it would probably be necessary to find a new home for the code, either in a new package or in quanteda.classifiers directly. If there is interest, I could try my hand at a PR (although I don’t know much about Rcpp and it might take a while until I get to it).

Here are the metrics for the other algorithms for comparison on the same example:

```r
# cm nb
predict(res$result[[1]], test_dfm) %>%
  table(actual_class) %>%
  caret::confusionMatrix()
```

    ## Confusion Matrix and Statistics
    ##
    ##      actual_class
    ## .     neg pos
    ##   neg  81  17
    ##   pos  16  86
    ##
    ##                Accuracy : 0.835
    ##                  95% CI : (0.7762, 0.8836)
    ##     No Information Rate : 0.515
    ##     P-Value [Acc > NIR] : <2e-16
    ##
    ##                   Kappa : 0.6698
    ##
    ##  Mcnemar's Test P-Value : 1
    ##
    ##             Sensitivity : 0.8351
    ##             Specificity : 0.8350
    ##          Pos Pred Value : 0.8265
    ##          Neg Pred Value : 0.8431
    ##              Prevalence : 0.4850
    ##          Detection Rate : 0.4050
    ##    Detection Prevalence : 0.4900
    ##       Balanced Accuracy : 0.8350
    ##
    ##        'Positive' Class : neg
    ##

```r
# cm nnseq
predict(res$result[[2]], test_dfm) %>%
  table(actual_class) %>%
  caret::confusionMatrix()
```

    ## Confusion Matrix and Statistics
    ##
    ##      actual_class
    ## .     neg pos
    ##   neg  84  16
    ##   pos  13  87
    ##
    ##                Accuracy : 0.855
    ##                  95% CI : (0.7984, 0.9007)
    ##     No Information Rate : 0.515
    ##     P-Value [Acc > NIR] : <2e-16
    ##
    ##                   Kappa : 0.71
    ##
    ##  Mcnemar's Test P-Value : 0.7103
    ##
    ##             Sensitivity : 0.8660
    ##             Specificity : 0.8447
    ##          Pos Pred Value : 0.8400
    ##          Neg Pred Value : 0.8700
    ##              Prevalence : 0.4850
    ##          Detection Rate : 0.4200
    ##    Detection Prevalence : 0.5000
    ##       Balanced Accuracy : 0.8553
    ##
    ##        'Positive' Class : neg
    ##

```r
# cm svm
predict(res$result[[3]], test_dfm) %>%
  table(actual_class) %>%
  caret::confusionMatrix()
```

    ## Confusion Matrix and Statistics
    ##
    ##      actual_class
    ## .     neg pos
    ##   neg  80  19
    ##   pos  17  84
    ##
    ##                Accuracy : 0.82
    ##                  95% CI : (0.7596, 0.8706)
    ##     No Information Rate : 0.515
    ##     P-Value [Acc > NIR] : <2e-16
    ##
    ##                   Kappa : 0.6399
    ##
    ##  Mcnemar's Test P-Value : 0.8676
    ##
    ##             Sensitivity : 0.8247
    ##             Specificity : 0.8155
    ##          Pos Pred Value : 0.8081
    ##          Neg Pred Value : 0.8317
    ##              Prevalence : 0.4850
    ##          Detection Rate : 0.4000
    ##    Detection Prevalence : 0.4950
    ##       Balanced Accuracy : 0.8201
    ##
    ##        'Positive' Class : neg
    ##

kbenoit commented 4 years ago

Good suggestion, thanks @JBGruber. I hope to work on this for a week at the end of July, will put it on the list for then.

kbenoit commented 4 years ago

@JBGruber I'm finally moving forward on this, if you want to try a glmnet wrapper for logistic regression I think this would be a great addition. (Also I added an issue #20)

JBGruber commented 4 years ago

That’s great! I would be happy to give a PR a go, @kbenoit. But just so we are on the same page, the wrapper would basically be for something like the code below (with the same pre-processing as above):

library(glmnet)
doMC::registerDoMC(cores = quanteda::quanteda_options("threads")) # for parallel = TRUE to work

model <- cv.glmnet(
  x = training_dfm,
  y = docvars(training_dfm, "Sentiment"),
  family = "binomial", # "multinomial" for >2 classes
  alpha = 1,
  nfolds = 10,
  type.measure = "auc",
  maxit = 10000,
  parallel = TRUE
)

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(model, newx = test_dfm, s = "lambda.min", type = "class")[, 1]

table(actual_class, predicted_class) %>% 
  caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class neg pos
##          neg  79  18
##          pos  13  90
##                                           
##                Accuracy : 0.845           
##                  95% CI : (0.7873, 0.8922)
##     No Information Rate : 0.54            
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6893          
##                                           
##  Mcnemar's Test P-Value : 0.4725          
##                                           
##             Sensitivity : 0.8587          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.8738          
##              Prevalence : 0.4600          
##          Detection Rate : 0.3950          
##    Detection Prevalence : 0.4850          
##       Balanced Accuracy : 0.8460          
##                                           
##        'Positive' Class : neg             
## 

After some experimenting it seems maxent is usually still predicting classes a little better than glmnet. That might be because the default settings in maxent work a bit better, in which case I would try to tweak them a bit.

On the other hand, glmnet is quite a bit faster than maxent, especially if run in parallel. For that, the doMC package is needed though, which is not a dependency so far. (It also does not work on Windows.)
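
For what it's worth, `parallel = TRUE` in glmnet goes through foreach, so any registered foreach backend should do. A minimal sketch of a setup that also runs on Windows, assuming the doParallel package (not used anywhere else in this thread) is installed; the `requireNamespace()` guard and the variable names are just for illustration:

```r
# Assumption: doParallel is an extra package, not a current dependency of
# quanteda.classifiers. The guard lets this run even when it is missing.
n_cores <- parallel::detectCores(logical = FALSE)
if (is.na(n_cores)) n_cores <- 1L      # detectCores() can return NA
n_cores <- max(1L, n_cores - 1L)       # leave one core free

backend_ready <- FALSE
if (requireNamespace("doParallel", quietly = TRUE)) {
  cl <- parallel::makeCluster(n_cores) # socket cluster, works on all platforms
  doParallel::registerDoParallel(cl)
  backend_ready <- TRUE

  # cv.glmnet(..., parallel = TRUE) would be called here.

  parallel::stopCluster(cl)
  foreach::registerDoSEQ()             # reset to the sequential backend
}
```

Whether a socket cluster is as fast as the forked workers from doMC on this data is something I have not measured.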

Benchmark against other algorithms:

```r
res <- bench::mark(
  nb = textmodel_nb(training_dfm, docvars(training_dfm, "Sentiment")),
  svm = quanteda.classifiers::textmodel_svm(training_dfm, docvars(training_dfm, "Sentiment")),
  maxent = maxent::maxent(training_dfm, docvars(training_dfm, "Sentiment")),
  glmnet = cv.glmnet(
    x = training_dfm,
    y = docvars(training_dfm, "Sentiment"),
    family = "binomial",
    alpha = 1,
    nfolds = 5,
    type.measure = "class",
    maxit = 10000
  ),
  glmnet_parallel = cv.glmnet(
    x = training_dfm,
    y = docvars(training_dfm, "Sentiment"),
    family = "binomial",
    alpha = 1,
    nfolds = 5,
    type.measure = "class",
    maxit = 10000,
    parallel = TRUE
  ),
  check = FALSE,
  filter_gc = FALSE,
  memory = FALSE
)
res
```

    ## # A tibble: 5 x 6
    ##   expression          min   median `itr/sec` mem_alloc `gc/sec`
    ##
    ## 1 nb              189.8ms 225.78ms     4.59         NA    3.06
    ## 2 svm               3.42s    3.42s     0.292        NA    2.34
    ## 3 maxent            6.69s    6.69s     0.149        NA    2.24
    ## 4 glmnet             2.9s     2.9s     0.344        NA    0.344
    ## 5 glmnet_parallel   2.06s    2.06s     0.484        NA    0.969

```r
summary(res, relative = TRUE)
```

    ## Warning: Some expressions had a GC in every iteration; so filtering is disabled.

    ## # A tibble: 5 x 6
    ##   expression        min median `itr/sec` mem_alloc `gc/sec`
    ##
    ## 1 nb                1     1        30.7         NA     8.88
    ## 2 svm              18.0  15.2       1.95        NA     6.78
    ## 3 maxent           35.2  29.6       1           NA     6.51
    ## 4 glmnet           15.3  12.9       2.30        NA     1
    ## 5 glmnet_parallel  10.9   9.14      3.24        NA     2.81

kbenoit commented 4 years ago

Compatibility issues, what a pain. It probably makes sense to try to implement both methods from scratch, although a working wrapper would be good for now. One for penalized logistic regression and one for maxent, if this is actually different (I'm a bit hazy on this model since it seems to be found mainly in ecology). We can wean them from their wrappers later.
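
On whether maxent is actually different: as the opening comment says, maximum entropy classification is another name for multinomial logistic regression, so the model family should be the same and any gap in results would more plausibly come from regularization and optimizer settings. A minimal base-R sketch of that shared form, with made-up toy weights and a hypothetical `softmax()` helper:

```r
# Softmax over class scores: the conditional maxent / multinomial logistic
# form, p(y | x) = exp(w_y . x) / sum_k exp(w_k . x).
softmax <- function(scores) {
  z <- exp(scores - max(scores))  # subtract the max for numerical stability
  z / sum(z)
}

# Toy example (made-up numbers): 3 features, 2 classes.
x <- c(1, 0, 2)                      # feature counts for one document
W <- rbind(neg = c(0.5, -0.2, 0.1),  # one weight vector per class
           pos = c(-0.3, 0.4, 0.6))

p <- softmax(as.vector(W %*% x))
names(p) <- rownames(W)

# With two classes the softmax collapses to the logistic function applied
# to the difference between the two class scores.
p_pos_logistic <- plogis(sum((W["pos", ] - W["neg", ]) * x))

all.equal(unname(p["pos"]), p_pos_logistic)
```

With more than two classes the same `softmax()` applies unchanged, which corresponds to `family = "multinomial"` in the glmnet call earlier in the thread.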