quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Add logistic regression method #25

Closed JBGruber closed 4 years ago

JBGruber commented 4 years ago

This PR implements a logistic regression classifier for two or more classes, as discussed in #14 and #20. I finally got the tests to work after running into some issues with the changes for quanteda 2.0 the last time I tried. The tests now pass locally, but Travis still seems to have the same problems as in #23.

Below I put together a short demo of the functions. Let me know what you think and what needs to be changed.

two classes (binomial)

``` r
library(quanteda.classifiers)
corp <- quanteda.textmodels::data_corpus_moviereviews
set.seed(300)
train_size <- length(docnames(corp)) * 0.9
id_train <- sample(docnames(corp), size = train_size, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_match(featnames(training_dfm))
```
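As a quick sanity check (not part of the PR itself), `dfm_match()` should leave the test set with exactly the training features, in the same order:

``` r
# should be TRUE: dfm_match() aligns the test features to the training features
identical(featnames(training_dfm), featnames(test_dfm))
```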
``` r
# train model on sentiment
model <- textmodel_lr(training_dfm, docvars(training_dfm, "sentiment"))
model
## 
## Call:
## textmodel_lr.dfm(x = training_dfm, y = docvars(training_dfm, 
##     "sentiment"))
## 
## 1,800 training documents; 30,127 fitted features.
## Method: binomial logistic regression
```
``` r
# predict and evaluate
actual_class <- docvars(test_dfm, "sentiment")
predicted_class <- predict(model, newdata = test_dfm)

table(actual_class, predicted_class) %>% 
    caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class neg pos
##          neg  79  18
##          pos  14  89
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7817, 0.8879)
##     No Information Rate : 0.535           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6793          
##                                           
##  Mcnemar's Test P-Value : 0.5959          
##                                           
##             Sensitivity : 0.8495          
##             Specificity : 0.8318          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.8641          
##              Prevalence : 0.4650          
##          Detection Rate : 0.3950          
##    Detection Prevalence : 0.4850          
##       Balanced Accuracy : 0.8406          
##                                           
##        'Positive' Class : neg             
## 
```
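The headline accuracy can also be reproduced without caret; from the table above, (79 + 89) / 200 = 0.84:

``` r
# overall accuracy in base R, no caret needed
mean(predicted_class == actual_class)
## [1] 0.84
```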

The coefficients are returned as a dgCMatrix in this case, which I think makes sense given that most values will normally be 0.

``` r
coefs <- coef(model)
coefs %>% 
    head(10)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                    pos
## (Intercept) -0.4049133
## the          .        
## happi        .        
## bastard      .        
## quick        .        
## movi         .        
## review       .        
## damn         .        
## that         .        
## y2k          .
```
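To quantify how sparse the solution is, one could count the nonzero entries against the total number of rows (a quick check, output omitted):

``` r
# nonzero coefficients vs. rows of the coefficient matrix (features + intercept)
Matrix::nnzero(coefs)
nrow(coefs)
```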

Maybe it would make sense to include an example showing that the coefficients can be used to identify which words are most important for the classifier.

``` r
library(dplyr)
coefs %>%
    as.matrix() %>% 
    as_tibble(rownames = "word") %>% 
    arrange(-pos) %>% 
    head()
## # A tibble: 6 x 2
##   word         pos
##   <chr>      <dbl>
## 1 efect      1.66 
## 2 250        0.991
## 3 neccessari 0.967
## 4 cricket    0.939
## 5 standoff   0.930
## 6 gingrich   0.891
```
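Sorting the same pipeline in ascending order would surface the stems that pull predictions towards the negative class (a sketch along the same lines, output omitted):

``` r
coefs %>%
    as.matrix() %>%
    as_tibble(rownames = "word") %>%
    arrange(pos) %>%  # ascending: most negative coefficients first
    head()
```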

more than two classes (multinomial)

``` r
corp <- quanteda.corpora::data_corpus_sotu %>%
    corpus_subset(President %in% c("Trump", "Obama", "Lincoln")) %>%
    corpus_reshape(to = "sentences")
set.seed(1)
train_size <- length(docnames(corp)) * 0.8
id_train <- sample(docnames(corp), size = train_size, replace = FALSE)

# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)

table(docvars(corp)$President)
## 
## Lincoln   Obama   Trump 
##     939    2949    1448

table(docvars(training_dfm)$President)
## 
## Lincoln   Obama   Trump 
##     749    2357    1162

# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_match(featnames(training_dfm))
```
``` r
# parallel computing can be turned on if doMC is registered first (undocumented so far)
doMC::registerDoMC(cores = quanteda::quanteda_options("threads"))

# train model on President
model <- textmodel_lr(training_dfm, docvars(training_dfm, "President"), parallel = TRUE)
model
## 
## Call:
## textmodel_lr.dfm(x = training_dfm, y = docvars(training_dfm, 
##     "President"), parallel = TRUE)
## 
## 4,268 training documents; 5,436 fitted features.
## Method: multinomial logistic regression
```
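To confirm the parallel backend is actually registered, foreach (which doMC builds on) reports the number of workers (a quick check, output omitted):

``` r
# number of registered parallel workers; 1 would mean sequential execution
foreach::getDoParWorkers()
```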
``` r
# predict and evaluate
actual_class <- docvars(test_dfm, "President")
predicted_class <- predict(model, newdata = test_dfm)

table(actual_class, predicted_class) %>% 
    caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class Lincoln Obama Trump
##      Lincoln     126    55     9
##      Obama         8   533    51
##      Trump         8   129   149
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7566         
##                  95% CI : (0.7297, 0.782)
##     No Information Rate : 0.6713         
##     P-Value [Acc > NIR] : 7.383e-10      
##                                          
##                   Kappa : 0.5588         
##                                          
##  Mcnemar's Test P-Value : 7.261e-15      
## 
## Statistics by Class:
## 
##                      Class: Lincoln Class: Obama Class: Trump
## Sensitivity                  0.8873       0.7434       0.7129
## Specificity                  0.9309       0.8319       0.8405
## Pos Pred Value               0.6632       0.9003       0.5210
## Neg Pred Value               0.9818       0.6134       0.9233
## Prevalence                   0.1330       0.6713       0.1957
## Detection Rate               0.1180       0.4991       0.1395
## Detection Prevalence         0.1779       0.5543       0.2678
## Balanced Accuracy            0.9091       0.7876       0.7767
```

For multinomial classification, the coefficients for each class appear side by side:

``` r
coefs <- coef(model)
coefs %>% 
    head(10)
## 10 x 3 sparse Matrix of class "dgCMatrix"
##                   Lincoln     Obama       Trump
## (Intercept)    -1.3759612 0.9993150  0.37664629
## fellow-citizen  2.2492542 .          .         
## of              0.2670290 .         -0.08167894
## the             0.4698888 .          .         
## senat           .         .          .         
## and             .         .          .         
## hous            .         0.8095898  .         
## repres          .         .          .         
## :               .         .          .         
## in              .         .          .
```
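The per-class word-importance trick from the binomial example carries over directly; for instance, the stems most strongly associated with Lincoln (sketch, output omitted):

``` r
coefs %>%
    as.matrix() %>%
    as_tibble(rownames = "word") %>%
    arrange(-Lincoln) %>%  # column names correspond to the classes
    head()
```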
kbenoit commented 4 years ago

Fantastic! It will take me a few days to review this (grading now...), but could I make a request? This really belongs in quanteda.textmodels. We're going to keep the classifiers package mostly for internal use. The SVM code, for instance, has already been moved to quanteda.textmodels.

JBGruber commented 4 years ago

Ah, I was wondering about this, since many functions have been moved there while this repo is still active. I don't mind making a new (identical) PR in quanteda.textmodels.

kbenoit commented 4 years ago

Thanks @JBGruber, that would be great. We're keeping the keras stuff in here since it remains experimental; I'll add a note to this effect soon.

JBGruber commented 4 years ago

Ok, I copied the functions and tests over to quanteda.textmodels and created a PR there: https://github.com/quanteda/quanteda.textmodels/pull/25