quanteda / quanteda.textmodels

Text scaling and classification models for quanteda

Add textmodel_lr for logistic regression #25

Closed JBGruber closed 4 years ago

JBGruber commented 4 years ago

This PR implements a logistic regression classifier for 2 and >2 classes as discussed in quanteda/quanteda.classifiers#14 and quanteda/quanteda.classifiers#20.

Below I put together a short demo of the functions. I think there are still some open questions regarding the design (e.g., how to treat parallel processing, as it requires another package here). Let me know what you think.

two classes (binomial)

``` r
library(quanteda)
library(quanteda.textmodels)
corp <- data_corpus_moviereviews
set.seed(300)
train_size <- length(quanteda::docnames(corp)) * 0.9
id_train <- sample(quanteda::docnames(corp), size = train_size, replace = FALSE)
# get training set
training_dfm <- corpus_subset(corp, quanteda::docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)
# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !quanteda::docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_match(featnames(training_dfm))
```
``` r
# train model on sentiment
model <- textmodel_lr(training_dfm, docvars(training_dfm, "sentiment"))
model
## 
## Call:
## textmodel_lr.dfm(x = training_dfm, y = docvars(training_dfm, 
##     "sentiment"))
## 
## 1,800 training documents; 30,127 fitted features.
## Method: binomial logistic regression
# predict and evaluate
actual_class <- docvars(test_dfm, "sentiment")
predicted_class <- predict(model, newdata = test_dfm)

table(actual_class, predicted_class) %>% 
    caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class neg pos
##          neg  79  18
##          pos  14  89
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7817, 0.8879)
##     No Information Rate : 0.535           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6793          
##                                           
##  Mcnemar's Test P-Value : 0.5959          
##                                           
##             Sensitivity : 0.8495          
##             Specificity : 0.8318          
##          Pos Pred Value : 0.8144          
##          Neg Pred Value : 0.8641          
##              Prevalence : 0.4650          
##          Detection Rate : 0.3950          
##    Detection Prevalence : 0.4850          
##       Balanced Accuracy : 0.8406          
##                                           
##        'Positive' Class : neg             
## 
```

The coefficients are a dgCMatrix in this case, which I think makes sense given that normally most values will be 0.

``` r
coefs <- coef(model)
coefs %>% 
    head(10)
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                    pos
## (Intercept) -0.4049133
## the          .        
## happi        .        
## bastard      .        
## quick        .        
## movi         .        
## review       .        
## damn         .        
## that         .        
## y2k          .
```
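Since most entries are zero, it can also be handy to filter the sparse matrix down to just the features the model actually selected. A minimal sketch using a toy `dgCMatrix` built with the Matrix package (the `sparseMatrix()` call below is illustrative, with values taken from the output above; with a fitted model you would start from `coef(model)` directly):

``` r
library(Matrix)

# Toy stand-in for coef(model): a one-column sparse matrix holding the
# intercept plus a few features (values copied from the demo output)
coefs <- sparseMatrix(
  i = c(1, 4), j = c(1, 1), x = c(-0.4049133, 1.66),
  dims = c(4, 1),
  dimnames = list(c("(Intercept)", "the", "happi", "efect"), "pos")
)

# keep only rows with a nonzero weight, preserving the sparse class
nonzero <- coefs[coefs[, "pos"] != 0, , drop = FALSE]
nonzero
```

This avoids densifying the full 30,000-feature matrix just to inspect the handful of selected terms.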

Maybe it would make sense to add an example showing that the coefficients can be used to identify which words are most important in the model.

``` r
library(dplyr)
coefs %>%
    as.matrix() %>% 
    as_tibble(rownames = "word") %>% 
    arrange(-pos) %>% 
    head()
## # A tibble: 6 x 2
##   word         pos
##   <chr>      <dbl>
## 1 efect      1.66 
## 2 250        0.991
## 3 neccessari 0.967
## 4 cricket    0.939
## 5 standoff   0.930
## 6 gingrich   0.891
```

more than two classes (multinomial)

``` r
corp <- quanteda.corpora::data_corpus_sotu %>%
    corpus_subset(President %in% c("Trump", "Obama", "Lincoln")) %>%
    corpus_reshape(to = "sentences")
set.seed(1)
train_size <- length(docnames(corp)) * 0.8
id_train <- sample(docnames(corp), size = train_size, replace = FALSE)
# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE)
table(docvars(corp)$President)
## 
## Lincoln   Obama   Trump 
##     939    2949    1448
table(docvars(training_dfm)$President)
## 
## Lincoln   Obama   Trump 
##     749    2357    1162
# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
    dfm(stem = TRUE) %>%
    dfm_match(featnames(training_dfm))
```
``` r
# parallel computing can be turned on if doMC is registered first (undocumented so far)
doMC::registerDoMC(cores = quanteda::quanteda_options("threads"))
# train model on President
model <- textmodel_lr(training_dfm, docvars(training_dfm, "President"), parallel = TRUE)
model
## 
## Call:
## textmodel_lr.dfm(x = training_dfm, y = docvars(training_dfm, 
##     "President"), parallel = TRUE)
## 
## 4,268 training documents; 5,436 fitted features.
## Method: multinomial logistic regression
# predict and evaluate
actual_class <- docvars(test_dfm, "President")
predicted_class <- predict(model, newdata = test_dfm)

table(actual_class, predicted_class) %>% 
    caret::confusionMatrix()
## Confusion Matrix and Statistics
## 
##             predicted_class
## actual_class Lincoln Obama Trump
##      Lincoln     126    55     9
##      Obama         8   533    51
##      Trump         8   129   149
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7566         
##                  95% CI : (0.7297, 0.782)
##     No Information Rate : 0.6713         
##     P-Value [Acc > NIR] : 7.383e-10      
##                                          
##                   Kappa : 0.5588         
##                                          
##  Mcnemar's Test P-Value : 7.261e-15      
## 
## Statistics by Class:
## 
##                      Class: Lincoln Class: Obama Class: Trump
## Sensitivity                  0.8873       0.7434       0.7129
## Specificity                  0.9309       0.8319       0.8405
## Pos Pred Value               0.6632       0.9003       0.5210
## Neg Pred Value               0.9818       0.6134       0.9233
## Prevalence                   0.1330       0.6713       0.1957
## Detection Rate               0.1180       0.4991       0.1395
## Detection Prevalence         0.1779       0.5543       0.2678
## Balanced Accuracy            0.9091       0.7876       0.7767
```

For multinomial classification, coefficients appear side by side:

``` r
coefs <- coef(model)
coefs %>% 
    head(10)
## 10 x 3 sparse Matrix of class "dgCMatrix"
##                   Lincoln     Obama       Trump
## (Intercept)    -1.3759612 0.9993150  0.37664629
## fellow-citizen  2.2492542 .          .         
## of              0.2670290 .         -0.08167894
## the             0.4698888 .          .         
## senat           .         .          .         
## and             .         .          .         
## hous            .         0.8095898  .         
## repres          .         .          .         
## :               .         .          .         
## in              .         .          .
```
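
The "most important words" idea from the binomial example carries over per class here. A minimal sketch using a toy `dgCMatrix` built with the Matrix package (the `sparseMatrix()` call reproduces a few of the values shown above; with a fitted model you would start from `coef(model)`):

``` r
library(Matrix)

# Toy stand-in for the multinomial coef(model), with values
# copied from the head(10) output above
coefs <- sparseMatrix(
  i = c(1, 1, 1, 2, 3, 3, 4, 5),
  j = c(1, 2, 3, 1, 1, 3, 1, 2),
  x = c(-1.3759612, 0.9993150, 0.37664629,
        2.2492542, 0.2670290, -0.08167894, 0.4698888, 0.8095898),
  dims = c(5, 3),
  dimnames = list(
    c("(Intercept)", "fellow-citizen", "of", "the", "hous"),
    c("Lincoln", "Obama", "Trump")
  )
)

# rank features (excluding the intercept) by their weight for one class
lincoln <- coefs[rownames(coefs) != "(Intercept)", "Lincoln"]
head(sort(lincoln, decreasing = TRUE), 3)
```

Repeating this per column gives the most class-indicative terms for each President.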
kbenoit commented 4 years ago

@JBGruber can you enable me to make mods to your PR branch? See https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork

JBGruber commented 4 years ago

You should be able to make mods to the branch as far as I know (and according to the help page you linked). I left the default: [screenshot of the pull request settings]

kbenoit commented 4 years ago

I think if you grant me permission to push to your branch, I can add my edits.

```
(base) kbenoit@KB-Office-iMac quanteda.textmodels % git remote -v
origin  https://github.com/quanteda/quanteda.textmodels.git (fetch)
origin  https://github.com/quanteda/quanteda.textmodels.git (push)
upstream        https://github.com/JBGruber/quanteda.textmodels.git (fetch)
upstream        https://github.com/JBGruber/quanteda.textmodels.git (push)
(base) kbenoit@KB-Office-iMac quanteda.textmodels % git push upstream
Total 0 (delta 0), reused 0 (delta 0)
To https://github.com/JBGruber/quanteda.textmodels.git
 ! [remote rejected] JBGruber-master -> JBGruber-master (permission denied)
error: failed to push some refs to 'https://github.com/JBGruber/quanteda.textmodels.git'
```

JBGruber commented 4 years ago

Ok, you should have an invitation.

JBGruber commented 4 years ago

Thanks for adding the function and thanks for adding me as an author of the package :blush:!

kbenoit commented 4 years ago

Give it a thorough look over, since I made some changes in the final mix but could not easily feed them back to your fork. I think I forked your fork, then issued a PR for your PR, which got complicated! In the future you can just create dev branches on this repo.