quanteda / quanteda.classifiers

quanteda textmodel extensions for classifying documents

Dev cross #24

Closed · pchest closed this 4 years ago

pchest commented 4 years ago

Added an evaluation function that supports dfm and tokens inputs. I also fixed a minor bug in tokens2sequences() and made it compatible with character inputs.
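To illustrate the character-input change, here is a hedged sketch; the example texts are made up, and the maxsenlen argument is assumed to be unchanged from the existing tokens2sequences() interface:

# Hedged illustration of the change described above: tokens2sequences() accepting
# both tokens and raw character input (maxsenlen argument assumed unchanged)
library("quanteda")
library("quanteda.classifiers")
txt <- c(d1 = "this movie was great", d2 = "this movie was terrible")
seq_toks <- tokens2sequences(tokens(txt), maxsenlen = 10)
seq_char <- tokens2sequences(txt, maxsenlen = 10)  # character input now works too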

kbenoit commented 4 years ago

OK, I did some significant work here, simplifying the code for performance and making it more general and robust. I did this because we will eventually merge this into quanteda.textmodels (adding @pchest as an author, of course - don't mind my extensive rewriting here, I do that to almost everyone - and @koheiw does it to me 😄).

Rewritten, new function: performance()

See ?performance. It works this way now:

library("quanteda.classifiers")
## Loading required package: quanteda
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library("quanteda.textmodels")
## 
## Attaching package: 'quanteda.textmodels'
## The following object is masked from 'package:quanteda.classifiers':
## 
##     data_corpus_EPcoaldebate

dfmat <- dfm(data_corpus_moviereviews)

# S3 method: evaluate a fitted textmodel against the truth (defaults to the model's own y)
performance.textmodel <- function(data, truth = NULL, ...) {
  if (is.null(truth)) truth <- data$y
  performance(predict(data, type = "class"), truth, ...)
}

performance(textmodel_svmlin(dfmat, y = data_corpus_moviereviews$sentiment))
## $precision
##       neg       pos 
## 0.5161840 0.6192469 
## 
## $recall
##   neg   pos 
## 0.909 0.148 
## 
## $f1
##      neg      pos 
## 1.403328 2.606723 
## 
## $accuracy
## [1] 0.5285
## 
## $balanced_accuracy
## [1] 0.5285

performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment))
## $precision
##       neg       pos 
## 0.9628906 0.9856557 
## 
## $recall
##   neg   pos 
## 0.986 0.962 
## 
## $f1
##      neg      pos 
## 1.026225 1.026876 
## 
## $accuracy
## [1] 0.974
## 
## $balanced_accuracy
## [1] 0.974

performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment),
  by_class = FALSE
)
## $precision
## [1] 0.9742732
## 
## $recall
## [1] 0.974
## 
## $f1
## [1] 1.02655
## 
## $accuracy
## [1] 0.974
## 
## $balanced_accuracy
## [1] 0.974

# put into a data.frame
performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)) %>%
  data.frame()
##     precision recall       f1 accuracy balanced_accuracy
## neg 0.9628906  0.986 1.026225    0.974             0.974
## pos 0.9856557  0.962 1.026876    0.974             0.974

Suggestions for cross-validation functions

We want methods to perform cross-validation. We don't want to call it textmodel_evaluate(), since that naming convention suggests a new, original textmodel. Instead, this should take a fitted model, in the way performance() does above, or be a separate crossval() function.

See the code above.
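To make the suggestion concrete, here is a rough sketch of what such a crossval() helper could look like; the name, arguments, and fold logic are illustrative assumptions, not part of the package:

# Illustrative sketch only: k-fold cross-validation built on predict() and
# performance(); the crossval() name and its arguments are assumptions, not package API
crossval <- function(x, y, k = 5, model = textmodel_nb, ...) {
  folds <- sample(rep_len(seq_len(k), ndoc(x)))
  scores <- lapply(seq_len(k), function(i) {
    fitted <- model(x[folds != i, ], y = y[folds != i], ...)
    pred <- predict(fitted, newdata = x[folds == i, ], type = "class")
    unlist(performance(pred, y[folds == i], by_class = FALSE))
  })
  colMeans(do.call(rbind, scores))  # average the scalar measures across folds
}

# e.g. crossval(dfmat, data_corpus_moviereviews$sentiment, k = 5)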

kbenoit commented 4 years ago

@pchest an alternative would be to create two functions: