Closed pchest closed 4 years ago
OK, I did some significant work here, simplifying the code for performance and making it more general and robust. I did this because we will merge this eventually into quanteda.textmodels (adding @pchest as an author of course - don't mind my extensive re-writing here, I do that to almost everyone - and @koheiw does it to me 😄).
performance()

See ?performance. It works this way now:
library("quanteda.classifiers")
## Loading required package: quanteda
## Package version: 2.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library("quanteda.textmodels")
##
## Attaching package: 'quanteda.textmodels'
## The following object is masked from 'package:quanteda.classifiers':
##
## data_corpus_EPcoaldebate
dfmat <- dfm(data_corpus_moviereviews)
performance.textmodel <- function(data, truth = NULL, ...) {
if (is.null(truth)) truth <- data$y
performance(predict(data, type = "class"), truth, ...)
}
performance(textmodel_svmlin(dfmat, y = data_corpus_moviereviews$sentiment))
## $precision
## neg pos
## 0.5161840 0.6192469
##
## $recall
## neg pos
## 0.909 0.148
##
## $f1
## neg pos
## 1.403328 2.606723
##
## $accuracy
## [1] 0.5285
##
## $balanced_accuracy
## [1] 0.5285
performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment))
## $precision
## neg pos
## 0.9628906 0.9856557
##
## $recall
## neg pos
## 0.986 0.962
##
## $f1
## neg pos
## 1.026225 1.026876
##
## $accuracy
## [1] 0.974
##
## $balanced_accuracy
## [1] 0.974
performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment),
by_class = FALSE
)
## $precision
## [1] 0.9742732
##
## $recall
## [1] 0.974
##
## $f1
## [1] 1.02655
##
## $accuracy
## [1] 0.974
##
## $balanced_accuracy
## [1] 0.974
# put into a data.frame
performance(textmodel_nb(dfmat, y = data_corpus_moviereviews$sentiment)) %>%
data.frame()
## precision recall f1 accuracy balanced_accuracy
## neg 0.9628906 0.986 1.026225 0.974 0.974
## pos 0.9856557 0.962 1.026876 0.974 0.974
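For reference, here is a minimal sketch of how the standard per-class metrics can be computed from a confusion matrix (an illustration of the textbook definitions, not the package's actual implementation; `performance_sketch` is a hypothetical name). Note that the harmonic-mean form keeps F1 within [0, 1]:

```r
# illustrative only: standard per-class metrics from a confusion matrix
performance_sketch <- function(predicted, truth) {
  tab <- table(predicted = predicted, truth = truth)
  precision <- diag(tab) / rowSums(tab)  # TP / (TP + FP), per class
  recall    <- diag(tab) / colSums(tab)  # TP / (TP + FN), per class
  f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean
  list(
    precision = precision,
    recall    = recall,
    f1        = f1,
    accuracy  = sum(diag(tab)) / sum(tab)
  )
}
```

`predicted` and `truth` are assumed to be factors with the same levels, so the confusion matrix is square and its diagonal holds the true positives.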
We want methods to perform cross-validation. We don't want to call it textmodel_evaluate(), since that naming convention suggests a new, original textmodel. Instead, this should take a fitted model as input: say, an evaluate() (as above), or a crossval() that does the folds. See the code above.
@pchest an alternative would be to create two functions:

- crossval() as is, but it calls validate() five times
- validate(x, split), where split is 0 or 1: 0 means the observation is used to fit the model, 1 means it is predicted

So crossval() can do the folds, but validate() can be used with any input for a split. This allows users to use caret or other package functions to do the splits, in case they want stratification or anything fancier.
Added an evaluation function that supports dfm and tokens inputs. I also fixed a minor bug in tokens2sequences() and made it compatible with character inputs.