JBGruber opened this issue 5 years ago
Good suggestion, thanks @JBGruber. I hope to work on this for a week at the end of July, will put it on the list for then.
@JBGruber I'm finally moving forward on this, if you want to try a glmnet wrapper for logistic regression I think this would be a great addition. (Also I added an issue #20)
That's great! I would be happy to give a PR a go, @kbenoit. But just so we are on the same page, the wrapper would basically be for something like the code below (with the same pre-processing as above):
library(glmnet)
library(magrittr)  # for the %>% pipe used below
doMC::registerDoMC(cores = quanteda::quanteda_options("threads"))  # needed for parallel = TRUE

model <- cv.glmnet(
  x = training_dfm,
  y = docvars(training_dfm, "Sentiment"),
  family = "binomial",   # use "multinomial" for > 2 classes
  alpha = 1,             # lasso penalty
  nfolds = 10,
  type.measure = "auc",
  maxit = 10000,
  parallel = TRUE
)

# predict and evaluate
actual_class <- docvars(test_dfm, "Sentiment")
predicted_class <- predict(model, newx = test_dfm, s = "lambda.min", type = "class")[, 1]
table(actual_class, predicted_class) %>%
  caret::confusionMatrix()
## Confusion Matrix and Statistics
##
## predicted_class
## actual_class neg pos
## neg 79 18
## pos 13 90
##
## Accuracy : 0.845
## 95% CI : (0.7873, 0.8922)
## No Information Rate : 0.54
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6893
##
## Mcnemar's Test P-Value : 0.4725
##
## Sensitivity : 0.8587
## Specificity : 0.8333
## Pos Pred Value : 0.8144
## Neg Pred Value : 0.8738
## Prevalence : 0.4600
## Detection Rate : 0.3950
## Detection Prevalence : 0.4850
## Balanced Accuracy : 0.8460
##
## 'Positive' Class : neg
##
After some experimenting, it seems maxent is usually still predicting classes a little better than glmnet. That might be because the default settings in maxent work a bit better, in which case I would try to tweak glmnet's a bit. On the other hand, glmnet is quite a bit faster than maxent, especially if run in parallel. For that, the doMC package is needed though, which is not a dependency so far. (It also does not work on Windows.)
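If Windows support is the blocker, one possible workaround is the doParallel backend, which (unlike doMC) is cross-platform and also satisfies glmnet's `parallel = TRUE`. A minimal sketch, assuming doParallel is installed:

```r
# Sketch only: doParallel is a cross-platform foreach backend,
# so this setup should also work on Windows.
library(doParallel)

cl <- makeCluster(quanteda::quanteda_options("threads"))
registerDoParallel(cl)

# ... run cv.glmnet(..., parallel = TRUE) here ...

stopCluster(cl)  # release the workers when done
```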
Compatibility issues, what a pain. It probably makes sense to try to implement both methods from scratch, although a working wrapper would be good for now: one for penalized logistic regression and one for maxent, if that is actually a different model (I'm a bit hazy on it, since it seems to be found mainly in ecology). We can wean them from their wrappers later.
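On whether maxent is actually different: maximum entropy classification is equivalent to multinomial logistic regression, so glmnet can approximate it by fitting the multinomial family with a (near-)zero penalty. A sketch, assuming the same dfms as above:

```r
# Sketch: multinomial logistic regression (the model maxent fits) via
# glmnet, with lambda = 0 to approximate the unpenalized fit.
library(glmnet)

multi_mod <- glmnet(
  x = training_dfm,
  y = docvars(training_dfm, "Sentiment"),
  family = "multinomial",
  lambda = 0  # no penalty -> plain maximum-likelihood multinomial logit
)
pred <- predict(multi_mod, newx = test_dfm, type = "class")
```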
First of all: thanks for this great package! Since RTextTools was recently removed from CRAN, I was trying to find a good solution for SML on text data in R and was a bit frustrated by caret, which is not optimized for text. So I think quanteda.classifiers fills a real gap in the R ecosystem right now.

One thing that I was missing so far, though, is an implementation of multinomial logistic regression, also known as maximum entropy. In my experience it often outperforms other algorithms (especially for multi-class classification). Furthermore, the implementation by Timothy Jurka (Jurka 2012) already works pretty well with the 'quantedaverse'. Here is a quick example with the movies corpus:
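Roughly like this (a sketch only, since the original example is not reproduced here: it assumes the archived maxent package is installed, dfms prepared as above, and that `maxent()` / `predict()` keep the argument names I remember from the package docs):

```r
# Sketch: train Jurka's maxent classifier on the training dfm and
# predict the held-out documents. Argument names are from memory.
library(maxent)

mod <- maxent(
  feature_matrix = as.matrix(training_dfm),
  code_vector = docvars(training_dfm, "Sentiment")
)

pred <- predict(mod, as.matrix(test_dfm))
table(docvars(test_dfm, "Sentiment"), pred[, "labels"])
```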
Compared to the algorithms already implemented, it holds up pretty well, although speed can probably be improved.
The problem is, of course, that maxent was recently removed from CRAN as well, and I don't believe Timothy Jurka will fix this in the future. There are other packages for multinomial logistic regression, such as nnet and mlogit, but they don't work well with text data, as far as I can see. So it would probably be necessary to find a new home for the code, either in a new package or in quanteda.classifiers directly. If there is interest, I could try myself on a PR (although I don't know much about Rcpp and it might take a while until I get to it).

Here are the metrics for the other algorithms for comparison on the same example: