Closed jhagenauer closed 8 years ago
It's not really a bug.
In 99% of R functions, using the formula method will convert factor predictors to dummy variables. This is because almost all of the modeling functions require numeric representations of the data for computations.
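As a quick illustration of that expansion (a minimal sketch using the built-in iris data, which is not part of this issue):

```r
# The formula machinery runs predictors through model.matrix(), which
# expands a factor into 0/1 indicator columns. Species is a 3-level factor.
mm <- model.matrix(~ Species, data = iris)
head(mm)
# The single Species factor becomes two 0/1 indicator columns (plus the
# intercept); one level is absorbed into the baseline.
```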
There are some exceptions, including trees, rule-based models, and naive Bayes. For this reason, some functions keep the factors as factors; examples include klaR:::naiveBayes, randomForest, etc.
train is designed to work with a lot of models and follows the default behavior for factors. In many cases it won't really matter much performance-wise. For example, I have yet to find a case with trees where performance differs (although the trees will probably be larger, and training those models takes less time when dummy variables are used).
For naive Bayes, using the formula method with train is probably a bad idea, since it will be trying to fit a density (parametric or not) to binary data.
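A minimal sketch of the mechanism (the toy factor below is illustrative, not from the issue): the formula interface runs predictors through model.matrix(), so a factor arrives at the model as numeric 0/1 columns, while the x/y interface passes the factor through untouched.

```r
f <- factor(c("low", "mid", "high", "mid"))

# What the formula method hands to the model: numeric 0/1 indicator
# columns (intercept dropped here for clarity).
x_formula <- model.matrix(~ f)[, -1]
str(x_formula)

# What the x/y method hands to the model: the original factor.
str(f)
```

A kernel density estimate over those 0/1 columns is what the reply above warns against.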
Hi, I train a naive Bayes classifier, one time with the formula interface (t2) and one time with the x/y interface (t1). For evaluation I use 10-fold CV with 200 repeats in order to obtain stable results. Interestingly, the performance is very different between the two classifiers (accuracy t1=0.42, t2=0.34; kappa t1=0.12, t2=0.07), though the only difference lies in the interface used to call the train function. I guess this problem might be related to factor predictor variables. R version 3.2.3, caret version 6.0-64.
Code for reproduction:
library(caret)
library(evtree) # for version of GermanCredit without dummy-vars
data(GermanCredit)
tc <- trainControl("repeatedcv", number = 10, repeats = 200)
t1 <- train(x = GermanCredit[, c("savings", "duration", "age", "job")],
            y = GermanCredit[, "status"],
            method = "nb", trControl = tc,
            tuneGrid = expand.grid(usekernel = TRUE, fL = 0))
t2 <- train(status ~ savings + duration + age + job, data = GermanCredit,
            method = "nb", trControl = tc,
            tuneGrid = expand.grid(usekernel = TRUE, fL = 0))
print(t1)
print(t2)