topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Formula- vs x/y-interface... performance differences #370

Closed jhagenauer closed 8 years ago

jhagenauer commented 8 years ago

Hi, I train a naive Bayes classifier twice: once with the x/y interface (t1) and once with the formula interface (t2). For evaluation I use 10-fold CV with 200 repeats in order to obtain stable results. Interestingly, the performance differs markedly between the two classifiers (accuracy t1 = 0.42, t2 = 0.34; kappa t1 = 0.12, t2 = 0.07), even though the only difference is the interface used to call train. I guess this problem might be related to factor predictor variables. R version 3.2.3, caret version 6.0-64.

Code for reproduction:

```r
library(caret)
library(evtree)  # for a version of GermanCredit without dummy variables
data(GermanCredit)

tc <- trainControl("repeatedcv", number = 10, repeats = 200)

t1 <- train(x = GermanCredit[, c("savings", "duration", "age", "job")],
            y = GermanCredit[, "status"],
            method = "nb", trControl = tc,
            tuneGrid = expand.grid(usekernel = TRUE, fL = 0))

t2 <- train(status ~ savings + duration + age + job,
            data = GermanCredit,
            method = "nb", trControl = tc,
            tuneGrid = expand.grid(usekernel = TRUE, fL = 0))

print(t1)
print(t2)
```

topepo commented 8 years ago

It's not really a bug.

In 99% of R functions, using the formula method will convert factor predictors to dummy variables. This is because almost all of the modeling functions require numeric representations of the data for computations.
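A minimal base-R sketch of what the formula interface does behind the scenes (the data frame `d` and its columns are made up for illustration): `model.matrix()` expands a factor into 0/1 dummy columns, so the model never sees the factor itself.

```r
# Illustrative data: one factor predictor and one numeric predictor.
d <- data.frame(savings  = factor(c("low", "high", "low")),
                duration = c(6, 12, 24))

# The x/y interface would pass `savings` along as a factor.
# The formula interface instead builds a design matrix:
mm <- model.matrix(~ savings + duration, data = d)
print(mm)
# `savings` has been replaced by a 0/1 indicator column
# (here "savingslow", with "high" as the baseline level).
```

This is the conversion naive Bayes then has to model as if it were numeric data.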

There are some exceptions, including trees, rule-based models, and naive Bayes. For this reason, some functions keep the factors as factors. Examples include klaR::NaiveBayes, randomForest, etc.

train is designed to work with a lot of models and follows the default behavior for factors. In many cases, it won't really matter much (performance-wise). For example, I have yet to find a case with trees where performance is different (although the trees will probably be larger, and training is faster when dummy variables are used).

For naive Bayes, using the formula method with train is probably a bad idea, since it will try to fit a density (parametric or not) to binary data.
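A base-R sketch of why that is a bad idea (the simulated `x` is made up for illustration): with `usekernel = TRUE`, a kernel density is fit to each numeric predictor, and applied to a 0/1 dummy column it smears probability mass over values between 0 and 1 where no data can ever occur.

```r
set.seed(1)
x <- rbinom(100, 1, 0.3)  # a 0/1 dummy column treated as numeric

# Kernel density estimate over a variable that only takes 0 or 1:
dens <- density(x)
# dens assigns nonzero density throughout (0, 1), although the
# data are purely binary. The factor representation would instead
# use the class-conditional frequencies, i.e. table(x) / length(x).
```

With factors kept as factors, naive Bayes estimates simple conditional probability tables, which is the appropriate model for categorical predictors.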