topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

error in modeling with gbm and tuning param with caret: all the RMSE metric values are missing #568

Closed jianqin123 closed 7 years ago

jianqin123 commented 7 years ago

I'm trying to create a binary classifier with caret, optimizing RMSE. The method I am attempting is gbm. I use the spam data (provided by the kernlab package) and try to train a regression model with gbm. My code follows.

library(caret)    # provides train() and trainControl()
library(kernlab)  # provides the spam data
library(e1071)
data(spam)
spam1 <- spam
# Convert the two-level factor to numeric 0/1 (nonspam = 0, spam = 1)
spam1$type <- as.numeric(spam1$type) - 1
fitCon <- trainControl(method = "repeatedcv",
                       number = 5, repeats = 3, returnResamp = "all")

gbmGrid <- expand.grid(.interaction.depth = c(1, 3),
                       .n.trees = c(50),
                       .shrinkage = 0.1, .n.minobsinnode = 2)

spamModel <- train(type ~ ., spam1, method = "gbm",
                   trControl = fitCon, tuneGrid = gbmGrid,
                   distribution = "bernoulli",
                   bag.fraction = 0.5,
                   # nTrain = 0.5,
                   # metric = "RMSE",
                   # maximize = TRUE,
                   # allowParallel = FALSE,
                   verbose = TRUE)

This is the error output:

Something is wrong; all the RMSE metric values are missing:
      RMSE        Rsquared  
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :2     NA's   :2    
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
  You are trying to do regression and your outcome only has two possible values Are you trying to do classification? If so, use a 2 level factor as your outcome column.
2: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] e1071_1.6-7     kernlab_0.9-25  plyr_1.8.4      gbm_2.1.1       survival_2.39-5 caret_6.0-73   
[7] ggplot2_2.2.1   lattice_0.20-34

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8        magrittr_1.5       MASS_7.3-45        munsell_0.4.3      colorspace_1.3-2  
 [6] foreach_1.4.3      minqa_1.2.4        stringr_1.1.0      car_2.1-4          tools_3.3.2       
[11] nnet_7.3-12        pbkrtest_0.4-6     grid_3.3.2         gtable_0.2.0       nlme_3.1-128      
[16] mgcv_1.8-15        quantreg_5.29      class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8   
[21] lme4_1.1-12        lazyeval_0.2.0     assertthat_0.1     tibble_1.2         Matrix_1.2-7.1    
[26] nloptr_1.0.4       reshape2_1.4.2     ModelMetrics_1.1.0 codetools_0.2-15   stringi_1.1.2     
[31] compiler_3.3.2     scales_0.4.1       stats4_3.3.2       SparseM_1.74 

I asked this question on Stack Overflow (http://stackoverflow.com/questions/41501369/error-in-train-default-stopping), and there are some similar questions on Stack Overflow and on this site, but none of the suggested fixes worked for me.

If I change the response variable (type) to a factor, it works, but then it is a classification model, which is not what I want. Thank you in advance!

samuel-rosa commented 7 years ago

I think that if you are doing classification, your target variable must be a factor, in this case a two-level factor. See the first warning in your output.

jianqin123 commented 7 years ago

If I change it into a factor, how can I get a regression model? That is what I want. Thank you!

samuel-rosa commented 7 years ago

I do not understand. In your previous post you said that you are trying to build a binary classifier, but now you say that you want a regression model.

jianqin123 commented 7 years ago

Sorry about that. I mean a regression model that gives the probability of the positive class, which is also a binary classifier. That's what I want.

samuel-rosa commented 7 years ago

I see. Perhaps the caret internals won't let you do that. Have you tried using gbm directly?
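
For reference, here is a minimal sketch of what fitting gbm directly, without caret, might look like. It reuses the 0/1 recoding and the tuning values from the original post. The object names (fit, prob, rmse), the seed, and the choice of interaction.depth = 3 (one of the two grid values) are assumptions made for this illustration. The RMSE is computed on the training data, so it is optimistic and not a substitute for resampling.

```r
library(gbm)
library(kernlab)  # provides the spam data

data(spam)
spam1 <- spam
# Recode the two-level factor as 0/1 (nonspam = 0, spam = 1)
spam1$type <- as.numeric(spam1$type) - 1

set.seed(1)  # arbitrary seed, for reproducibility only
fit <- gbm(type ~ ., data = spam1,
           distribution = "bernoulli",
           n.trees = 50,
           interaction.depth = 3,
           shrinkage = 0.1,
           n.minobsinnode = 2,
           bag.fraction = 0.5,
           verbose = FALSE)

# type = "response" returns the predicted probability of the class coded 1
prob <- predict(fit, newdata = spam1, n.trees = 50, type = "response")

# RMSE between predicted probabilities and the 0/1 outcome (training data)
rmse <- sqrt(mean((prob - spam1$type)^2))
rmse
```

With this route the resampling and tuning that caret would otherwise handle have to be written by hand.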

topepo commented 7 years ago

You don't need to pass the distribution argument. You can, but when the outcome is a binary factor, caret sets it for you.
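
A sketch of the factor-outcome route, assuming the unmodified spam data from kernlab, whose type column is already a two-level factor (nonspam, spam). No distribution argument is passed, and predict() with type = "prob" returns the class probabilities asked about earlier in the thread. The 5-fold control object and the names ctrl, grid, clsModel, and probs are illustrative choices, not taken from the thread.

```r
library(caret)
library(kernlab)  # provides the spam data

data(spam)  # spam$type is a factor with levels "nonspam" and "spam"

set.seed(1)  # arbitrary seed, for reproducibility only
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
grid <- expand.grid(interaction.depth = c(1, 3),
                    n.trees = 50,
                    shrinkage = 0.1,
                    n.minobsinnode = 2)

# No distribution argument: the outcome is a two-level factor
clsModel <- train(type ~ ., data = spam, method = "gbm",
                  trControl = ctrl, tuneGrid = grid,
                  verbose = FALSE)

# One column of probabilities per class level
probs <- predict(clsModel, newdata = head(spam), type = "prob")
probs$spam  # predicted probability of the "spam" class
```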

To be honest, you can optimize the model with RMSE but it is a pretty bad objective function for classification.

If you really want to do this, I suggest:

  1. make your outcome a factor. I also suggest making the levels something different than "0" and "1".
  2. write your own summary metric that computes RMSE. You can use the as.numeric()-1 trick in your original code to get the outcome back to 0/1.
  3. add classProbs = TRUE to the trainControl call if you are going to use the predicted class probabilities in the RMSE calculation.
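
The three steps above can be sketched as follows. The function name probRMSE and the metric name "ProbRMSE" are hypothetical, chosen for this illustration. The metric is deliberately not called "RMSE" because, as far as I know, caret rejects that built-in name when the outcome is a factor. The sketch also assumes that the second factor level ("spam") is the positive class.

```r
library(caret)
library(kernlab)  # provides the spam data

# Hypothetical summary function. caret passes a data frame with columns
# `obs` and `pred` and, when classProbs = TRUE, one probability column per
# class level; `lev` holds the class levels.
probRMSE <- function(data, lev = NULL, model = NULL) {
  positive <- lev[2]                 # assumption: second level is the positive class
  truth <- as.numeric(data$obs) - 1  # factor back to 0/1 (first level = 0)
  c(ProbRMSE = sqrt(mean((data[[positive]] - truth)^2)))
}

# Quick check on a hand-made data frame
lv <- c("nonspam", "spam")
toy <- data.frame(obs     = factor(c("nonspam", "spam", "spam", "nonspam"), levels = lv),
                  pred    = factor(c("nonspam", "spam", "nonspam", "nonspam"), levels = lv),
                  nonspam = c(0.9, 0.2, 0.6, 0.7),
                  spam    = c(0.1, 0.8, 0.4, 0.3))
toyRMSE <- probRMSE(toy, lev = lv)   # sqrt(0.125), about 0.354

# Steps 1 and 3: factor outcome with non-numeric level names, classProbs = TRUE
data(spam)
set.seed(1)  # arbitrary seed, for reproducibility only
fitCon <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                       classProbs = TRUE, summaryFunction = probRMSE)
gbmGrid <- expand.grid(interaction.depth = c(1, 3), n.trees = 50,
                       shrinkage = 0.1, n.minobsinnode = 2)

# Step 2: tune on the custom metric; lower is better, so maximize = FALSE
spamModel <- train(type ~ ., data = spam, method = "gbm",
                   trControl = fitCon, tuneGrid = gbmGrid,
                   metric = "ProbRMSE", maximize = FALSE,
                   verbose = FALSE)
spamModel$results
```

Because the metric name is not one caret knows, maximize = FALSE has to be set explicitly; otherwise the tuning would pick the largest value.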

jianqin123 commented 7 years ago

Thanks a lot!

topepo commented 7 years ago

I'll close this but re-open it if you need to.