topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

glmnet, lasso, gbm throw error with importance = TRUE #580

Closed dernesa closed 7 years ago

dernesa commented 7 years ago

Dear Max,

I trained a series of regression models with the same settings using caret. For some (glmnet, lasso, gbm), I always got the same error:

#### Minimal, runnable error code:

This:

```r
library(caret)

# Lasso
set.seed(1)
fit <- train(mpg ~ .
             , data = mtcars
             , method = 'lasso'
             , importance = TRUE
)
```

produces this error:

```
      RMSE        Rsquared  
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :3     NA's   :3    
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: There were 26 warnings (use warnings() to see them)
```

After playing around for a while, I found that this is due to the option `importance = TRUE` that I had set in the `train()` call.

#### Minimal, runnable working code:
This:

```r
library(caret)

# Lasso
set.seed(1)
fit <- train(mpg ~ .
             , data = mtcars
             , method = 'lasso'
             # , importance = TRUE
)
fit
```

works just fine:

```
The lasso 

32 samples
10 predictors

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 32, 32, 32, 32, 32, 32, ... 
Resampling results across tuning parameters:

  fraction  RMSE      Rsquared 
  0.1       4.485024  0.7857763
  0.5       3.459266  0.7333778
  0.9       4.338400  0.6375536

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was fraction = 0.5. 
```
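In case it helps other readers: importance can still be obtained after training via caret's `varImp()` instead of passing `importance = TRUE` into `train()`. A minimal sketch (not from the original report), assuming the working lasso fit from above:

```r
library(caret)

# Fit the lasso without any pass-through arguments.
set.seed(1)
fit <- train(mpg ~ .
             , data = mtcars
             , method = 'lasso'
)

# For models without a model-specific importance method, varImp() falls
# back to a filter-based (model-free) measure, so this should not error.
imp <- varImp(fit)
print(imp)
plot(imp)  # dotplot of the importance scores
```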

However, it took me quite a while to find my error, because the error message implied that the models could not cope with the data I fed them (in my real code I had a more complicated setup with NA imputation in the training loop, etc.).

Maybe it is worth fixing: other models deal with the additional `importance = TRUE` just fine.
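For comparison, a sketch (assuming the randomForest package is installed) of a model where the pass-through works, because `randomForest()` itself accepts an `importance` argument:

```r
library(caret)

# importance = TRUE is forwarded through train()'s ... to randomForest(),
# which has a matching argument, so the call runs and also stores
# permutation-based importance in the fitted forest.
set.seed(1)
fit_rf <- train(mpg ~ .
                , data = mtcars
                , method = 'rf'
                , importance = TRUE
)

varImp(fit_rf)  # importance taken from the underlying randomForest object
```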

Thanks for caret!! I use it a lot!

cheers!!

Session Info:

```
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.2

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] elasticnet_1.1  lars_1.2        caret_6.0-73    ggplot2_2.2.1   lattice_0.20-34

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8        knitr_1.15.1       magrittr_1.5       splines_3.3.2      MASS_7.3-45        munsell_0.4.3     
 [7] colorspace_1.3-2   foreach_1.4.3      minqa_1.2.4        stringr_1.1.0      car_2.1-4          plyr_1.8.4        
[13] tools_3.3.2        parallel_3.3.2     nnet_7.3-12        pbkrtest_0.4-6     grid_3.3.2         gtable_0.2.0      
[19] nlme_3.1-128       mgcv_1.8-15        quantreg_5.29      MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-12       
[25] lazyeval_0.2.0     assertthat_0.1     tibble_1.2         Matrix_1.2-7.1     nloptr_1.0.4       reshape2_1.4.2    
[31] ModelMetrics_1.1.0 codetools_0.2-15   stringi_1.1.2      compiler_3.3.2     scales_0.4.1       stats4_3.3.2      
[37] SparseM_1.74    
```
dernesa commented 7 years ago

Furthermore, I get the same error for glmnet and gbm.

It seems the option has to be removed entirely in order not to cause an error.

dernesa commented 7 years ago

I guess I now understand: the `importance` option is only meant for the randomForest routine and is passed through as a `...` argument.

> `...` : arguments passed to the classification or regression routine (such as randomForest). Errors will occur if values for tuning parameters are passed here.

So it seems that glmnet, lasso and gbm do not tolerate unrecognized arguments very well.
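To make the mechanism concrete, a hedged sketch (not from the thread, assuming the randomForest and elasticnet packages are installed): arguments in `train()`'s `...` are handed directly to the underlying fit function, so they only work if that function has a matching argument.

```r
library(caret)

# method = 'rf': ntree is passed through ... to randomForest::randomForest(),
# which accepts it, so this call is fine.
set.seed(1)
ok <- train(mpg ~ ., data = mtcars, method = 'rf', ntree = 500)

# method = 'lasso' is fit by elasticnet::enet(), which has no 'importance'
# argument; passing one makes every resample fail, and train() stops with
# the error shown above.
args(elasticnet::enet)  # lists the arguments enet() actually accepts
```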

topepo commented 7 years ago

Yes, the extra arguments are specific to the modeling function. Sorry for the mixup.