topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

Something is wrong; all the RMSE metric values are missing #502

Closed gtoti closed 8 years ago

gtoti commented 8 years ago

Hello,

I am practicing using caret for regression, and on several occasions I have run into this error message:

Something is wrong; all the RMSE metric values are missing

I don't seem to be able to get to the root of it. It is preventing me from using many available models.

Below, I am reporting 2 examples.

Example 1: In this example, I am using a simple dataset with random NAs added to make it more realistic.

library(ISLR); library(ggplot2); library(caret);

data(Wage)
Wage <- subset(Wage, select=-c(logwage))

# adding some random NAs
n <- nrow(Wage)
Wage[,-11] <- do.call(cbind.data.frame, lapply(Wage[,-11], 
                    function(x) {x[sample(c(1:n),floor(n/10))]<-NA
                    x}))

set.seed(1234)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]

# Create dummy variables (necessary to use preProcess)
dummies <- dummyVars(wage ~ age + jobclass + education, data = training)
dummy_tr <- as.data.frame(predict(dummies, training))
dummy_tr$wage <- training$wage

modFit <- train(wage ~ ., method="treebag", 
                preProc = c("medianImpute"),    
                na.action = na.pass,
                data=dummy_tr)

Example 2: The problem also appears when other models are used, even with complete datasets (no NAs).

library(ISLR); library(ggplot2); library(caret);

data(Wage)
Wage <- subset(Wage, select=-c(logwage))

set.seed(1234)
inTrain <- createDataPartition(y=Wage$wage, p=0.7, list=FALSE)
training <- Wage[inTrain,]
testing <- Wage[-inTrain,]

modFit <- train(wage ~ age + jobclass + education, method="neuralnet", 
                data=training)

I was surprised to see the models struggling with these very simple scenarios. I would appreciate any help in understanding what is causing the problem.

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods  
[10] base     

other attached packages:
 [1] neuralnet_1.33    mboost_2.6-0      stabs_0.5-1       party_1.0-25      strucchange_1.5-1
 [6] sandwich_2.3-4    zoo_1.7-13        modeltools_0.2-21 mvtnorm_1.0-5     kernlab_0.9-24   
[11] e1071_1.6-7       plyr_1.8.4        ipred_0.9-5       rpart_4.1-10      glmnet_2.0-5     
[16] foreach_1.4.3     Matrix_1.2-6      caret_6.0-71      lattice_0.20-33   ggplot2_2.1.0    
[21] ISLR_1.0         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7        compiler_3.3.1     nloptr_1.0.4       iterators_1.0.8    class_7.3-14      
 [6] tools_3.3.1        lme4_1.1-12        nlme_3.1-128       gtable_0.2.0       mgcv_1.8-12       
[11] SparseM_1.72       prodlim_1.5.7      coin_1.1-2         stringr_1.1.0      MatrixModels_0.4-1
[16] nnet_7.3-12        survival_2.39-4    RANN_2.5           multcomp_1.4-6     TH.data_1.0-7     
[21] lava_1.4.4         minqa_1.2.4        reshape2_1.4.1     car_2.1-3          magrittr_1.5      
[26] nnls_1.4           scales_0.4.0       codetools_0.2-14   MASS_7.3-45        splines_3.3.1     
[31] pbkrtest_0.4-6     colorspace_1.2-6   quadprog_1.5-5     quantreg_5.29      stringi_1.1.1     
[36] munsell_0.4.3

zachmayer commented 8 years ago

If you type warnings() you'll see that in both cases the model fit failed for every resample.

You can try fitting a single model to your data by calling getModelInfo('treebag')[['treebag']]$fit and getModelInfo('treebag')[['treebag']]$predict directly, which can help you figure out what's wrong with the model or your data.
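
As a rough illustration of that suggestion, the sketch below calls the stored fit and predict functions on a small toy dataset, so that any error is raised directly instead of being swallowed by resampling. The toy data and the object names (`mod`, `fit`, `preds`) are hypothetical, and the argument values passed to `fit` (for example `wts = NULL`, `param = NULL`) are assumptions about reasonable inputs for a model without tuning parameters, not necessarily what train() passes internally.

```r
library(caret)
library(ipred)  # the treebag fit function calls ipredbagg() from ipred

# Toy regression data (hypothetical; substitute your own predictors and outcome)
set.seed(1)
x <- data.frame(a = rnorm(50), b = rnorm(50))
y <- x$a + rnorm(50)

# Retrieve the model definition that train() uses for method = "treebag"
mod <- getModelInfo("treebag", regex = FALSE)[["treebag"]]

# Call the fit function directly; capture any error instead of stopping
fit <- tryCatch(
  mod$fit(x = x, y = y, wts = NULL, param = NULL,
          lev = NULL, last = TRUE, classProbs = FALSE),
  error = function(e) e
)

if (inherits(fit, "error")) {
  # The underlying error message, without the resampling wrapper
  message("fit failed: ", conditionMessage(fit))
} else {
  preds <- mod$predict(fit, newdata = x)
  print(head(preds))
}
```

Printing `mod$fit` only shows the function body; it has to be called with data, as above, for the underlying error to surface.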

gtoti commented 8 years ago

After a closer look at the warnings, I found out what was bothering the neural network:

Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Resample01: layer1=1, layer2=0, layer3=0 Error in parse(text = x, keep.source = FALSE) : 
  <text>:1:27: unexpected symbol
1: .outcome ~ age+jobclass2. Information
                              ^

As I understand it, this means the model was struggling with the spaces in the column names. After renaming the columns to space-free strings, I was able to produce a model.
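
The parse failure can be reproduced in base R without caret. The following minimal sketch parses the formula text from the warning, then shows that backtick-quoting the name makes it valid. The object names are illustrative, and backtick-quoting is just one possible workaround, not something caret is known to do here.

```r
# Formula text taken from the warning: the space after "2." breaks parsing
bad <- ".outcome ~ age+jobclass2. Information"
bad_result <- tryCatch(parse(text = bad, keep.source = FALSE),
                       error = function(e) e)
inherits(bad_result, "error")  # TRUE

# Backtick-quoting the column name lets the same text parse as a formula
ok <- ".outcome ~ age+`jobclass2. Information`"
ok_formula <- as.formula(ok)
all.vars(ok_formula)
```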

"treebag" was struggling with the same issue, although the error I was getting here was more cryptic (I would not have made the connection with the spaces issue):

Warning messages:
1: In eval(expr, envir, enclos) :
  model fit failed for Resample01: parameter=none Error in `[.data.frame`(m, labs) : undefined columns selected

I did not find trying to fit a single model particularly informative, unless maybe I did not do it right. This is what I typed and the output I got:

> getModelInfo('treebag')[['treebag']]$fit
function(x, y, wts, param, lev, last,classProbs, ...) {
                    theDots <- list(...)
                    if(!any(names(theDots) == "keepX")) theDots$keepX <- FALSE   
                    modelArgs <- c(list(X = x, y = y), theDots)
                    if(!is.null(wts)) modelArgs$weights <- wts   
                    do.call("ipredbagg", modelArgs)
                  }

In the end, "Something is wrong; all the RMSE metric values are missing" can mean that pretty much anything went wrong with the model fitting. Did I get this right?

Thanks for your help. Pieces of caret are slowly but surely falling into place...

zachmayer commented 8 years ago

"Something is wrong; all the RMSE metric values are missing" can mean pretty much anything went wrong with the model fitting, did I get this right?

Yup. It basically always means the model fit failed. Usually this is a problem with the base model itself; sometimes it's a problem with the pre-processing caret does.

@topepo Maybe caret should raise a warning for columns with spaces in them and suggest running make.names(names(x)) on them?
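
A minimal base-R sketch of that idea: detect column names that are not syntactically valid, then repair them with make.names(). The data frame and its column names are hypothetical stand-ins for the dummy-variable columns discussed above.

```r
# Hypothetical data frame with a space in one column name
df <- setNames(data.frame(c(30, 45), c(0, 1)),
               c("age", "jobclass2. Information"))

# Names that make.names() would change are not syntactically valid
invalid <- names(df)[names(df) != make.names(names(df))]
if (length(invalid) > 0) {
  message("Invalid column names: ", paste(invalid, collapse = ", "))
}

# Repair the names; unique = TRUE guards against collisions after repair
names(df) <- make.names(names(df), unique = TRUE)
names(df)  # "age" "jobclass2..Information"
```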