topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE)': undefined columns selected #904

Closed marioem closed 6 years ago

marioem commented 6 years ago

Hi,

I'm trying to re-run a script that worked perfectly for me two years ago, but now I'm getting this error from train. I'm using caret 6.0-80, reinstalled from GitHub today.

Minimal dataset (setup and download):

#library(VIM)
library(ggplot2)
library(caret)
library(pander)
#library(dplyr)
library(randomForest)
library(doMC)

registerDoMC(cores = 4)
modelNo <- 0
resultsdf <- data.frame(ModelID = numeric(0), OOBModelError = numeric(0), OOBTestError = numeric(0), Score = numeric(0))

score <- function(pred) {
    correct <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A", "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
    score <- sum(pred == correct)
    score
}

trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

if(!file.exists("pml-training.csv")) {
    download.file(trainurl, destfile = "pml-training.csv", method = "curl")
}

if(!file.exists("pml-testing.csv")) {
    download.file(testurl, destfile = "pml-testing.csv", method = "curl")
}

train <- read.csv("pml-training.csv",stringsAsFactors = F)
test <- read.csv("pml-testing.csv",stringsAsFactors = F)

Minimal, runnable code:

# some variables which seem to be numerical are read in as character
cv <- names(train[,sapply(train, class) == "character"])
cv <- cv[-c(1,2,3,37)]

for(i in cv){
    train[,i] <- as.numeric(train[,i])
}

cmtr <- colMeans(apply(train,2, is.na))
cmtst <- colMeans(apply(test,2, is.na))
# There is a pattern in missing data. 
unique(cmtr) # data is either complete or between 98% and 100% is missing
unique(cmtst)  # data is either complete or 100% is missing

sum(cmtr > 0)
sum(cmtst > 0)

# As the test data contains variables with 100% missing data, we remove them
# since they will not provide any value

missing <- rbind(cmtr, cmtst)
# Check if the missing data affects the same column (if yes, the result is 0)
difcol <- sum(xor(cmtr, cmtst))

if(difcol == 0){
    train <- train[,missing[1,]*missing[2,] == 0]
    test <- test[,missing[1,]*missing[2,] == 0]
} else
    cat("Variables with missing data don't coincide between train and quizz set")

predtrain <- train[,-c(1:7)]
predtrain[,which(names(predtrain) == "classe")] <- as.factor(predtrain[,which(names(predtrain) == "classe")])
predtest <- test[,-c(1:7)]

dim(predtrain)
dim(predtest)

nzv <- nearZeroVar(predtrain, saveMetrics= TRUE)
grepl("TRUE", nzv[,3:4])

# Check for correlated predictors
descrCor <- cor(predtrain[,-which(names(predtrain) == "classe")])
summary(descrCor[upper.tri(descrCor)])

highlyCorDescr <- findCorrelation(descrCor, cutoff = .75)
numofcor <- length(highlyCorDescr)
numofcor

predtrain2 <- predtrain[,-highlyCorDescr]
predtest2 <- predtest[,-highlyCorDescr]
descrCor <- cor(predtrain2[,-which(names(predtrain2) == "classe")])
summary(descrCor[upper.tri(descrCor)])

# As there are highly correlated variables in the training data set, an alternative to excluding them
# is to pre-process the training and quiz sets with Principal Component Analysis (PCA)
# before training the model.

set.seed(987687674)
inTrain <- createDataPartition(y=predtrain$classe, p=0.9, list=FALSE)
ptrTraining <- predtrain[inTrain,]
ptrTesting <- predtrain[-inTrain,]

ctrl <- trainControl(method = "cv", allowParallel = T)
set.seed(728665723)
modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca", 
                   trControl = ctrl, data=ptrTraining)
oob3 <- 100*(1-modelFit3$results[modelFit3$results$mtry ==
                                     as.integer(modelFit3$bestTune), 2])

confmat3 <- confusionMatrix(ptrTesting$classe,predict(modelFit3,ptrTesting))
testoob3 <- 100*(1-confmat3$overall[1])
score3 <- score(predict(modelFit3, predtest))

Session Info:

R version 3.4.4 (2018-03-15) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.5

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale: [1] pl_PL.UTF-8/pl_PL.UTF-8/pl_PL.UTF-8/C/pl_PL.UTF-8/pl_PL.UTF-8

attached base packages: [1] parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] caret_6.0-80 doMC_1.3.5 iterators_1.0.9 foreach_1.4.4 randomForest_4.6-14 [6] pander_0.6.1 lattice_0.20-35 ggplot2_2.2.1

loaded via a namespace (and not attached): [1] Rcpp_0.12.17 lubridate_1.7.4 tidyr_0.8.1 class_7.3-14 assertthat_0.2.0
[6] digest_0.6.15 ipred_0.9-6 psych_1.8.4 R6_2.2.2 plyr_1.8.4
[11] magic_1.5-8 stats4_3.4.4 e1071_1.6-8 httr_1.3.1 pillar_1.2.3
[16] rlang_0.2.1 curl_3.2 lazyeval_0.2.1 kernlab_0.9-26 rpart_4.1-13
[21] Matrix_1.2-14 devtools_1.13.5 splines_3.4.4 CVST_0.2-2 ddalpha_1.3.3
[26] gower_0.1.2 stringr_1.3.1 foreign_0.8-70 munsell_0.5.0 broom_0.4.4
[31] compiler_3.4.4 pkgconfig_2.0.1 mnormt_1.5-5 dimRed_0.1.0 nnet_7.3-12
[36] tidyselect_0.2.4 tibble_1.4.2 prodlim_2018.04.18 DRR_0.0.3 codetools_0.2-15
[41] RcppRoll_0.3.0 dplyr_0.7.5 withr_2.1.2 MASS_7.3-50 recipes_0.1.3
[46] ModelMetrics_1.1.0 grid_3.4.4 nlme_3.1-137 gtable_0.2.0 git2r_0.21.0.9000
[51] magrittr_1.5 scales_0.5.0 stringi_1.2.3 reshape2_1.4.3 bindrcpp_0.2.2
[56] timeDate_3043.102 robustbase_0.93-0 geometry_0.3-6 lava_1.6.1 tools_3.4.4
[61] glue_1.2.0 DEoptimR_1.0-8 purrr_0.2.5 sfsmisc_1.1-2 abind_1.4-5
[66] survival_2.42-3 yaml_2.1.19 colorspace_1.3-2 memoise_1.1.0 knitr_1.20
[71] bindr_0.1.1

topepo commented 6 years ago

Can you make a small reproducible example and run sequentially?

marioem commented 6 years ago

Hi Max,

Please find below an R script that is as small as possible; I can't make it any smaller, as the data needs some initial clean-up. The code was executed in a clean, fresh session, and the error was reproduced:

Restarting R session...

> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:ggplot2':

    margin

> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(cores = 4)
> trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
> testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
> if(!file.exists("pml-training.csv")) {
+     download.file(trainurl, destfile = "pml-training.csv", method = "curl")
+ }
> if(!file.exists("pml-testing.csv")) {
+     download.file(testurl, destfile = "pml-testing.csv", method = "curl")
+ }
> train <- read.csv("pml-training.csv", stringsAsFactors = F)
> test <- read.csv("pml-testing.csv", stringsAsFactors = F)
> # some variables which seem to be numerical are read in as character
> cv <- names(train[,sapply(train, class) == "character"])
> cv <- cv[-c(1,2,3,37)]
> for(i in cv){
+     train[,i] <- as.numeric(train[,i])
+ }
There were 33 warnings (use 'warnings()' to see them)
> cmtr <- colMeans(apply(train, 2, is.na))
> cmtst <- colMeans(apply(test, 2, is.na))
> missing <- rbind(cmtr, cmtst)
> # Check if the missing data affects the same column (if yes, the result is 0)
> difcol <- sum(xor(cmtr, cmtst))
> if(difcol == 0){
+     train <- train[,missing[1,]*missing[2,] == 0]
+     test <- test[,missing[1,]*missing[2,] == 0]
+ } else
+     cat("Variables with missing data don't coincide between train and quiz set")
> predtrain <- train[,-c(1:7)]
> predtrain[,which(names(predtrain) == "classe")] <- as.factor(predtrain[,which(names(predtrain) == "classe")])
> predtest <- test[,-c(1:7)]
> set.seed(987687674)
> inTrain <- createDataPartition(y=predtrain$classe, p=0.9, list=FALSE)
> ptrTraining <- predtrain[inTrain,]
> ptrTesting <- predtrain[-inTrain,]
> ctrl <- trainControl(method = "cv", allowParallel = T)
> set.seed(728665723)
> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining)
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE): undefined columns selected

BRs,

Mariusz


topepo commented 6 years ago

So it looks like the issue is that randomForest complains when mtry is larger than the number of columns. This happens because train determines the tuning grid from the number of columns in the original data, and you are using PCA, which reduces that number.

You should always look at the warnings:

Aggregating results
Selecting tuning parameters
Fitting mtry = 2 on full training set
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) : 
  undefined columns selected
In addition: There were 20 warnings (use warnings() to see them)
> warnings() 
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
  invalid mtry: reset to within valid range

So set the mtry values lower than the number of columns the model will actually see. You can run preProcess on the data to check how many components normally get generated.
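For instance, a minimal sketch of that check, reusing the ptrTraining and ctrl objects from the script above (the mtry candidates in the grid are only illustrative):

# Fit the PCA pre-processor on the predictors only (column 53 is `classe`)
# to see how many components the model will actually receive
pp <- preProcess(ptrTraining[, -53], method = "pca")
pp   # the print method reports the number of components retained

# Keep the mtry candidates at or below that component count
set.seed(728665723)
modelFit3 <- train(
  classe ~ .,
  data = ptrTraining,
  method = "rf",
  preProcess = "pca",
  trControl = ctrl,
  tuneGrid = data.frame(mtry = c(2, 5, 10, 25))
)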

marioem commented 6 years ago

Hi Max,

I'm not getting any warnings related to randomForest when I run the code I've provided in this report. Consequently, I had no hint about mtry, so why would I have changed it?

This is what happens when I pass mtry to randomForest through train:

> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining, mtry = 2)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :3     NA's   :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.

> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining, mtry = 1)
Something is wrong; all the Accuracy metric values are missing (same summary as above).
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.

So I cannot pass any lower, non-trivial value this way.

When it comes to using preProcess to estimate the number of columns:

> pp <- preProcess(ptrTraining[,-53], method = "pca")
> pp
Created from 17662 samples and 52 variables

Pre-processing:

PCA needed 25 components to capture 95 percent of the variance

> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining, mtry = 25)
Something is wrong; all the Accuracy metric values are missing (same summary as above).
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.

As this is a classification case, let's try what would be the default mtry value (sqrt(25) = 5):

> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining, mtry = 5)
Something is wrong; all the Accuracy metric values are missing (same summary as above).
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.

There are no issues when using randomForest directly.
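For context, a hedged sketch of what fitting randomForest directly on the PCA scores could look like, reusing the pp object computed above (the mtry value and the rfDirect name are only for this sketch):

# Project the predictors onto the retained principal components and hand
# the scores straight to randomForest, bypassing train()
trainPC <- predict(pp, ptrTraining[, -53])
set.seed(728665723)
rfDirect <- randomForest(x = trainPC, y = ptrTraining$classe, mtry = 5)
rfDirect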

Did you run modified code so that you got that much feedback from caret?

BRs,

Mariusz


topepo commented 6 years ago

I’m not getting any warnings related to randomForest when I run the code I’ve provided in this report.

Are you sure? This is one reason that we advise people to run without parallelism when submitting an issue; it can sometimes obscure the warnings. If you look at the output that I showed from running your example, it has "In addition: There were 20 warnings (use warnings() to see them)" so you would have to run that command to see them.
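For instance, a minimal sketch of a sequential re-run, reusing the objects from your script (ctrl_seq is just an illustrative name; the call still errors, but the warnings now surface in the main session):

# Disable the parallel backend so warnings from the workers are not swallowed
ctrl_seq <- trainControl(method = "cv", allowParallel = FALSE)

set.seed(728665723)
modelFit3 <- train(ptrTraining$classe ~ ., method = "rf", preProcess = "pca",
                   trControl = ctrl_seq, data = ptrTraining)

# After the error, inspect the accumulated warnings
warnings()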

> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining, mtry = 2)

That's not how you set tuning parameters in caret. See the documentation.

When set correctly, it worked for me:

> set.seed(728665723)
> modelFit3 <- train(
+   classe ~ .,
+   method = "rf",
+   preProcess = "pca",
+   trControl = ctrl,
+   data = ptrTraining,
+   tuneGrid = data.frame(mtry = 5)
+ )
> 
> modelFit3
Random Forest 

17662 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

Pre-processing: principal component signal extraction (52), centered (52), scaled (52) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 15898, 15895, 15896, 15896, 15894, 15897, ... 
Resampling results:

  Accuracy   Kappa    
  0.9787685  0.9731426

Tuning parameter 'mtry' was held constant at a value of 5

Note the difference in the formula syntax too. You shouldn't reference the data frame in the formula; that can sometimes lead to errors.
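For illustration, a hedged side-by-side of the two formula styles (the modelFit_bad / modelFit_ok names are only for this sketch):

# Discouraged: the formula references the data frame directly
modelFit_bad <- train(ptrTraining$classe ~ ., data = ptrTraining, method = "rf",
                      preProcess = "pca", trControl = ctrl,
                      tuneGrid = data.frame(mtry = 5))

# Preferred: name the outcome column only and let `data` supply it
modelFit_ok <- train(classe ~ ., data = ptrTraining, method = "rf",
                     preProcess = "pca", trControl = ctrl,
                     tuneGrid = data.frame(mtry = 5))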

marioem commented 6 years ago

Yes, I'm sure. The warnings appear only when allowParallel is set to FALSE; there are no warnings when it is set to TRUE. But hey, it's the 21st century :-), every CPU is multicore now.

> ctrl <- trainControl(method = "cv", allowParallel = F)
> set.seed(728665723)
> modelFit3 <- train(ptrTraining$classe ~ ., method="rf", preProcess="pca",
+                    trControl = ctrl, data=ptrTraining)
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE): undefined columns selected
In addition: There were 20 warnings (use 'warnings()' to see them)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
  invalid mtry: reset to within valid range
[warnings 2 through 20 are identical]


topepo commented 6 years ago

The warnings appear only when allowParallel is set to FALSE; there are no warnings when it is set to TRUE.

That was my point.

marioem commented 6 years ago

Hi Max,

Closed? Whether allowParallel is set to TRUE or FALSE, the error is still present. So what's the verdict? Not a bug?

BRs,

Mariusz

Message written by Max Kuhn (notifications@github.com) on 02.07.2018, at 17:17:

Closed #904 https://github.com/topepo/caret/issues/904.

topepo commented 6 years ago

Sorry; I thought that my worked example solved it for you.

Not really a bug, just not obvious.

Basically, since you are using PCA, the data that the model receives has fewer columns than the original. If you don't specify mtry, train tries to make a grid for you from the original data (which is documented).

The solution is to set mtry to a smaller number:

> set.seed(728665723)
> modelFit3 <- train(
+   classe ~ .,
+   method = "rf",
+   preProcess = "pca",
+   trControl = ctrl,
+   data = ptrTraining,
+   tuneGrid = data.frame(mtry = 5)
+ )
marioem commented 6 years ago

But this code worked with the caret version that was current two years ago, without the need to specify mtry, so it was somehow taking care of setting mtry properly down the line. I just made a quick scan through the train help page and can't see any hint about that. Do you plan any documentation update to cover this case?
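For reference, a hedged sketch of how the default mtry candidates relate to the pre-PCA predictor count, assuming caret's documented var_seq() helper is what builds the grid (the exact internal call and the resulting values may differ between caret versions):

# train() builds the default random-forest grid from the number of columns in
# the original data (52 predictors here), not from the post-PCA data, so the
# largest default mtry candidate can exceed the ~25 retained PCA components.
caret::var_seq(p = 52, classification = TRUE, len = 3)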

BRs,

Mariusz


oattah1 commented 5 years ago

Hi, I am getting the same error, but I am using the lm method:

library(readxl)   # read_excel() comes from the readxl package
library(caret)

data.excel <- read_excel("Methotrexate and Vincristine Synergy Database.xlsx")

data.trial <- data.frame(data.excel)

synergy <- data.trial[, 8]

metho.dose.conc <- data.trial[, 4]

vinc.dose.conc <- data.trial[, 7]

data.for.synergy.1 <- data.trial[, c(4, 7, 8)]

model <- train(synergy ~ metho.dose.conc + vinc.dose.conc, data.for.synergy.1,
               method = "lm",
               trControl = trainControl(method = "cv", number = 10,
                                        verboseIter = TRUE,
                                        savePredictions = TRUE,
                                        classProbs = TRUE))
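A hedged guess at a fix, along the lines of the formula advice above: keep the predictors as named columns inside the data frame passed to train, and drop classProbs for a regression model (the column renaming below is only illustrative; the actual spreadsheet column names may differ):

library(readxl)
library(caret)

data.excel <- read_excel("Methotrexate and Vincristine Synergy Database.xlsx")
data.trial <- data.frame(data.excel)

# Keep the three columns of interest inside one data frame and give them
# names that the formula can reference directly
data.for.synergy.1 <- data.trial[, c(4, 7, 8)]
names(data.for.synergy.1) <- c("metho.dose.conc", "vinc.dose.conc", "synergy")

model <- train(synergy ~ metho.dose.conc + vinc.dose.conc,
               data = data.for.synergy.1, method = "lm",
               trControl = trainControl(method = "cv", number = 10,
                                        verboseIter = TRUE,
                                        savePredictions = TRUE))
model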