Closed marioem closed 6 years ago
Can you make a small reproducible example and run sequentially?
Hi Max,
Please find attached an R script that is as small as possible. I can't make it any smaller because the data needs some initial cleanup. The code was executed in a clean, fresh session, and the error was reproduced:
Restarting R session...
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
    margin
> library(doMC)
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
> registerDoMC(cores = 4)
> trainurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
> testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
> if(!file.exists("pml-training.csv")) {
+     download.file(trainurl, destfile = "pml-training.csv", method = "curl")
+ }
> if(!file.exists("pml-testing.csv")) {
+     download.file(testurl, destfile = "pml-testing.csv", method = "curl")
+ }
> train <- read.csv("pml-training.csv",stringsAsFactors = F)
> test <- read.csv("pml-testing.csv",stringsAsFactors = F)
> # some variables which seem to be numerical are read in as character
> cv <- names(train[,sapply(train, class) == "character"])
> cv <- cv[-c(1,2,3,37)]
> for(i in cv){
+     train[,i] <- as.numeric(train[,i])
+ }
There were 33 warnings (use warnings() to see them)
> cmtr <- colMeans(apply(train,2, is.na))
> cmtst <- colMeans(apply(test,2, is.na))
> missing <- rbind(cmtr, cmtst)
> # Check if the missing data affects the same column (if yes, the result is 0)
> difcol <- sum(xor(cmtr, cmtst))
> if(difcol == 0){
+     train <- train[,missing[1,]*missing[2,] == 0]
+     test <- test[,missing[1,]*missing[2,] == 0]
+ } else
+     cat("Variables with missing data don't coincide between train and quizz set")
> predtrain <- train[,-c(1:7)]
> predtrain[,which(names(predtrain) == "classe")] <- as.factor(predtrain[,which(names(predtrain) == "classe")])
> predtest <- test[,-c(1:7)]
> set.seed(987687674)
> inTrain <- createDataPartition(y=predtrain$classe, p=0.9, list=FALSE)
> ptrTraining <- predtrain[inTrain,]
> ptrTesting <- predtrain[-inTrain,]
> ctrl <- trainControl(method = "cv", allowParallel = T)
> set.seed(728665723)
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining)
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
  undefined columns selected
BRs,
Mariusz
On 24.06.2018 at 00:49, Max Kuhn notifications@github.com wrote:
Can you make a small reproducible example and run sequentially?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/topepo/caret/issues/904#issuecomment-399715539, or mute the thread https://github.com/notifications/unsubscribe-auth/ALrAuOLPpTFhvqSWL5LN3nggHgMjOne6ks5t_sX7gaJpZM4U0EPo.
So it looks like the issue is that ranger freaks out if mtry is larger than the number of columns. This is an issue since train determines the number of columns from the original data and you are using PCA.
You should always look at the warnings:
Aggregating results
Selecting tuning parameters
Fitting mtry = 2 on full training set
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
In addition: There were 20 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
invalid mtry: reset to within valid range
So set the number of columns lower. You can run preProcess on the data to check and see how many columns normally get generated.
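To make the column-count check concrete, here is a minimal base R sketch. It assumes that PCA on centred and scaled predictors with a 95 percent variance threshold roughly mirrors what preProcess(method = "pca") does; the iris columns are hypothetical stand-in data, not the data from this issue.

```r
# Hypothetical stand-in data: the four numeric columns of the built-in iris set.
x <- iris[, 1:4]

# Centre and scale, then run PCA (assumed to roughly mirror preProcess with "pca").
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components that reaches the 95 percent threshold.
n_comp <- which(cum_var >= 0.95)[1]

cat("original columns:", ncol(x), "\n")
cat("components kept: ", n_comp, "\n")
```

The model only sees the retained components, so any mtry value should not exceed that count rather than the original number of columns.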
Hi Max,
I’m not getting any warnings related to randomForest when I run the code I’ve provided in this report. Consequently I’ve got no hint about mtry, so why should I have changed it?
This is what happens when I pass mtry to randomForest through train:
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining, mtry = 2)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :3     NA's   :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining, mtry = 1)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :3     NA's   :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
So I cannot set any non-trivial lower number of columns.
When it comes to using preProcess to estimate the number of columns:
> pp <- preProcess(ptrTraining[,-53],method = "pca")
> pp
Created from 17662 samples and 52 variables
Pre-processing:
PCA needed 25 components to capture 95 percent of the variance
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining, mtry = 25)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :3     NA's   :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
As this is a classification case, let’s try the 'would be' default mtry value (sqrt(25) = 5)
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining, mtry = 5)
Something is wrong; all the Accuracy metric values are missing:
    Accuracy       Kappa
 Min.   : NA   Min.   : NA
 1st Qu.: NA   1st Qu.: NA
 Median : NA   Median : NA
 Mean   :NaN   Mean   :NaN
 3rd Qu.: NA   3rd Qu.: NA
 Max.   : NA   Max.   : NA
 NA's   :3     NA's   :3
Error: Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
There are no issues when using randomForest directly.
Did you run modified code to get that much feedback from caret?
BRs,
Mariusz
(In reply to Max Kuhn's message of 01.07.2018, 02:10.)
I’m not getting any warnings related to randomForest when I run the code I’ve provided in this report.
Are you sure? This is one reason that we advise people to run without parallelism when submitting an issue; it can sometimes obscure the warnings. If you look at the output that I showed from running your example, it has "In addition: There were 20 warnings (use warnings() to see them)" so you would have to run that command to see them.
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+ trControl = ctrl, data=ptrTraining, mtry = 2)
That's not how you set tuning parameters in caret. See the documentation: https://topepo.github.io/caret/model-training-and-tuning.html#alternate-tuning-grids
When set correctly, it worked for me:
> set.seed(728665723)
> modelFit3 <- train(
+ classe ~ .,
+ method = "rf",
+ preProcess = "pca",
+ trControl = ctrl,
+ data = ptrTraining,
+ tuneGrid = data.frame(mtry = 5)
+ )
>
> modelFit3
Random Forest
17662 samples
52 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
Pre-processing: principal component signal extraction (52), centered (52), scaled (52)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 15898, 15895, 15896, 15896, 15894, 15897, ...
Resampling results:
Accuracy Kappa
0.9787685 0.9731426
Tuning parameter 'mtry' was held constant at a value of 5
Note the difference in the formula syntax too. You shouldn't reference the data frame in the formula; that can sometimes lead to errors.
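A small base R sketch of why referencing the data frame inside the formula can go wrong. The toy data frame `df` is hypothetical, and linking this to the "undefined columns selected" error above is an assumption based on the call shown in that error message, not something confirmed in this thread.

```r
# Hypothetical toy data frame standing in for ptrTraining.
df <- data.frame(y  = factor(c("A", "B", "A", "B")),
                 x1 = c(1, 2, 3, 4),
                 x2 = c(4, 3, 2, 1))

# Formula that references the data frame on the left-hand side.
bad_vars <- all.vars(terms(df$y ~ ., data = df))

# Formula that uses the bare column name.
good_vars <- all.vars(terms(y ~ ., data = df))

print(bad_vars)   # includes the name "df", which is not a column of df
print(good_vars)  # contains column names only

# Selecting columns by these names succeeds only for the second form.
bad_ok  <- !inherits(try(df[, bad_vars,  drop = FALSE], silent = TRUE), "try-error")
good_ok <- !inherits(try(df[, good_vars, drop = FALSE], silent = TRUE), "try-error")
print(c(bad_ok = bad_ok, good_ok = good_ok))
```

Writing `classe ~ .` with `data = ptrTraining` keeps every variable name in the formula resolvable as a column.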
Yes, I’m sure. The warnings appear only when allowParallel is set to FALSE; there are none when it is set to TRUE. But hey, it’s the 21st century :-), every CPU is now multicore.
> ctrl <- trainControl(method = "cv", allowParallel = F)
> set.seed(728665723)
> modelFit3 <- train(ptrTraining$classe ~ .,method="rf", preProcess="pca",
+                 trControl = ctrl, data=ptrTraining)
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
  undefined columns selected
In addition: There were 20 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In randomForest.default(x, y, mtry = param$mtry, ...) :
  invalid mtry: reset to within valid range
[warnings 2 through 20 repeat the same message]
(In reply to Max Kuhn's message of 01.07.2018, 18:45.)
The warnings appear only when allowParallel is set to FALSE, but no warnings when set to TRUE.
That was my point.
Hi Max,
Closed? Whether allowParallel is set to TRUE or FALSE, the error is still present. So what’s the verdict? Not a bug?
BRs,
Mariusz
On 02.07.2018 at 17:17, Max Kuhn notifications@github.com wrote:
Closed #904 https://github.com/topepo/caret/issues/904.
Sorry; I thought that my worked example solved it for you.
Not really a bug, just not obvious.
Basically, since you are using PCA, the data that the model receives has fewer columns than the original. If you don't specify mtry, train tries to make a grid for you from the original data (which is documented).
The solution is to set mtry to a smaller number:
> set.seed(728665723)
> modelFit3 <- train(
+ classe ~ .,
+ method = "rf",
+ preProcess = "pca",
+ trControl = ctrl,
+ data = ptrTraining,
+ tuneGrid = data.frame(mtry = 5)
+ )
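The arithmetic behind this can be sketched as follows. The evenly spaced three-value grid is an assumption about how the default mtry candidates are generated for this model, not something stated in this thread; the other numbers (52 predictors, 25 components, 10 folds) come from the output above.

```r
# Numbers taken from the thread.
n_predictors <- 52   # columns before PCA
n_components <- 25   # columns after PCA (95 percent of the variance)
n_folds      <- 10   # 10-fold cross-validation

# Assumption: three candidate values spread evenly from 2 to n_predictors.
grid <- floor(seq(2, n_predictors, length.out = 3))
print(grid)

# Candidates larger than the number of PCA components are invalid for the model.
too_large <- grid[grid > n_components]
print(too_large)

# One "invalid mtry" warning per invalid candidate per fold.
n_warnings <- length(too_large) * n_folds
print(n_warnings)
```

Under that assumption, two of the three candidates exceed the 25 available components, which would be consistent with the 20 "invalid mtry" warnings reported for the sequential run.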
But this code worked with the caret version that was current two years ago without the need to specify mtry, so something down the line was setting mtry properly. I made a quick scan through the train help page and can’t see any hint of that. Do you plan any documentation update to cover this case?
BRs,
Mariusz
(In reply to Max Kuhn's message of 02.07.2018, 22:18.)
Hi, I am getting the same error, but I am using the lm method:
`data.excel <- read_excel("Methotrexate and Vincristine Synergy Database.xlsx")
data.trial <-data.frame(data.excel)
synergy <- data.trial[,8]
metho.dose.conc <- data.trial[, 4]
vinc.dose.conc <- data.trial[, 7]
data.for.synergy.1 <- data.trial[, c(4,7,8)]
model <- train(synergy ~ metho.dose.conc + vinc.dose.conc, data.for.synergy.1, method= "lm", trControl = trainControl(method = 'cv', number =10, verboseIter = TRUE, savePredictions = TRUE, classProbs = TRUE))`
Hi,
I'm trying to re-run the script which was working perfectly for me 2 years ago, but now I'm getting this error from train. Using caret 6.0-80, reinstalled from GitHub today.
Minimal dataset:
Minimal, runnable code:
Session Info:
R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] pl_PL.UTF-8/pl_PL.UTF-8/pl_PL.UTF-8/C/pl_PL.UTF-8/pl_PL.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-80 doMC_1.3.5 iterators_1.0.9 foreach_1.4.4 randomForest_4.6-14
[6] pander_0.6.1 lattice_0.20-35 ggplot2_2.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 lubridate_1.7.4 tidyr_0.8.1 class_7.3-14 assertthat_0.2.0
[6] digest_0.6.15 ipred_0.9-6 psych_1.8.4 R6_2.2.2 plyr_1.8.4
[11] magic_1.5-8 stats4_3.4.4 e1071_1.6-8 httr_1.3.1 pillar_1.2.3
[16] rlang_0.2.1 curl_3.2 lazyeval_0.2.1 kernlab_0.9-26 rpart_4.1-13
[21] Matrix_1.2-14 devtools_1.13.5 splines_3.4.4 CVST_0.2-2 ddalpha_1.3.3
[26] gower_0.1.2 stringr_1.3.1 foreign_0.8-70 munsell_0.5.0 broom_0.4.4
[31] compiler_3.4.4 pkgconfig_2.0.1 mnormt_1.5-5 dimRed_0.1.0 nnet_7.3-12
[36] tidyselect_0.2.4 tibble_1.4.2 prodlim_2018.04.18 DRR_0.0.3 codetools_0.2-15
[41] RcppRoll_0.3.0 dplyr_0.7.5 withr_2.1.2 MASS_7.3-50 recipes_0.1.3
[46] ModelMetrics_1.1.0 grid_3.4.4 nlme_3.1-137 gtable_0.2.0 git2r_0.21.0.9000
[51] magrittr_1.5 scales_0.5.0 stringi_1.2.3 reshape2_1.4.3 bindrcpp_0.2.2
[56] timeDate_3043.102 robustbase_0.93-0 geometry_0.3-6 lava_1.6.1 tools_3.4.4
[61] glue_1.2.0 DEoptimR_1.0-8 purrr_0.2.5 sfsmisc_1.1-2 abind_1.4-5
[66] survival_2.42-3 yaml_2.1.19 colorspace_1.3-2 memoise_1.1.0 knitr_1.20
[71] bindr_0.1.1