blackfist closed this issue 9 years ago
By default, caret fits the model 76 times: it chooses 3 values of mtry and fits each one to 25 bootstrap samples of the dataset. It then evaluates the resampling results, chooses the best value of mtry, and fits one final model on the full training set using that value.
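Those defaults can be spelled out explicitly (a sketch: `method = "boot"` with `number = 25` is caret's default `trainControl` resampling scheme, and `tuneLength = 3` is the default grid size for `method = "rf"`):

```r
library(caret)

# caret's defaults for train(..., method = "rf"):
#   tuneLength = 3       -> 3 candidate values of mtry
#   trainControl default -> 25 bootstrap resamples
ctrl <- trainControl(method = "boot", number = 25)

# Total fits = (candidate values x resamples) + 1 final model
fits <- 3 * 25 + 1   # 76
```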
On Sun, Jan 25, 2015 at 12:24 AM, Kevin Thompson notifications@github.com wrote:
I'm not sure what I'm doing wrong, but it seems that training a random forest model is much, much faster if I just use the randomForest() function directly rather than train().

I'm using R version 3.1.2 and the latest version of caret available on CRAN. Here is a reproducible example, taken from the Johns Hopkins Data Science specialization track on Coursera.

```r
if (!file.exists("pml-training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                "pml-training.csv", method = "curl")
}
if (!file.exists("pml-testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                "pml-testing.csv", method = "curl")
}

set.seed(75681)
library(caret)
library(randomForest)

training <- read.csv("pml-training.csv")
testing  <- read.csv("pml-testing.csv")

# Remove the times and user_name
training <- training[, -c(1, 2, 3, 4, 5)]
testing  <- testing[, -c(1, 2, 3, 4, 5)]

# Remove the columns with near-zero variance. I remove the same columns from
# the testing set as were taken from the training set to ensure that the two
# data sets are consistent.
nzv <- nearZeroVar(training)
training <- training[, -nzv]
testing  <- testing[, -nzv]

# There are a lot of columns that are NA for every value. Let's just remove
# them, since most machine learning algorithms don't like NA values anyway.
mostly_na <- apply(training, 2, function(x) sum(is.na(x)))
training <- training[, which(mostly_na == 0)]
testing  <- testing[, which(mostly_na == 0)]

# Since I'm slashing columns out left and right, why not take out any
# highly correlated variables too.
muchCor <- findCorrelation(cor(training[, 1:(ncol(training) - 1)]), cutoff = 0.8)
training <- training[, -muchCor]
testing  <- testing[, -muchCor]

# Finally, we will only keep the complete cases from the two data sets.
# Instead of removing these cases we might have to do some imputing of data.
training <- training[complete.cases(training), ]
testing  <- testing[complete.cases(testing), ]

# Now divide the training set up into modelTraining and modelTesting
randomSelection <- createDataPartition(training$classe, p = 0.7, list = FALSE)
modelTraining <- training[randomSelection, ]
modelTesting  <- training[-randomSelection, ]

###
### Now it's time to train some models and see how they perform
###
start <- Sys.time()
rfModel <- randomForest(modelTraining[, 1:(ncol(modelTraining) - 1)],
                        modelTraining[, ncol(modelTraining)])
Sys.time() - start
```
That last line returned 30.19 seconds of wall time to finish training the model. Then I ran this:

```r
start <- Sys.time()
rfModel <- train(classe ~ ., data = modelTraining, method = "rf", verbose = FALSE)
Sys.time() - start
```
I finally stopped running it after 12 minutes had passed; I don't know how long it would have gone on for. Obviously this is a really big difference in run time, and I'm wondering if I'm doing something wrong in the way I'm calling train(), or if there is an actual bug somewhere?

Reply to this email directly or view it on GitHub: https://github.com/topepo/caret/issues/108
Yes. You are comparing the weight of a single apple to that of a bag of apples. Use type = "none" in trainControl to make an appropriate comparison.
Thank you!
@topepo I am assuming you meant method = "none". Using method = "none" gives me the following error:

```
Error in train.default(x, y, weights = w, ...) :
  Only one model should be specified in tuneGrid with no resampling
```
R version: 3.2.3 caret version: 6.0-73
@enitihas if you just use randomForest, it fits one model. If you use train, it will fit more models based on the product of tuneLength (or nrow(tuneGrid)) and the number of resamples. If you don't resample, then you can't tune.

"Only one model should be specified in tuneGrid with no resampling" means that you should either set tuneLength = 1 or use tuneGrid = data.frame(mtry = XXXXX), where XXXXX is some appropriate number.
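Concretely, either option looks like this against the example data above (a sketch; mtry = 7 is an arbitrary illustrative value, not a recommendation for this dataset):

```r
library(caret)

# No resampling: fit exactly one model, like a direct randomForest() call.
ctrl <- trainControl(method = "none")

# Option 1: tuneLength = 1 lets caret pick a single default mtry.
fit1 <- train(classe ~ ., data = modelTraining, method = "rf",
              trControl = ctrl, tuneLength = 1)

# Option 2: supply the single mtry value yourself via tuneGrid.
fit2 <- train(classe ~ ., data = modelTraining, method = "rf",
              trControl = ctrl, tuneGrid = data.frame(mtry = 7))
```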
Thank you. Now I understand what was going on.