topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

caret seems to be really slow #108

Closed blackfist closed 9 years ago

blackfist commented 9 years ago

I'm not sure what I'm doing wrong, but it seems that training a model using randomForest is much, much faster if I just use the randomForest() function rather than train(). I'm using R version 3.1.2 and the latest version of caret available on CRAN. Here is a reproducible example, taken from the Johns Hopkins Data Science specialization on Coursera.

if (!file.exists("pml-training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv", method="curl")
}
if (!file.exists("pml-testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv", method="curl")
}
set.seed(75681)

library(caret)
library("randomForest")
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

# Remove the timestamps and user_name
training <- training[,-c(1,2,3,4,5)]
testing <- testing[,-c(1,2,3,4,5)]

# Remove the columns with near-zero variance. I remove the same columns from the testing set
# as from the training set to ensure that the two data sets are consistent.
nzv <- nearZeroVar(training)
training <- training[,-nzv]
testing <- testing[,-nzv]

# There are a lot of columns that are NA for every value. Let's just remove them since most
# machine learning algorithms don't like NA values anyway.
mostly_na <- apply(training, 2, function(x) { sum(is.na(x)) } )
training <- training[,which(mostly_na==0)]
testing <- testing[,which(mostly_na==0)]

# Since I'm slashing columns out left and right, why not take out any
# highly correlated variables
muchCor <- findCorrelation(cor(training[, 1:(ncol(training) - 1)]), cutoff = 0.8)
training <- training[,-muchCor]
testing <- testing[,-muchCor]

# Finally, we will only keep the complete cases from the two data sets.
# Instead of removing these cases we might have to do some imputation.
training <- training[complete.cases(training),]
testing <- testing[complete.cases(testing),]

# Now divide the training set up into modelTraining and modelTesting
randomSelection <- createDataPartition(training$classe, p = 0.7, list = FALSE)
modelTraining <- training[randomSelection, ]
modelTesting <- training[-randomSelection, ]

###
### Now it's time to train some models and see how they perform ###
start <- Sys.time()
rfModel <- randomForest(modelTraining[, 1:(ncol(modelTraining) - 1)], modelTraining[, ncol(modelTraining)])
Sys.time() - start

That last line reported 30.19 seconds of wall time to train the model. Then I ran this:

start <- Sys.time()
rfModel <- train(classe ~ ., data=modelTraining, method="rf", verbose=FALSE)
Sys.time() - start

I finally stopped it after 12 minutes had passed; I don't know how long it would have kept running. Obviously this is a really big difference in run time, and I'm wondering if I'm doing something wrong in the way I'm calling train() or if there is an actual bug somewhere.

zachmayer commented 9 years ago

By default, caret fits the model 76 times: it chooses 3 values of mtry and fits each one to 25 bootstrap samples of the dataset. It then evaluates the bootstrap resamples, chooses a value of mtry, and fits a final model using that value.
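The arithmetic behind those defaults, and one way to shrink the number of fits, can be sketched as follows (assuming the `modelTraining` data frame built in the example above; `mtry = 7` is an arbitrary illustration, not a recommendation):

```r
library(caret)

# Defaults for method = "rf": tuneLength = 3 candidate mtry values,
# each fit to 25 bootstrap resamples, plus one final refit:
3 * 25 + 1  # 76 fits

# Fewer resamples and a single fixed mtry cut this to 5 + 1 fits:
ctrl <- trainControl(method = "cv", number = 5)
rfModel <- train(classe ~ ., data = modelTraining, method = "rf",
                 trControl = ctrl, tuneGrid = data.frame(mtry = 7))
```

Any single-row tuneGrid keeps the search to one candidate per resample.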


On Sun, Jan 25, 2015 at 12:24 AM, Kevin Thompson notifications@github.com wrote:


topepo commented 9 years ago

Yes. You are comparing the weight of a single apple to that of a bag of apples. Use type = "none" in trainControl to make an appropriate comparison.

blackfist commented 9 years ago

Thank you!

enitihas commented 7 years ago

@topepo I am assuming you meant method="none". Using method="none" gives me the following error:

Error in train.default(x, y, weights = w, ...) : Only one model should be specified in tuneGrid with no resampling

R version: 3.2.3, caret version: 6.0-73

topepo commented 7 years ago

@enitihas if you just use randomForest, it fits one model. If you use train, it will fit more models based on the product of tuneLength (or nrow(tuneGrid)) * the number of resamples. If you don't resample, then you can't tune.

"Only one model should be specified in tuneGrid with no resampling" means that you should either set tuneLength = 1 or use tuneGrid = data.frame(mtry = XXXXX) where XXXXX is some appropriate number.
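Putting that together, a minimal no-resampling call might look like this (a sketch, reusing the `modelTraining` data frame from the example above; the mtry value is just randomForest's classification default, floor(sqrt(p))):

```r
library(caret)

# With method = "none", train() does no resampling and fits exactly one
# model, like a plain randomForest() call -- so the tuning grid must
# contain a single row.
fitControl <- trainControl(method = "none")
rfModel <- train(classe ~ ., data = modelTraining, method = "rf",
                 trControl = fitControl,
                 tuneGrid = data.frame(mtry = floor(sqrt(ncol(modelTraining) - 1))))
```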

enitihas commented 7 years ago

Thank you. Now I understand what was going on.