topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models
http://topepo.github.io/caret/index.html

caret seems to be really slow #108

Closed blackfist closed 9 years ago

blackfist commented 9 years ago

I'm not sure what I'm doing wrong, but it seems that training a model using randomForest is much, much faster if I just use the randomForest() function rather than train(). I'm using R version 3.1.2 and the latest version of caret available on CRAN. Here is a reproducible example, taken from the Johns Hopkins Data Science specialization on Coursera.

if (!file.exists("pml-training.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv", method="curl")
}
if (!file.exists("pml-testing.csv")) {
  download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv", method="curl")
}
set.seed(75681)

library(caret)
library("randomForest")
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

# Remove the timestamps and user_name
training <- training[,-c(1,2,3,4,5)]
testing <- testing[,-c(1,2,3,4,5)]

# Remove the columns with near-zero variance. I remove the same columns from the testing set
# as from the training set to ensure that the two data sets are consistent.
nzv <- nearZeroVar(training)
training <- training[,-nzv]
testing <- testing[,-nzv]

# There are a lot of columns that are NA for every value. Let's just remove them since most
# machine learning algorithms don't like NA values anyway.
mostly_na <- apply(training, 2, function(x) { sum(is.na(x)) } )
training <- training[,which(mostly_na==0)]
testing <- testing[,which(mostly_na==0)]

# Since I'm slashing columns out left and right, why not take out any
# highly correlated variables
muchCor <- findCorrelation(cor(training[, 1:(ncol(training) - 1)]), cutoff = 0.8)
training <- training[,-muchCor]
testing <- testing[,-muchCor]

# Finally, we will only keep the complete cases from the two data sets.
# Instead of removing these cases we might have to do some imputation.
training <- training[complete.cases(training),]
testing <- testing[complete.cases(testing),]

# Now divide the training set up into modelTraining and modelTesting
randomSelection <- createDataPartition(training$classe, p = 0.7, list = FALSE)
modelTraining <- training[randomSelection, ]
modelTesting <- training[-randomSelection, ]

###
### Now it's time to train some models and see how they perform ###
start <- Sys.time()
rfModel <- randomForest(modelTraining[, 1:(ncol(modelTraining) - 1)], modelTraining[, ncol(modelTraining)])
Sys.time() - start

That last line reported 30.19 seconds of wall time to train the model. Then I ran this:

start <- Sys.time()
rfModel <- train(classe ~ ., data=modelTraining, method="rf", verbose=FALSE)
Sys.time() - start

I finally stopped it after 12 minutes had passed; I don't know how long it would have kept running. Obviously this is a really big difference in run time, and I'm wondering if I'm doing something wrong in the way I'm calling train() or if there is an actual bug somewhere.

zachmayer commented 9 years ago

By default, caret fits the model 76 times: it chooses 3 values of mtry and fits each one to 25 bootstrap samples of the dataset. It then evaluates the bootstrap resamples, chooses a value of mtry, and fits a final model using that value.
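The arithmetic behind those defaults, and one way to shrink the number of fits, can be sketched as follows (assuming the `modelTraining` data frame built in the example above; `mtry = 7` is an arbitrary illustration, not a recommendation):

```r
library(caret)

# Defaults for method = "rf": tuneLength = 3 candidate mtry values,
# each fit to 25 bootstrap resamples, plus one final refit:
3 * 25 + 1  # 76 fits

# Fewer resamples and a single fixed mtry cut this to 5 + 1 fits:
ctrl <- trainControl(method = "cv", number = 5)
rfModel <- train(classe ~ ., data = modelTraining, method = "rf",
                 trControl = ctrl, tuneGrid = data.frame(mtry = 7))
```

Any single-row tuneGrid keeps the search to one candidate per resample.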


On Sun, Jan 25, 2015 at 12:24 AM, Kevin Thompson notifications@github.com wrote:


topepo commented 9 years ago

Yes. You are comparing the weight of a single apple to that of a bag of apples. Use type = "none" in trainControl to make an appropriate comparison.

blackfist commented 9 years ago

Thank you!

enitihas commented 7 years ago

@topepo I am assuming you meant method="none". Using method="none" gives me the following error:

Error in train.default(x, y, weights = w, ...) : Only one model should be specified in tuneGrid with no resampling

R version: 3.2.3, caret version: 6.0-73

topepo commented 7 years ago

@enitihas if you just use randomForest, it fits one model. If you use train, it will fit more models based on the product of tuneLength (or nrow(tuneGrid)) * the number of resamples. If you don't resample, then you can't tune.

"Only one model should be specified in tuneGrid with no resampling" means that you should either set tuneLength = 1 or use tuneGrid = data.frame(mtry = XXXXX) where XXXXX is some appropriate number.
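Putting that together, a minimal no-resampling call might look like this (a sketch, reusing the `modelTraining` data frame from the example above; the mtry value is just randomForest's classification default, floor(sqrt(p))):

```r
library(caret)

# With method = "none", train() does no resampling and fits exactly one
# model, like a plain randomForest() call -- so the tuning grid must
# contain a single row.
fitControl <- trainControl(method = "none")
rfModel <- train(classe ~ ., data = modelTraining, method = "rf",
                 trControl = fitControl,
                 tuneGrid = data.frame(mtry = floor(sqrt(ncol(modelTraining) - 1))))
```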

enitihas commented 7 years ago

Thank you. Now I understand what was going on.