topepo / caret

caret (Classification And Regression Training) is an R package containing miscellaneous functions for training and plotting classification and regression models.
http://topepo.github.io/caret/index.html

Caret Training Time Inconsistencies for Different Methods #806

Closed: murtaza-nasir closed this issue 6 years ago

murtaza-nasir commented 6 years ago

I have been training a few different models, and different methods take wildly different training times depending on the training function used. For example, training an svmRadial model using train with 5 x 20-fold repeated cross-validation and doParallel takes about 14 seconds. Doing the same with rbf takes around 25 seconds.

With gafs, 5-fold internal and external cross-validation, and doParallel, svmRadial took 2.5 hours, which is to be expected when comparing the 5x20=400 models fitted by train against the population size and iteration count used by gafs. But rbf took 2 days and 7 hours with gafs, which is far more than expected relative to train. Similarly, with a custom implementation of particle swarm optimization that uses caret::train to fit the models, svmRadial took 50 minutes (again, as could be expected from the PSO parameter space), but rbf has been running for more than 30 hours and still hasn't completed.

Can anyone provide pointers on how to optimize training speed, or on whether some methods work better than others within these wrapper functions? My setup, data type, and code are provided here: https://github.com/topepo/caret/issues/805.

Thanks, Murtaza

topepo commented 6 years ago

If you are using the default parameters of gafs, you are fitting a #%^&-ton of models:

External resampling: 5 x 20 
  Iterations: 10
    Generations: 50
       Internal resampling: 5 x 20
           Tuning parameters: you didn't say, so T

In all, you are fitting (5 x 20)^2 * 500 * T models. Some of this can be done in parallel, but be very careful not to parallelize at both levels; otherwise you effectively square the number of workers being spawned.
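
If you do keep some parallelism, keep it to one level. A minimal sketch (the worker count is just an example, not a recommendation):

library(caret)
library(doParallel)

## Register the workers once and parallelize only the outer gafs loop
cl <- makeCluster(5)                       # e.g. one worker per external fold
registerDoParallel(cl)

ga_ctrl <- gafsControl(functions = caretGA,
                       method = "cv",
                       number = 5,
                       allowParallel = TRUE,   # external resampling in parallel
                       genParallel = FALSE)    # don't also parallelize within a generation

tr_ctrl <- trainControl(method = "cv",
                        number = 5,
                        allowParallel = FALSE) # inner train() stays sequential

## ... call gafs() with gafsControl = ga_ctrl and trControl = tr_ctrl ...

stopCluster(cl)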

murtaza-nasir commented 6 years ago

Thank you for your reply. The actual code I'm using is this:

  library(caret)
  library(doParallel)

  cl <- makeCluster(detectCores())        # one worker per core
  registerDoParallel(cl)

  rbf_grid <- expand.grid(size = c(12))   # single tuning point

  gacontrol <- gafsControl(method = "cv",
                           number = 5,
                           allowParallel = TRUE,  # parallelize external resampling
                           genParallel = TRUE,    # and within each generation
                           functions = caretGA)

  trControl <- trainControl(allowParallel = TRUE, # inner train() parallel too
                            method = "cv",
                            number = 5)

  rbfgafit <- gafs(data[, -1], data[, 1],
                   popSize = 50,
                   iters = 100,
                   pcrossover = 0.7,
                   pmutation = 0.2,
                   gafsControl = gacontrol,
                   method = "rbf",
                   maxit = 1000,
                   preProcess = c("scale", "center"),
                   trControl = trControl,
                   tuneGrid = rbf_grid)

  stopCluster(cl)

I'm not using repeatedcv for the GA, so it's more like:

5 external
 100 generations
    50 population size
      5 internal
       1 tuning parameter

That's 125,000 models. So compared to the 5 * 20 = 100 models (don't know why I wrote 400 in the original post), it should take 125000 / 100 * 45 seconds, or about 15.5 hours. But it took 55 hours. With the exact same gafs setup, svmRadial took just 2.5 hours.
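
As a sanity check of that arithmetic (the 45 seconds is my own rough wall-clock figure from the plain train runs):

## Back-of-the-envelope runtime estimate
models_gafs  <- 5 * 100 * 50 * 5 * 1   # external folds * generations * popSize * internal folds * tuning points
models_train <- 5 * 20                 # plain repeated-CV baseline
models_gafs / models_train * 45 / 3600 # ~15.6 hours expected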

I'm wondering whether some code in rbf fails to get parallelized when used inside another function like gafs. For example, with particle swarm optimization, another wrapper feature selection function that I changed to use caret::train, svmRadial took 50 minutes but rbf took 37.5 hours.

caretPSO2 <- list(
  # Fit one rbf network with 5-fold CV inside the PSO wrapper
  fit = function(x, y, ...) {
    caret::train(x, y, "rbf", maxit = 2000, preProcess = c("scale", "center"),
                 tuneGrid = expand.grid(size = c(21)),
                 trControl = caret::trainControl(method = "cv",
                                                 number = 5,
                                                 allowParallel = TRUE,
                                                 savePredictions = TRUE))
  },
  # Fitness is the resampled performance metric reported by train()
  fitness = function(object, x, y) {
    caret:::getTrainPerf(object)[, paste("Train", object$metric, sep = "")]
  },
  predict = function(object, x) {
    predict(object, newdata = x)
  }
)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

svmpsofit <- psofs(x = data[, -1],
                   y = data[, 1],
                   iterations = 300,
                   functions = caretPSO2,
                   verbose = TRUE,
                   parallel = TRUE)

stopCluster(cl)

PS: You mentioned that I shouldn't parallelize at both levels, and I've read this in the caret documentation too. But as you can see in my code, I have. Does that hurt performance?
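
If double parallelism is the problem, I assume the fix would be something like this (just a sketch, turning off the inner level):

## Keep only the outer wrapper loop parallel so the worker count isn't
## effectively squared by the inner train() call
cl <- makeCluster(detectCores() - 1)     # leave a core for the master process
registerDoParallel(cl)

trControl <- trainControl(method = "cv",
                          number = 5,
                          allowParallel = FALSE)  # inner resampling stays sequential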

murtaza-nasir commented 6 years ago

OK, I think I may be wrong. It seems the numbers turn out fine, or rather better than expected. I reran a model with the same data with just 5-fold CV and it took 19 seconds, i.e. approximately 3 seconds per model.