If you are using the default parameters of gafs, you are fitting a #%^&-ton of models:
External resampling: 5 x 20
Iterations: 10
Generations: 50
Internal resampling: 5 x 20
Tuning parameters: you didn't say, so T
In all, you are fitting 100^2 * 500 * T models. Some of this can be done in parallel, and you should be very careful not to use parallelism at both levels, so that you don't square the number of workers being spawned.
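For example, a minimal sketch of keeping parallelism at one level only, assuming a doParallel backend (the cluster size and resampling settings here are placeholders):

library(doParallel)
library(caret)

cl <- makeCluster(detectCores() - 1)  # leave one core for the master process
registerDoParallel(cl)

ga_ctrl <- gafsControl(functions = caretGA,
                       method = "cv",
                       number = 5,
                       allowParallel = TRUE,   # parallelize the outer resamples
                       genParallel = FALSE)    # don't also parallelize within a generation

tr_ctrl <- trainControl(method = "cv",
                        number = 5,
                        allowParallel = FALSE) # keep the inner model fits sequential

# fit <- gafs(x, y, gafsControl = ga_ctrl, method = "svmRadial", trControl = tr_ctrl)

stopCluster(cl)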
Thank you for your reply. The actual code I'm using is this:
library(caret)        # train, gafs, gafsControl, caretGA
library(doParallel)   # makeCluster, registerDoParallel

cl <- makeCluster(detectCores())
registerDoParallel(cl)

rbf_grid <- expand.grid(size = 12)   # single candidate for the RBF hidden-layer size

gacontrol <- gafsControl(method = "cv",
                         number = 5,
                         allowParallel = TRUE,   # parallel external resampling
                         genParallel = TRUE,     # parallel fitness evaluation within a generation
                         functions = caretGA)

trControl <- trainControl(method = "cv",
                          number = 5,
                          allowParallel = TRUE)  # parallel internal resampling

rbfgafit <- gafs(data[, -1], data[, 1],
                 popSize = 50,
                 iters = 100,
                 pcrossover = 0.7,
                 pmutation = 0.2,
                 gafsControl = gacontrol,
                 method = "rbf",                 # passed through to caret::train
                 maxit = 1000,
                 preProcess = c("scale", "center"),
                 trControl = trControl,
                 tuneGrid = rbf_grid)

stopCluster(cl)
I'm not using repeatedcv for the GA, so it's more like:
5 external
100 generations
50 population size
5 internal
1 tuning parameter
Which is 125,000 models. So compared to the 5 * 20 = 100 models (I don't know why I wrote 400 in the original post), it should take 125000 / 100 * 45 seconds, or about 15.6 hours. But it took 55 hours. With the exact same gafs setup, svmRadial took just 2.5 hours.
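Just to double-check that arithmetic in R:

# external folds * generations * population size * internal folds * tuning values
5 * 100 * 50 * 5 * 1          # 125,000 model fits
125000 / 100 * 45 / 3600      # ~15.6 hours, assuming ~45 s per 100 models from train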
I'm just wondering if some code in rbf fails to get parallelized when it is used within another function like gafs. For example, with particle swarm optimization, another wrapper feature-selection function that I changed to use caret::train, svmRadial took 50 minutes but rbf took 37.5 hours. (A per-model timing sketch follows the code below.)
caretPSO2 <- list(
  fit = function(x, y, ...) {
    # fit an RBF network via caret::train with 5-fold CV
    caret::train(x, y, method = "rbf", maxit = 2000,
                 preProcess = c("scale", "center"),
                 tuneGrid = expand.grid(size = 21),
                 trControl = caret::trainControl(method = "cv",
                                                 number = 5,
                                                 allowParallel = TRUE,
                                                 savePredictions = TRUE))
  },
  fitness = function(object, x, y) {
    # resampled value of the metric being optimized
    caret::getTrainPerf(object)[, paste("Train", object$metric, sep = "")]
  },
  predict = function(object, x) {
    predict(object, newdata = x)
  }
)
cl <- makeCluster(detectCores())
registerDoParallel(cl)

svmpsofit <- psofs(x = data[, -1],
                   y = data[, 1],
                   iterations = 300,
                   functions = caretPSO2,
                   verbose = TRUE,
                   parallel = TRUE)

stopCluster(cl)
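To check whether a single rbf fit is itself the slow step, independent of any wrapper, I could time one train() call per method outside the feature-selection loop. A rough sketch (assuming caret is loaded and, as above, data holds the response in column 1):

tc <- trainControl(method = "cv", number = 5, allowParallel = FALSE)

# one rbf fit, same settings as in the gafs call above
t_rbf <- system.time(
  train(data[, -1], data[, 1], method = "rbf", maxit = 1000,
        preProcess = c("scale", "center"),
        tuneGrid = expand.grid(size = 12),
        trControl = tc)
)

# one svmRadial fit for comparison
t_svm <- system.time(
  train(data[, -1], data[, 1], method = "svmRadial",
        preProcess = c("scale", "center"),
        tuneLength = 1,
        trControl = tc)
)

t_rbf["elapsed"]
t_svm["elapsed"]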
PS: You mentioned I shouldn't parallelize on both levels, and I've read this in the caret documentation too. But as you can see in my code, I have. Does that affect the performance adversely?
OK, I think I may be wrong. It seems the numbers turn out fine, or rather better than expected. I reran a model on the same data with just 5-fold CV and it took 19 seconds, i.e. roughly 4 seconds per model.
I have been training a few different models, and it seems that different methods take wildly different training times depending on the training function used. For example, training an svmRadial model using train with 5 x 20-fold repeated cross-validation and doParallel takes about 14 seconds; doing the same with rbf takes around 25 seconds.

With gafs, 5-fold internal and external cross-validation, and doParallel, svmRadial took 2.5 hours, which is about what you would expect when comparing the 5 x 20 = 100 models completed by train with the population size and iterations used by gafs. But rbf took 2 days and 7 hours with gafs, which is much, much more than expected relative to train. Similarly, with a custom implementation of particle swarm optimization that uses caret::train to fit the models, svmRadial took 50 minutes (again, roughly what the PSO parameter space would suggest), but rbf has been running for more than 30 hours and still hasn't completed.

Can anyone provide pointers on how to optimize training speeds, or on whether some methods work better than others within these training methods? My setup, data type, and code are provided here: https://github.com/topepo/caret/issues/805.
Thanks, Murtaza