parallelizing preprocess

topepo / caret

caret (Classification And Regression Training) is an R package that contains miscellaneous functions for training and plotting classification and regression models

http://topepo.github.io/caret/index.html


parallelizing preprocess #449

Open spedygiorgio opened 8 years ago

spedygiorgio commented 8 years ago

It could be useful to add a parallel backend to preProcess. Its operations can be parallelized across columns, which would help in both estimation and prediction, especially when the data set has many features.
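A minimal sketch of what column-wise parallelism could look like, using only base R's parallel package. This is not how caret implements preProcess; the helpers `col_stats`, `estimate_center_scale`, and `apply_center_scale` are hypothetical names introduced for illustration, and only centering and scaling are covered.

```r
library(parallel)

# Hypothetical per-column worker: mean and standard deviation of one column.
# Defined at top level so that only the column itself is shipped to a worker.
col_stats <- function(col) {
  c(mean = mean(col, na.rm = TRUE), sd = stats::sd(col, na.rm = TRUE))
}

# Hypothetical helper (not part of caret): estimate the statistics one column
# at a time, spreading the columns over `workers` background R processes.
estimate_center_scale <- function(x, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl))
  per_column <- parLapply(cl, as.list(as.data.frame(x)), col_stats)
  do.call(rbind, per_column)  # one row per column, with columns "mean" and "sd"
}

# Hypothetical helper: apply previously estimated statistics to (new) data.
apply_center_scale <- function(x, cs) {
  scale(as.matrix(x), center = cs[, "mean"], scale = cs[, "sd"])
}

x <- as.matrix(iris[, 1:4])
cs <- estimate_center_scale(x)
x_scaled <- apply_center_scale(x, cs)
```

With only four columns the cost of starting workers will likely outweigh any gain; the approach is only plausible for wide data sets, which is the case raised here.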

topepo commented 8 years ago

The main issue with doing this is that it would multiply the number of workers used. For example, if you requested M cores, most parallel processing technologies would end up using M² workers, because each of the M outer workers would start M workers of its own through the nested calls.

This also happens with some models that can be run in parallel (e.g. ranger). That's generally why I have avoided it.
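The multiplication can be illustrated with a small sketch based on base R's parallel package. It is not code from caret; M = 2 is an arbitrary, deliberately small value, and the function `start_inner` is a hypothetical stand-in for a step that starts its own workers.

```r
library(parallel)

M <- 2  # workers requested at each level (kept small on purpose)

# Hypothetical task run on each outer worker: it starts M workers of its own
# and returns their process IDs.
start_inner <- function(i, M) {
  inner_cl <- parallel::makeCluster(M)
  on.exit(parallel::stopCluster(inner_cl))
  unlist(parallel::clusterCall(inner_cl, Sys.getpid))
}

outer_cl <- makeCluster(M)
inner_pids <- parLapply(outer_cl, seq_len(M), start_inner, M = M)
stopCluster(outer_cl)

# Each of the M outer workers launched M inner workers.
n_inner_started <- length(unlist(inner_pids))
```

Here `n_inner_started` equals M * M, so a request for M cores at both levels launches M² inner processes on top of the M outer ones.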

I have some changes upcoming to preProcess that might mitigate some of these issues; you can pick subsets of predictors for specific methods (instead of having to do them all).

HenrikBengtsson commented 7 years ago

FYI, the doFuture backend, or rather the future framework underneath it, automatically protects against nested parallelism that would otherwise "blow up", so using doFuture would be safe in this sense. Moreover, users who have access to compute clusters can take advantage of nested processing by specifying an explicit nested future strategy, e.g.

library("future.batchtools")
# Outer level: jobs submitted to an SGE cluster; inner level: parallel workers on each node
plan(list(batchtools_sge, multisession))
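The protection against nested parallelism can be checked locally with a small sketch that assumes only the future package is installed. With a single-level plan, futures created inside a worker should fall back to sequential processing; the variable name `inner_workers` is introduced here for illustration.

```r
library(future)

# One level of parallelism: two background R sessions.
plan(multisession, workers = 2)

f <- future({
  # Evaluated on an outer worker: ask how many workers a nested future
  # created here would have at its disposal.
  future::nbrOfWorkers()
})
inner_workers <- value(f)

# Restore the default so no background sessions are left running.
plan(sequential)
```

Under this plan `inner_workers` should be 1, meaning that nested futures run sequentially unless a second level is requested explicitly, as in the nested strategy above.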