tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

parallelization #48

Open topepo opened 7 years ago

topepo commented 7 years ago

There are some places where the operations might be costly and are embarrassingly parallel. A parallel option might be a good idea here, but it might bite the user if the recipe is processed inside of another worker process, causing an exponential number of threads/workers/processes.

A similar issue exists inside of caret: topepo/caret#449.

kylegilde commented 3 years ago

Hi Max, I love the recipes package, but I was wondering if there is any possibility of parallelizing its operations. Thanks

topepo commented 3 years ago

The steps cannot be parallelized across one another since they run serially. Within a step it is possible, but there are few steps that are computationally expensive enough to warrant it. Some of those already do (e.g., those that use Stan, xgboost, and so on).

In general, unless you are fitting a single model/recipe, it is better to parallelize the loops around the recipe prep and model fit (that's what tune does).
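
As a concrete illustration of that pattern (a minimal sketch, not from the original comment; mtcars, the model spec, and the worker count are stand-ins), register a parallel backend and let tune parallelize across the resamples:

library(tidymodels)
library(doParallel)

cl <- makePSOCKcluster(2)   # two workers; adjust to your machine
registerDoParallel(cl)

folds <- vfold_cv(mtcars, v = 5)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric(), -all_outcomes())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# each resample preps the recipe and fits the model on a worker
res <- fit_resamples(wf, resamples = folds)

stopCluster(cl)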

kylegilde commented 3 years ago

I would like to parallelize the column prepping & baking within a step.

Below is a small step_window() example of what I'm trying to do on a significantly larger dataset. I was hoping there was a way to pass each numeric column to its own thread. In general, is there anything I can do to speed up a single step that is applied to multiple columns?

Thank you!

library(nycflights13)
library(recipes)

system.time({

  rec <-
    flights %>%
    recipes::recipe(as.formula(" ~ .")) %>%    # initialize a recipe (a set of preprocessing instructions)
    recipes::step_window(all_numeric(),
                         size = 11111,
                         role = "predictor") %>% 
    recipes::prep()

})
#>    user  system elapsed 
#> 263.517   0.000 263.560

system.time({
  transformed_df <- recipes::bake(rec, new_data = flights)
})

#>    user  system elapsed 
#> 262.812   0.000 262.847
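
One workaround sketch (not a recipes feature; assumes future.apply, plus RcppRoll, which the thread notes step_window() uses under the hood) is to run the embarrassingly parallel column loop outside of the recipe, rolling each numeric column on its own worker:

library(nycflights13)
library(future.apply)
library(RcppRoll)

plan(multisession)   # one background R session per core

num_cols <- names(flights)[vapply(flights, is.numeric, logical(1))]

# roll each numeric column on its own worker; fill = NA at the edges,
# which differs slightly from step_window()'s padding behavior
rolled <- future_lapply(flights[num_cols], function(x) {
  roll_mean(x, n = 11111, fill = NA, align = "center", na.rm = TRUE)
})

flights[num_cols] <- rolled
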
DavisVaughan commented 3 years ago

If this is mainly about step_window() being slow, I agree that it is slow for wide window sizes, as seen here. RcppRoll is used under the hood, and it fully evaluates each window, which is fast when the windows are narrow but very slow as they get wide.

I just added specialized window functions to slider, like slide_mean() and slide_sum(). They are in the development version right now, but they are much faster than RcppRoll with wide windows. You could use those in combination with a step_mutate_at(). It doesn't do exactly the same thing as step_window(), but it is pretty close. You could also use data.table::frollmean(algo = "fast"), but be warned that there is a chance it could have some numerical instabilities if you have a very wide range of values.

By just using before (and not after), you can also create the "lagging" window you requested in https://github.com/tidymodels/recipes/issues/578

library(nycflights13)
library(recipes)
# devtools::install_github("DavisVaughan/slider")
library(slider)

rec <- flights %>%
  recipe(as.formula(" ~ .")) %>%                                            
  step_mutate_at(
    all_numeric(), 
    fn = ~slide_mean(
      x = .x, 
      before = 5555, 
      after = 5555, 
      complete = FALSE, 
      na_rm = TRUE
    )
  )

system.time({
  recipes::prep(rec, flights)
})
#>    user  system elapsed 
#>   0.814   0.022   0.838

Created on 2020-10-03 by the reprex package (v0.3.0.9001)
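
For completeness, a sketch of the data.table::frollmean() alternative mentioned above (assumptions: algo = "fast" trades some numerical robustness for speed, as warned; align = "right" would give the trailing, before-only "lagging" window):

library(data.table)
library(nycflights13)

dt <- as.data.table(flights)
num_cols <- names(dt)[vapply(dt, is.numeric, logical(1))]

# apply frollmean() to every numeric column in place;
# use align = "right" for a trailing ("lagging") window instead
dt[, (num_cols) := lapply(.SD, frollmean, n = 11111,
                          algo = "fast", align = "center", na.rm = TRUE),
   .SDcols = num_cols]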

kylegilde commented 3 years ago

@DavisVaughan , Thank you for your suggestions. I will investigate and will probably end up going with one of them.

@topepo , I'm wondering if the recipes package has any plans to cover these column-wise embarrassingly parallel processes. Thank you

UnclAlDeveloper commented 3 years ago

Julia, with regard to imputation steps, is it not possible to parallelize them by running a separate thread for each variable? You said there were issues, so I do understand there may be things I am not aware of.

topepo commented 3 years ago

The general problem in parallelizing these operations is related to how they are used. If you are going to fit a single model, it makes a lot of sense to parallelize them.

However, most of our infrastructure is centered around resampling (for very good reasons). If you are fitting more than once, parallelizing the details is a bad way to go (meaning small speed-ups). Additionally, if you accidentally parallelize multiple nested loops, you could exponentially increase the number of worker processes (which is bad, especially on Windows).

A lot of our choices are based on how much damage we might do if we enable potentially conflicting options. We might need to preclude* some things that cause issues to keep the whole system running smoothly.

We also balance implementation costs. For example, for bagged imputation, we use the ipred package, so we could not parallelize the bagging loop without re-writing the whole thing.



* Since it is all open-source, we don't really preclude anything. We have a lot of documentation on writing your own recipe steps and you can create a parallel version of something that we've put together. Parallelizing bagged imputation across variables would be a good example of this.

(edit - wrong words)
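
As a rough illustration of that footnote (a sketch only, built outside of recipes with future.apply and ipred; it skips the bookkeeping a real custom step would need), the per-variable bagging loop can be farmed out across workers:

library(future.apply)
library(ipred)

plan(multisession)

impute_bag_parallel <- function(df, nbagg = 25) {
  incomplete <- names(df)[vapply(df, anyNA, logical(1))]

  # fit one bagged-tree model per incomplete column, each on a worker
  fits <- future_lapply(incomplete, function(col) {
    train <- df[!is.na(df[[col]]), ]
    bagging(reformulate(setdiff(names(df), col), response = col),
            data = train, nbagg = nbagg)
  })
  names(fits) <- incomplete

  # fill in the missing values from each column's model
  for (col in incomplete) {
    miss <- is.na(df[[col]])
    df[[col]][miss] <- predict(fits[[col]], newdata = df[miss, ])
  }
  df
}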

UnclAlDeveloper commented 3 years ago

I use two steps with caret, and now tidymodels: the preprocessing step, which includes imputation, and the modelling step. I assume this is the way tidymodels will often be used, so the preprocessing step isn't resampled. BagImpute seems like a very powerful form of imputation, but it is also very slow. Just trying to impute on 1% of a single data source (so 26K rows of 20 variables) takes about half an hour. It would be 90 seconds in parallel, or 900 seconds if I did it on 10% of the data source, which is realistically what I will be using in the final model.

I understand that I could try to write it myself. I have written multiple custom recipes and parsnip models. The problem is that tidymodels is changing very quickly, and every time a new update comes out I am not sure whether my customisations will break. They have broken several times, and trying to debug the reason is quite difficult. Something that is not part of the core product is very difficult to maintain unless you have a deep understanding of the core product and knowledge of the changes being made to it. For example, just today my preprocessing step broke, ironically on step_bagimpute. I assume that the vars parameter has been deprecated or renamed, but code that worked a week ago stopped working today. It was fixed easily enough, but it is a reminder that designing code around a rapidly developing package is risky.

All that said, I do want to emphasise how much I appreciate what you and your team has achieved, and is achieving, both with caret and tidymodels.

topepo commented 3 years ago

I use two steps with caret, and now tidymodels: the preprocessing step, which includes imputation, and the modelling step. I assume this is the way tidymodels will often be used, so the preprocessing step isn't resampled.

So you are pre-imputing? If you are giving a recipe to caret or tune, it is being re-prepped (deliberately; there is no way to get around that).

If you are pre-imputing, you run a huge risk of the resampling statistics being artificially optimistic (perhaps by a lot). We advise avoiding this at all costs.
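
The safe pattern, sketched minimally here (the data and column names are illustrative; step_impute_bag() is the current name of step_bagimpute()), keeps the imputation inside the recipe so it is re-prepped within every resample:

library(tidymodels)

df <- mtcars
df$wt[sample(nrow(df), 5)] <- NA   # toy missingness

rec <- recipe(mpg ~ ., data = df) %>%
  step_impute_bag(wt, impute_with = imp_vars(disp, hp))

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# the imputation model is refit inside each fold, so its
# variability shows up in the resampling statistics
res <- fit_resamples(wf, resamples = vfold_cv(df, v = 5))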

But it is also very slow. Just trying to impute on 1% of a single data source (so 26K rows of 20 variables) takes about half an hour. It would be 90 seconds in parallel, or 900 seconds if I did it on 10% of the data source, which is realistically what I will be using in the final model.

That doesn't surprise me. Bagging always fits the deepest tree and, with that data size, the trees will be very large.

In other words, the computations can always be faster, but you have chosen an imputation model that is most susceptible to being inefficient (relative to other imputation methods). If efficiency is important, bagging is the wrong tool for the job here.

All that said, I do want to emphasise how much I appreciate what you and your team has achieved, and is achieving, both with caret and tidymodels.

Great to hear. Thanks!

UnclAlDeveloper commented 3 years ago

If you are pre-imputing, you run a huge risk of the resampling statistics being artificially optimistic (perhaps by a lot). We advise avoiding this at all costs.

Can you expand on this?

I appreciate that it would be ideal to include the imputation step in the resamples, but not practical because of the time it takes.

Bagimpute is really useful for highly correlated inputs, which is a lot of what I have (I use lots of protection against collinearity). If a linear model decides that y = x1 + x2, but x2 is missing, imputing x2 with the mean or zero would cause very poor predictions. Imputing x2 from x1 gives fairly reasonable results. If x2 is independent of x1, it usually gets predicted at its mean, so it doesn't really affect the y prediction.
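
A toy simulation (illustrative numbers, not from the thread) makes the point concrete: with highly correlated x1 and x2, mean imputation of a missing x2 badly degrades the prediction, while regression imputation roughly recovers it:

set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)    # x2 highly correlated with x1
y  <- x1 + x2 + rnorm(n, sd = 0.25)

fit     <- lm(y ~ x1 + x2)       # downstream model
imp_fit <- lm(x2 ~ x1)           # regression imputation model

# x2 is missing at prediction time for an observation with x1 = 2
new_mean <- data.frame(x1 = 2, x2 = mean(x2))
new_reg  <- data.frame(x1 = 2,
                       x2 = predict(imp_fit, data.frame(x1 = 2)))

predict(fit, new_mean)   # about 2, far from the truth of about 4
predict(fit, new_reg)    # about 4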

topepo commented 3 years ago

Can you expand on this?

Anything done outside of the resampling loop is essentially treated as deterministic. You can't measure the effect of something if you can't measure its changes.

A classic example is feature selection. If you do that outside of the resampling loop, there are no data to tell you whether the operation was a bad idea or not. For FS, see this and this (and google "feature selection bias").

Bagimpute is really useful for highly correlated inputs

Without data, I disagree.

Imagine two correlated numeric predictors. A tree will try to approximate that relationship using a series of step functions, whereas a linear regression does it with two parameters (with much lower variance).

recipes can do linear imputation. From what you describe, this seems like it would be more effective and much more computationally efficient.
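
A minimal sketch of that route (the current step is step_impute_linear(); the data and columns are stand-ins):

library(recipes)

df <- mtcars
df$wt[sample(nrow(df), 5)] <- NA   # toy missingness

rec <- recipe(mpg ~ ., data = df) %>%
  step_impute_linear(wt, impute_with = imp_vars(disp, hp)) %>%
  prep()

imputed <- bake(rec, new_data = NULL)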