tidymodels / tune

Tools for tidy parameter tuning
https://tune.tidymodels.org

Fit preprocessor just once with `tune_bayes` #955

Open asb2111 opened 3 weeks ago

asb2111 commented 3 weeks ago

Feature

Currently, it appears that tune_bayes recomputes the entire preprocessor during every iteration, even if the preprocessor has nothing to tune. This can lead to a substantial amount of unnecessary computation, since the preprocessor should only need to be executed once and could be reused for all iterations.
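A minimal sketch of the situation being described (object names are illustrative, not from the issue): the recipe has no `tune()` placeholders, only the model does, yet every candidate proposed by `tune_bayes()` re-preps the recipe on every resample.

```r
library(tidymodels)

# The recipe has nothing to tune...
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# ...only the model does.
spec <- rand_forest(min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Each Bayesian-optimization iteration re-fits the (unchanging)
# recipe on every resample before fitting the model.
res <- tune_bayes(wf, resamples = vfold_cv(mtcars, v = 5), iter = 10)
```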

topepo commented 3 weeks ago

This is an excellent point.

Once a candidate is created by the Gaussian process model, we pass that to tune_grid(). It fits the new candidate for each resample, makes appropriate predictions, and gets metrics.

We would have to make substantial changes to tune_grid() to skip the preprocessor. We'd also have to have it save each of the fitted models from the previous fit (assuming that the preprocessor is the same), take that as input, and then start the process at the point where the supervised model is trained (for each resample).

We'll have to think about whether there is a less invasive approach than the one described above (I don't think the approach above is feasible).

asb2111 commented 3 weeks ago

If there is nothing to tune in the preprocessor, could it be 'baked' prior to starting the tuning process altogether? Then the workflow gets modified to use the baked data and no preprocessor, the tuning is conducted, and then everything gets repackaged at the end? Maybe this is too much work for too little gain in a special case.
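A rough sketch of what this proposal might look like (names and structure are hypothetical, not an existing tune feature): prep the non-tunable recipe once, bake it, and tune a workflow that uses the baked data with a bare formula preprocessor.

```r
library(tidymodels)

# Prep and bake the non-tunable recipe exactly once, up front.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())
baked <- bake(prep(rec), new_data = NULL)

spec <- rand_forest(min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

# The tuned workflow sees only the baked data and a plain formula;
# no preprocessor is re-fit during tuning.
wf_baked <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(spec)

res <- tune_bayes(wf_baked, resamples = vfold_cv(baked, v = 5), iter = 5)
```

As the next comment points out, resampling the already-baked data this way lets the preprocessor see the whole training set, which leaks information unless a validation set is being used.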

topepo commented 3 weeks ago

Unless you are using a validation set, we would not want to fit the preprocessor on the entire training set and then fit the model on a potentially different data set (i.e., one that was a resample).

asb2111 commented 3 weeks ago

I'm thinking that if we are passing in resamples, we could bake the preprocessor on each resample in advance, or something like that.
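A sketch of the per-resample idea (illustrative only, not tune's internals): prep the recipe on each analysis set so the assessment data never informs the preprocessor, and cache the baked results before tuning starts.

```r
library(tidymodels)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

folds <- vfold_cv(mtcars, v = 5)

# Prep once per resample, on the analysis set only, and keep the
# baked analysis/assessment pairs for reuse across all iterations.
baked_splits <- lapply(folds$splits, function(split) {
  prepped <- prep(rec, training = analysis(split))
  list(
    analysis   = bake(prepped, new_data = NULL),
    assessment = bake(prepped, new_data = assessment(split))
  )
})
```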

asb2111 commented 3 weeks ago

Ok, I've tried hacking this together and I see why it won't work. The workflow expects the data in each resample to look the same (same columns, etc.), but if we preprocess each resample and glue them back together, each resample could have different columns, and that breaks things down the line.

FWIW, I brought all this up because I have a workflow that involves step_lincomb, which takes a very long time to run.
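step_lincomb() is also a concrete example of the column-mismatch problem above: it drops columns that are linear combinations of others, and which columns get dropped depends on the data the recipe is prepped on, so two resamples can yield different column sets. A small illustration (the duplicated column is made up for demonstration):

```r
library(recipes)

# Add an exact linear combination so step_lincomb() has something to drop.
df <- mtcars
df$disp2 <- 2 * df$disp

rec <- recipe(mpg ~ ., data = df) |>
  step_lincomb(all_numeric_predictors())

# One of disp / disp2 will be removed; which one (and, in general,
# which columns) depends on the data used to prep the recipe.
baked <- bake(prep(rec), new_data = NULL)
```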