tidymodels / workflowsets

Create a collection of modeling workflows
https://workflowsets.tidymodels.org/

add support for workflow_future_map #80

Open yonicd opened 2 years ago

yonicd commented 2 years ago

Right now, workflow_map() runs sequentially over the rows of the workflow set object and only allows parallelization within a single model's tuning. For users with access to HPC, it would be great to have a way to set a plan so that each row of the workflow set is sent to its own worker and runs independently there.

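library(future)
# batchtools_MY_HPC is a placeholder for an HPC backend (e.g. from {future.batchtools}); N is the number of workers per node
plan(list(batchtools_MY_HPC, tweak(multisession, workers = N)))

juliasilge commented 2 years ago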

Is that preferred over running sequentially over the workflow sets and then parallelizing each individual workflow? I'm not sure I see why (but I am not an HPC expert).

yonicd commented 2 years ago

In the setup of workflowsets the models in the tibble are independent.

If there are compute resources that can accommodate running them in parallel, then that would be a preferred option to save time.

For example, if I have N models and a nested CV setup with K|P (outer|inner) layers, then I have N × K × P models to run. Even with just 2 models, 3 outer and 100 inner resamples, the number of models to run inflates very fast. It would be efficient to parallelize beyond just the inner loop that tunes a given row of specifications.
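To make the arithmetic concrete, here is a minimal sketch in base R that enumerates the fits for the example sizes mentioned above. The variable names are illustrative only, and the count ignores any tuning-grid candidates, which would multiply it further.

```r
# Sizes taken from the example in the comment above.
n_models <- 2    # rows in the workflow set
k_outer  <- 3    # outer resamples
p_inner  <- 100  # inner resamples per outer resample

# Every combination of model, outer resample, and inner resample is one fit.
fits <- expand.grid(
  model = seq_len(n_models),
  outer = seq_len(k_outer),
  inner = seq_len(p_inner)
)

n_fits <- nrow(fits)
n_fits  # 600
```

Each row of `fits` is independent of the others, which is why parallelizing only the inner loop leaves the model and outer-resample dimensions running one after another.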

topepo commented 2 years ago

There's definitely a good use case here but, right now, our implementation doesn't do what you want.

Personally, I think that it is a little risky. Having potentially long-running parallel jobs both between and within machines might lead to cases where something goes wrong and you lose the whole thing. I think that we've made workflow sets pretty fault tolerant, but we haven't tried anything like this.

In the past, I would generate separate scripts per model and send them off to the queuing system.

Would you like to make a PR (@simonpcouch is that ok)? We don't have hardware for your use case so you would have to do testing across machines.

yonicd commented 2 years ago

Thanks for the feedback. My setup currently integrates @wlandau's {targets} with {workflowsets}, mapping over the different models in the workflow set, so I don't run into the problem of losing partially successful runs.

I'd be happy to add a PR to show what my intuition for implementing a {furrr} based version of workflow_map would look like, where the fallback default would be plan(sequential) which is basically what there is now.
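A rough sketch of that idea, assuming {future} and {furrr} are installed. `future_map_rows()` and `fit_one` are hypothetical names and not part of {workflowsets}; the list of numbers stands in for the rows of a workflow set, and in a real version `fit_one` would tune or fit a single workflow.

```r
library(future)
library(furrr)

# Hypothetical helper (not a workflowsets function): apply `fit_one` to each
# element of `rows` using whatever future plan the user has set.
future_map_rows <- function(rows, fit_one, ...) {
  furrr::future_map(
    rows,
    fit_one,
    ...,
    .options = furrr::furrr_options(seed = TRUE)  # parallel-safe RNG streams
  )
}

# Fallback default: with a sequential plan this behaves like an ordinary map.
plan(sequential)

rows <- list(a = 1:3, b = 4:6)  # stand-in for workflow set rows
res <- future_map_rows(rows, function(x) sum(x))
```

Switching to `plan(multisession)` or an HPC backend before the call would send each element to a worker without changing the helper itself.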

yonicd commented 2 years ago

> Personally, I think that it is a little risky. Having potentially long-running parallel jobs both between and within machines might lead to cases where something goes wrong and you lose the whole thing. I think that we've made workflow sets pretty fault tolerant, but we haven't tried anything like this.

This is closely related to an issue I opened a while back in {furrr}, which has a weak spot when it comes to accommodating failed elements: https://github.com/DavisVaughan/furrr/issues/64
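One common workaround for failed elements, sketched here in base R, is to wrap the per-element function so that an error is returned as a value rather than aborting the whole map. `capture_failure()` and `fit_one()` are hypothetical names, and `lapply()` stands in for a parallel map such as `furrr::future_map()`; this is not how {furrr} or {workflowsets} handle failures internally.

```r
# Wrap a function so that an error becomes part of the return value
# instead of stopping the whole map.
capture_failure <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = conditionMessage(e))
    )
  }
}

# Toy stand-in for fitting one model; the second element fails on purpose.
fit_one <- function(x) {
  if (x == 2) stop("model 2 failed")
  x * 10
}

# lapply() stands in for a parallel map here.
out <- lapply(1:3, capture_failure(fit_one))

# Flag which elements failed so the successful ones can still be used.
failed <- vapply(out, function(o) !is.null(o$error), logical(1))
```

The successful results survive even though one element failed, and `failed` shows which rows would need to be rerun.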

mglev1n commented 2 years ago

> Thanks for the feedback. My setup currently integrates @wlandau's {targets} with {workflowsets}, mapping over the different models in the workflow set, so I don't run into the problem of losing partially successful runs.
>
> I'd be happy to add a PR to show what my intuition for implementing a {furrr} based version of workflow_map would look like, where the fallback default would be plan(sequential) which is basically what there is now.

Not sure if there's been any progress on this, but would be a nice feature.

@yonicd - If not, I also use the {targets} package, and was curious if you'd share your implementation? Presumably you're mapping over each of the workflow objects contained in the info column (using either dynamic or static branching) returned by workflowsets::workflow_set()?

Apologies if this is better suited for discussion in the {targets} repo.

simonpcouch commented 1 year ago

Similar request on SO.

simonpcouch commented 5 months ago

If this ever comes to the top of our to-do list, it's worth reading "Nested parallelism and protection against it" in Bengtsson (2021).
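As a small illustration of that protection, assuming {future} is installed and a machine that can start two background R sessions: when the plan specifies only one level, a future created inside another future falls back to sequential processing, so code running on a worker sees a single worker of its own.

```r
library(future)

# Only one level of parallelism is specified here.
plan(multisession, workers = 2)

outer_workers <- nbrOfWorkers()                  # as seen from the main session
inner_workers <- value(future(nbrOfWorkers()))   # as seen from inside a worker

# Restore the default plan and shut down the background sessions.
plan(sequential)
```

Running both levels in parallel requires declaring both explicitly, for example with a two-element `plan(list(...))`, which is what the original request proposes.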