mdancho84 opened this issue 3 years ago
Hi Matt,
Thanks for getting in touch! I would definitely welcome collaboration. Some background: I started h2oparsnip because I wanted to mix h2o models into a workflow that might not be centred entirely on h2o. However, the questions that have arisen since mostly concern striking the right balance between using data efficiently within the cluster and still working with the tidymodels approach. I made the assumption (based on my own use cases) that the data won't reside exclusively in the cluster, because otherwise you should probably be using h2o directly. parsnip also doesn't currently accept H2OFrames (although I guess it could, since it already supports Spark).
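To give a sense of the drop-in usage I mean, here is a minimal sketch (assuming the engine is registered as "h2o" and that boost_tree is one of the wrapped models; treat the details as illustrative):

```r
library(tidymodels)
library(h2oparsnip)
library(h2o)

h2o.init()

# An h2o model used like any other parsnip engine; the data here is an
# ordinary data frame, not an H2OFrame (engine name is an assumption).
spec <- boost_tree(trees = 100) %>%
  set_engine("h2o") %>%
  set_mode("regression")

wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(spec)

fit(wf, data = mtcars)
```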
Max Kuhn brought up the drawback of tuning within the tune infrastructure, which, if working on a remote cluster, would lead to lots of back and forth with the data. Although I typically work with data in the same location, occasionally I connect to a remote cluster and it does add overhead. Because of this, there is an experimental tune_grid_h2o function that moves data into the cluster once and accepts/returns a resamples object that can otherwise be used in tune to finalize parsnip models etc. I've been working on other things over the past few months and haven't given it enough thought, but there are trade-offs either way. First, to minimize data transfer, the scoring also has to occur in the cluster, which restricts the metrics to those that h2o supports. Also, you cannot tune recipe parameters. So potential options to move forward include:
1. Finish tune_grid_h2o and/or possibly extend it so that it supports recipe parameters while attempting to minimize the back-and-forth, accepting that there will be some movement of data to the cluster.
2. Ask for parsnip to accept H2OFrames, and/or use the h2o.grid function inside each model specification so that any hyper_params supplied as engine arguments are used for tuning automatically (see the sketch after this list). This way the data can stay entirely within the cluster and it requires almost no work for the package; however, you obviously can't tune recipes, nor control the resampling like in tune. It will also be awkward to select the best model other than via the default metric. Most problematic: if someone tunes the model this way but also tunes a recipe using tune, then the resampling scheme will not be correct, so I feel that (1) is better.
3. Managing data in the cluster in general. When you use h2o via parsnip as a drop-in replacement, it's easy to forget about managing the cluster and removing model clutter, particularly with tuning. To some extent, this could be partially managed in (1) via some control options specifying whether predictions and/or models are removed or retained.
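To make option (2) a bit more concrete, this is roughly the interface I have in mind (purely hypothetical; a hyper_params engine argument does not exist in the package today):

```r
library(tidymodels)
library(h2oparsnip)

# Hypothetical: hyper_params passed through set_engine() would be handed
# to h2o.grid() inside the cluster, so only the grid definition moves,
# not the data. None of this is implemented yet.
spec <- boost_tree() %>%
  set_engine(
    "h2o",
    hyper_params = list(
      max_depth = c(3, 5, 7),
      ntrees    = c(50, 100)
    )
  ) %>%
  set_mode("regression")
```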
The other aspect is the automl, which I haven’t used much, so other than the basic model specification I don’t have a good understanding of use cases, and particularly how you would want to use automl within a more composable workflow like tidymodels. So more work on that aspect in particular, or even just discussion on use cases, would also be really welcome.
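For reference, the basic specification I'm referring to is roughly a direct h2o.automl() call like the one below (a minimal sketch; how it should compose with recipes and workflows is the open question):

```r
library(h2o)

h2o.init()

# Data assumed to already be in the cluster as an H2OFrame.
train <- as.h2o(mtcars)

aml <- h2o.automl(
  y                = "mpg",
  training_frame   = train,
  max_models       = 10,
  max_runtime_secs = 60
)

# Ranked models trained by AutoML.
aml@leaderboard
```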
I’m not sure how this fits into your plans but I’d appreciate any discussion and thoughts!
Steve
Hey Steve,
Thanks for getting back to me. Yeah, when I read this I immediately thought about the challenges of H2OFrame vs data frame and the expensive data movement/transfer headaches that can result. I feel like simplifying first, and tackling the tougher problems down the road, might be a better approach, especially if we need support from tidymodels (Max et al.).
H2O AutoML is one of h2o's greatest features. It creates many models and reduces the need for manual tuning. That can be a big benefit here, because most of the data transfer comes from data frame vs H2OFrame conversions moving data in and out of the cluster during tuning.
My gut is telling me to start here. It's just so powerful and easy to use, and it removes much of the need for tuning.
It looks like you have the heavy lifting done here. I'll take a deeper look to see if there's anything that I can add.
This is where data transfers will get expensive. You're right: rather than transferring the data, possibly look to transfer only the tuning parameters. This could be challenging, but it would limit the data transfers during tuning.
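Something along these lines is what I'm picturing (just a sketch with placeholder names, not a proposal for the actual API):

```r
library(h2o)

h2o.init()

# Move the data into the cluster once...
train <- as.h2o(my_training_data)   # `my_training_data` is a placeholder

# ...then send across only the small grid of tuning parameters; the
# models are trained and scored without the data leaving the cluster.
grid <- h2o.grid(
  algorithm      = "gbm",
  y              = "outcome",        # placeholder response column
  training_frame = train,
  hyper_params   = list(
    max_depth = c(3, 5, 7),
    ntrees    = c(50, 100)
  )
)

# Inspect the grid results ranked by a cluster-side metric.
h2o.getGrid(grid@grid_id, sort_by = "rmse")
```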
Hi @stevenpawley,
Nice looking package! I'm very interested in this. I wanted to see if I can help progress your development efforts.
Background
I wanted to reach out as the developer of modeltime (a forecasting ecosystem based on tidymodels) and an educator at Business Science, where I'm forming a small team of students to help with software development efforts.

Our next project is combining Modeltime (leverages tidymodels) with H2O. It appears that you have covered a lot of the heavy lifting in your h2oparsnip package.

I wanted to see if there is an opportunity to collaborate. We are looking at making a smaller, focused package that uses the h2o automl algorithm as a forecasting tool.

Next Steps
1. See h2oparsnip make its way onto CRAN. We can help with this.
2. Get involved in the h2oparsnip development process.