h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Dump/Restore whole cluster state with one command #10239

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Scenario: you shut down the cluster, perhaps overnight, but want to start it up again tomorrow with all the models and data still there. Or perhaps, having built your first model, you realize 4 nodes is too slow and you want to restart the cluster with 8 nodes.

Problem: going through all the loads, any data munging, rebuilding models, etc. to set up a new session takes time. Yes, we can do exportFile and saveModel, but that creates lots of little files, and it takes quite a lot of effort to load them all back in again, with lots of room for human error.
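
For contrast, here is a minimal sketch of today's per-object workaround, using the existing Python API (`h2o.export_file`, `h2o.save_model`, `h2o.import_file`, `h2o.load_model`); the toy frame, model, and paths are placeholders, not anything from this issue:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Placeholder data and model so the sketch is self-contained.
frame = h2o.H2OFrame({"x": list(range(10)), "y": [v * 2.0 for v in range(10)]})
model = H2OGradientBoostingEstimator(ntrees=5, min_rows=1)
model.train(x=["x"], y="y", training_frame=frame)

# Saving today: one call per frame and one call per model -> lots of little files.
h2o.export_file(frame, path="/backups/frame.csv", force=True)   # path is a placeholder
model_path = h2o.save_model(model, path="/backups/models", force=True)

# Tomorrow, on a fresh (possibly resized) cluster, everything is reloaded by hand,
# one object at a time.
frame = h2o.import_file("/backups/frame.csv")
model = h2o.load_model(model_path)
```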

Goals: be able to do this with one simple command (from Flow, or from R/Python). It would produce a single compressed binary file with all models, data, and any other state, so that loading it back in is as quick as it possibly can be.

(One binary file with data, one binary file with models, might be even better - a user might want to restore just one or the other.)
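
To make the goal concrete, here is a hypothetical sketch of what such a command could look like from Python. `h2o.dump_cluster` and `h2o.restore_cluster` do not exist in H2O today; the names, arguments, file layout, and paths are all assumptions.

```python
import h2o

h2o.init()

# HYPOTHETICAL API -- h2o.dump_cluster / h2o.restore_cluster do not exist in H2O today.
# One command writes all frames, models, and other state to compressed binary file(s).
h2o.dump_cluster("/backups/cluster_state.bin")

# Possible variant: one file for data and one for models, so that either can be
# restored on its own.
h2o.dump_cluster(frames_path="/backups/frames.bin",
                 models_path="/backups/models.bin")

# Later, possibly on a cluster with more (or fewer) nodes, one command restores it all.
h2o.restore_cluster("/backups/cluster_state.bin")
```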

To discuss: what happens with models that are in the process of being built? (I'd suggest taking a snapshot, but when reloaded the model building would not automatically restart.)

To discuss: what happens if a data file is currently being imported, or a data frame is being merged, etc.? Ideas: wait for the imports to complete and then run the binary dump; pause the imports, run the dump, then restart the imports afterwards; or tell the user and refuse to run.

Bonus: a dry-run option, which will just tell you how big the disk file(s) will be.

Bonus: an encryption option, using a gpg key.

Bonus: some filters, to either include or exclude certain frames and certain models (sketched below, together with the dry-run and encryption options).

Bonus: cope gracefully with restoring to a different-sized cluster, whether it has more nodes or fewer (and whether the nodes have a different number of cores or a different memory size). (In fact I believe this would end up being the primary use case.)

Bonus: if the data won't fit on the cluster, tell the user immediately, not 10 minutes into the import process!
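
A hypothetical sketch of how the dry-run, encryption, and filter bonuses above might surface as options on the same (hypothetical) dump command; every name here is illustrative only.

```python
# HYPOTHETICAL options on the hypothetical dump command -- all names are illustrative.

# Dry run: only report how big the dump file(s) would be.
h2o.dump_cluster("/backups/cluster_state.bin", dry_run=True)

# Encrypted, filtered dump: keep only some frames, drop one model.
h2o.dump_cluster(
    "/backups/cluster_state.bin",
    gpg_key="user@example.com",        # encrypt the dump with this gpg key
    include_frames=["train", "test"],  # only these frames...
    exclude_models=["first_gbm"],      # ...and every model except this one
)
```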

(This feature request is slightly related to #1164, but I think it is fine for this particular feature to be H2O version specific.)

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3324
Assignee: New H2O Bugs
Reporter: Darren Cook
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A