h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Dump/Restore whole cluster state with one command #10239

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Scenario: you shut down the cluster, perhaps overnight, but want to start it up again tomorrow with all the models and data still there. Or perhaps, having built your first model, you realize 4 nodes is too slow and you want to restart the cluster with 8 nodes.

Problem: going through all the loads, any data munging, rebuilding models, etc. to set up a new session takes time. Yes, we can do exportFile and saveModel, but that creates lots of little files, and it takes quite a lot of effort to load them all back in again, with lots of room for human error.
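
For contrast, here is a minimal sketch of today's per-object workaround, using the existing Python API (`h2o.export_file`, `h2o.save_model`, `h2o.import_file`, `h2o.load_model`); the toy frame, model, and paths are placeholders, not anything from this issue:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Placeholder data and model so the sketch is self-contained.
frame = h2o.H2OFrame({"x": list(range(10)), "y": [v * 2.0 for v in range(10)]})
model = H2OGradientBoostingEstimator(ntrees=5, min_rows=1)
model.train(x=["x"], y="y", training_frame=frame)

# Saving today: one call per frame and one call per model -> lots of little files.
h2o.export_file(frame, path="/backups/frame.csv", force=True)   # path is a placeholder
model_path = h2o.save_model(model, path="/backups/models", force=True)

# Tomorrow, on a fresh (possibly resized) cluster, everything is reloaded by hand,
# one object at a time.
frame = h2o.import_file("/backups/frame.csv")
model = h2o.load_model(model_path)
```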

Goals: be able to do this with one simple command (from Flow, or from R/Python). It would produce a single compressed binary file with all models, data, and any other state, so that loading it back in is as quick as it possibly can be.

(One binary file with data, one binary file with models, might be even better - a user might want to restore just one or the other.)
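
To make the goal concrete, here is a hypothetical sketch of what such a command could look like from Python. `h2o.dump_cluster` and `h2o.restore_cluster` do not exist in H2O today; the names, arguments, file layout, and paths are all assumptions.

```python
import h2o

h2o.init()

# HYPOTHETICAL API -- h2o.dump_cluster / h2o.restore_cluster do not exist in H2O today.
# One command writes all frames, models, and other state to compressed binary file(s).
h2o.dump_cluster("/backups/cluster_state.bin")

# Possible variant: one file for data and one for models, so that either can be
# restored on its own.
h2o.dump_cluster(frames_path="/backups/frames.bin",
                 models_path="/backups/models.bin")

# Later, possibly on a cluster with more (or fewer) nodes, one command restores it all.
h2o.restore_cluster("/backups/cluster_state.bin")
```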

To discuss: what happens with models that are in the process of being built? (I'd suggest taking a snapshot, but when reloaded the model building would not automatically restart.)

To discuss: what happens if a data file is currently being imported, or a data frame is being merged, etc.? Ideas: wait for the imports to complete and then run the binary dump; pause the imports, run the dump, then restart the imports afterwards; or tell the user and refuse to run.

Bonus: a dry-run option, which will just tell you how big the disk file(s) will be.

Bonus: an encryption option, using a gpg key.

Bonus: some filters, to either include or exclude certain frames and certain models (sketched below, together with the dry-run and encryption options).

Bonus: cope gracefully with restoring to a different-sized cluster, whether it has more nodes or fewer (and whether the nodes have a different number of cores or a different memory size). (In fact I believe this would end up being the primary use case.)

Bonus: if the data won't fit on the cluster, tell the user immediately, not 10 minutes into the import process!
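
A hypothetical sketch of how the dry-run, encryption, and filter bonuses above might surface as options on the same (hypothetical) dump command; every name here is illustrative only.

```python
# HYPOTHETICAL options on the hypothetical dump command -- all names are illustrative.

# Dry run: only report how big the dump file(s) would be.
h2o.dump_cluster("/backups/cluster_state.bin", dry_run=True)

# Encrypted, filtered dump: keep only some frames, drop one model.
h2o.dump_cluster(
    "/backups/cluster_state.bin",
    gpg_key="user@example.com",        # encrypt the dump with this gpg key
    include_frames=["train", "test"],  # only these frames...
    exclude_models=["first_gbm"],      # ...and every model except this one
)
```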

(This feature request is slightly related to #1164, but I think it is fine for this particular feature to be H2O version specific.)

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3324
Assignee: New H2O Bugs
Reporter: Darren Cook
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A