h2oai / h2o-tutorials

Tutorials and training material for the H2O Machine Learning Platform
http://h2o.ai

H2O on a Multi-Node LSF Batch Submission System? #106

Open DevinRouth opened 5 years ago

DevinRouth commented 5 years ago

Hello,

I work in the Crowther Lab at ETH-Zürich, and we're starting to use H2O to crunch massive ecological datasets (1.2+ million rows with 75+ covariates) that we've collected. For computing, ETH uses an LSF system to manage the computational resources of the university clusters. When we recently submitted some of the large models, we realized that H2O wasn't using all of the nodes assigned to the job. From the university cluster support staff: "it appears that the [script] can not use multiple nodes at the same time... I suggest you check your program's documentation to see whether it is possible to run it with distributed memory, so it can use more lower-memory nodes".

While diagnosing the issue, I found an article on how to use multiple nodes on a SLURM system and another article on how to run H2O on multi-node clusters.

Essentially, I'm unsure whether these approaches would work with an LSF system, because the nodes used for each job are only assigned after the full program script has been submitted via Bash using batch submission functions. In other words, I don't know whether it's possible to access the IP addresses of the connected nodes before the program has been submitted to the cluster.

Has anyone else had any experience with H2O on LSF based clusters? Have I missed a critical or obvious step somewhere that would allow H2O to access/distribute the memory across all of the nodes?

Thanks so much!

Cheers, Devin Routh

tomkraljevic commented 5 years ago

No, there is currently no built-in support for LSF-based environments.

If there is a Spark-based way of running on LSF (sorry, I don't know), then you could try running Sparkling Water.

Otherwise, you will need to solve the "Cluster Formation" problem, where each of the nodes finds each other. H2O-3 does this for Hadoop by having a dedicated driver program. You can find the source code here:

You would either need to write something similar to that, or write a stub that collects the IP addresses of the worker nodes, distributes a flat file, distributes the jar file, and then starts each worker up.
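The stub approach above can be sketched as an LSF batch script. This is a hypothetical sketch, not a supported workflow: it assumes the standard LSF environment variable `LSB_HOSTS` (the list of hosts allocated to the job, one entry per slot), LSF's `blaunch` remote-execution helper, and an `h2o.jar` already present on a shared filesystem. The H2O flags used (`-flatfile`, `-port`, `-name`) are the standard H2O-3 multi-node options; the port number, memory size, and cluster name are placeholders.

```shell
#!/bin/bash
# Hypothetical LSF batch script: build an H2O flatfile from the hosts LSF
# assigned to this job, then start one H2O worker per unique host.

FLATFILE=flatfile.txt
PORT=54321   # assumed H2O port; any free port works

# LSB_HOSTS is set by LSF inside a running job (one hostname per slot).
# The fallback value here is only a demo so the script runs outside LSF.
LSB_HOSTS="${LSB_HOSTS:-nodeA nodeA nodeB}"

# De-duplicate slot entries into unique hosts, one "host:port" per line.
rm -f "$FLATFILE"
for host in $(echo "$LSB_HOSTS" | tr ' ' '\n' | sort -u); do
    echo "${host}:${PORT}" >> "$FLATFILE"
done

cat "$FLATFILE"

# Launch one H2O node per host; every node reads the same flatfile, which
# is how they find each other and form the cluster. (Commented out here
# since it only makes sense inside a real LSF allocation.)
# for host in $(echo "$LSB_HOSTS" | tr ' ' '\n' | sort -u); do
#     blaunch "$host" java -Xmx8g -jar h2o.jar \
#         -flatfile "$FLATFILE" -port "$PORT" -name h2o_lsf_cluster &
# done
# wait
```

Because the flatfile is built inside the job, after LSF has made the allocation, this sidesteps the problem of not knowing the node IPs at submission time.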