h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Real YARN port of H2O Epic #13627

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

A partial list of work that needs to be done.

DEVELOPMENT

Yarn Client

[ ] Simple YARN client that can get resource and queue info from the RM and print it out
[ ] Queue math to figure out a command line that might work. This is just a best guess.
[ ] Add a better error message to the existing h2o-dev YARN client (without any functional changes) to print info about resource acquisition failure
[ ] Job works when Kerberos is in place
[ ] Launch AM
[ ] Handle Ctrl-C shutdown
[ ] Handle H2O clustering and mapper->driver messages
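The "queue math" item could look roughly like the sketch below. It is only a guess at the arithmetic: `QueueInfo`, `guess_launch`, and `suggest_command` are hypothetical names, the 10% overhead is an assumed default, and the command line is modeled on the existing `h2odriver.jar` launcher options (`-nodes`, `-mapperXmx`, `-output`) rather than on whatever the new client ends up accepting.

```python
from dataclasses import dataclass


@dataclass
class QueueInfo:
    """Hypothetical snapshot of what the RM reports for one queue."""
    available_mb: int       # memory currently free in the queue
    max_container_mb: int   # largest single container the scheduler allows
    min_allocation_mb: int  # requests are rounded up to a multiple of this


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def guess_launch(queue, wanted_heap_mb, overhead_percent=10):
    """Best guess at (nodes, heap_mb, container_mb), or None if nothing fits."""
    # Largest heap that still fits in one container once overhead is added.
    max_heap_mb = queue.max_container_mb * 100 // (100 + overhead_percent)
    if max_heap_mb <= 0 or wanted_heap_mb <= 0:
        return None
    nodes = ceil_div(wanted_heap_mb, max_heap_mb)
    heap_mb = ceil_div(wanted_heap_mb, nodes)
    # Container = heap + overhead, rounded up to the scheduler's increment.
    raw_mb = ceil_div(heap_mb * (100 + overhead_percent), 100)
    container_mb = ceil_div(raw_mb, queue.min_allocation_mb) * queue.min_allocation_mb
    if container_mb > queue.max_container_mb:
        return None
    if nodes * container_mb > queue.available_mb:
        return None
    return nodes, heap_mb, container_mb


def suggest_command(queue, wanted_heap_mb, output_dir):
    """Format the guess as a command line (assumed h2odriver-style options)."""
    guess = guess_launch(queue, wanted_heap_mb)
    if guess is None:
        return None
    nodes, heap_mb, _ = guess
    return f"hadoop jar h2odriver.jar -nodes {nodes} -mapperXmx {heap_mb}m -output {output_dir}"
```

Returning `None` when the request cannot fit gives the client a natural place to print the resource acquisition failure message instead of submitting a job that would hang.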

Application Master

[ ] Figure out how to write the AM log to the proper place
[ ] Check for YARN properties that are set too low
[ ] Make container resource request
[ ] AM web page reachable from the RM Web UI (with buttons to look at stdout/stderr logs)
[ ] Launch containers
[ ] Handle launch failure due to lack of resources
[ ] (Steady-state) Handle container failure
[ ] (Steady-state) Handle RM failure (possibly by logging and killing the job)
[ ] (Steady-state) Heartbeat thread
[ ] Handle shutdown
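For the heartbeat thread and shutdown items, the control flow might be sketched as follows. This is a language-neutral illustration in Python, not the AM itself: the `Heartbeat` class and its `beat` callback are hypothetical, and in a real Java AM the callback would presumably be the periodic `AMRMClient.allocate()` call. A failing beat is recorded and ends the loop, which matches the "log and kill the job" idea for RM failure.

```python
import threading


class Heartbeat:
    """Calls `beat` every `interval_s` seconds until stopped or until it fails."""

    def __init__(self, beat, interval_s):
        self._beat = beat
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="am-heartbeat", daemon=True
        )
        self.failed = None  # set to the exception if a beat raises

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns True once stop() is called, which ends the loop
        # promptly instead of sleeping out the rest of the interval.
        while not self._stop.wait(self._interval_s):
            try:
                self._beat()
            except Exception as exc:  # e.g. the RM became unreachable
                self.failed = exc
                return

    def stop(self):
        """Request shutdown and wait for the thread to exit."""
        self._stop.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()
```

Using an event rather than a plain sleep means shutdown does not have to wait for a full heartbeat interval to elapse.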

Container

[ ] Figure out how to write the container log to the proper place
[ ] Figure out where ice_root should go (container local dir)
[ ] Set up EmbeddedH2O object
[ ] Handle H2O clustering mapper->driver messages
[ ] Start H2O
[ ] HDFS works when Kerberos is in place
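One possible answer to the ice_root question, sketched under assumptions: YARN exports a comma-separated `LOCAL_DIRS` environment variable inside each container, so the first entry could serve as the base directory. The `pick_ice_root` helper, the `h2o-ice` subdirectory name, and the temp-dir fallback are all hypothetical choices for illustration.

```python
import os
import tempfile


def pick_ice_root(env=None, subdir="h2o-ice"):
    """Choose an ice_root under the first container-local dir, if any."""
    env = os.environ if env is None else env
    # LOCAL_DIRS is a comma-separated list; ignore empty entries.
    local_dirs = [d.strip() for d in env.get("LOCAL_DIRS", "").split(",") if d.strip()]
    # Fall back to the system temp dir when not running inside a container.
    base = local_dirs[0] if local_dirs else tempfile.gettempdir()
    return os.path.join(base, subdir)
```

The resulting path would then be passed to H2O as its `-ice_root` setting when the node is started.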

TESTING

[ ] Test on CDH5.2
[ ] Test on CDH5.3
[ ] Test on HDP2.1
[ ] Test on HDP2.2
[ ] Test on MapR3
[ ] Test on MapR4

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-635
Assignee: New H2O Bugs
Reporter: Tom Kraljevic
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A