h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Unable to run multiple H2O AutoML processes simultaneously #8378

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

As a machine learning engineer, I need to be able to run multiple H2O AutoML processes simultaneously. I get the error below when attempting to run multiple instances on the same H2O server, but this should be possible. I'm wondering if this is a namespace issue where the AutoML dataframe is being interrupted and I need to give one of the auto_ml variables below a unique name. Thank you. Here is my code and the result:

1) CODE:
{code:python}
# From utils.py file

# H2O Functions
from h2o.automl import H2OAutoML


def get_best_h2o_automl_model(train, test, valid, feature_col, y, excluded_algs):
    # Always exclude XGBoost, plus any algorithms the caller asks to skip
    auto_ml = H2OAutoML(exclude_algos=['XGBoost'] + excluded_algs, seed=1, max_runtime_secs=0)
    auto_ml.train(x=feature_col, y=y, training_frame=train, leaderboard_frame=test, validation_frame=valid)

    # Print the full leaderboard and the leader's performance on the test frame
    leaderboard = auto_ml.leaderboard
    model_performance = auto_ml.leader.model_performance(test)
    print(leaderboard.head(rows=leaderboard.nrows))
    print(model_performance)

    return auto_ml


# Run H2O automl to get the best model (excerpt from the calling script)

    auto_ml = utils.get_best_h2o_automl_model(train, test, valid, cat_feature_col + num_feature_col, y, args.exclude_algs)
    model = auto_ml.leader
    r2_value = round(model.r2(), 2)

{code}

2) RESULT:
{code:java}
Starting at 2020-01-30T13:45:04.326775-08:00
Initializing H2O...
Warning: if you don't want to start local H2O server, then use of h2o.connect() is preferred.
Checking whether there is an H2O instance running at http://pg-pt-wn01-010.gld.XX.net:54321 . connected.

H2O cluster uptime: 13 days 21 hours 46 mins
H2O cluster timezone: America/Los_Angeles
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.10
H2O cluster version age: 2 months and 23 days
H2O cluster name: root
H2O cluster total nodes: 2
H2O cluster free memory: 155.4 Gb
H2O cluster total cores: 64
H2O cluster allowed cores: 64
H2O cluster status: locked, healthy
H2O connection url: http://pg-pt-wn01-010.gld.XX.net:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 2.7.5 final

Parse progress: [#########################################################] 100%
Using working directory: /tmp/tmpBwkLsi
AutoML progress: [#######################################
ERROR - Process failed due to: Unexpected HTTP error: HTTPConnectionPool(host='pg-pt-wn01-010.gld.XX.net', port=54321): Max retries exceeded with url: /3/Jobs/$0301646e18b032d4ffffffff$_9f388873e9f331e6cdc1d77bf9544a48 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f92f57d18d0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Traceback (most recent call last):
  File "/usr/pic1/repos/ml-models-all/bdaMlScripts/h2o_job_pred_models.py", line 189, in <module>
    main()
  File "/usr/pic1/repos/ml-models-all/bdaMlScripts/h2o_job_pred_models.py", line 174, in main
    h2o.remove(auto_ml)
UnboundLocalError: local variable 'auto_ml' referenced before assignment
{code}

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Sorry, somehow I missed this ticket when it was created! There’s a {{project_name}} parameter for AutoML, which allows you to execute two different AutoML runs on the same dataset on the same H2O cluster. However, if they run on the same machine, they will still compete for resources.
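As a rough sketch of the {{project_name}} suggestion: each run gets its own project name, so two runs against the same cluster do not end up in the same AutoML project. The helper names `make_project_name` and `run_automl`, the `automl` prefix, and the 600-second budget are made up for illustration; only `H2OAutoML` and its `project_name` parameter come from the comment above.

```python
import uuid


def make_project_name(prefix="automl"):
    """Return a project name that differs on every call (hypothetical helper)."""
    # Letters, digits and underscores only, starting with a letter.
    return "{}_{}".format(prefix, uuid.uuid4().hex[:8])


def run_automl(train, x, y, project_name, max_runtime_secs=600):
    """Train one AutoML run under its own project_name on an already-connected cluster."""
    # Imported lazily so make_project_name() works without h2o installed.
    from h2o.automl import H2OAutoML

    aml = H2OAutoML(project_name=project_name, seed=1,
                    max_runtime_secs=max_runtime_secs)  # budget is an assumption
    aml.train(x=x, y=y, training_frame=train)
    return aml
```

Each concurrent script would then call something like `run_automl(train, features, y, make_project_name("job_a"))`. The runs still share the cluster's cores and memory, as noted above.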

You can also run two H2O instances on different ports on the same machine, e.g. {{h2o.init(port = 54321, nthreads = 32)}} and {{h2o.init(port = 55555, nthreads = 32)}}. I think this will use different sets of cores for each H2O instance, but I don’t think it’s guaranteed (the OS will try to balance this). The drawback here is that they can’t share data, so the training set will be duplicated. If you have enough RAM, this is probably better though.
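A minimal sketch of the separate-instances option, assuming each AutoML job runs in its own Python process and starts its own local H2O instance. The helpers `init_kwargs_for` and `start_instance` are hypothetical, and the port spacing of 2 is an assumption (on the understanding that an H2O node also uses the port after its API port); any two free, non-adjacent ports such as 54321 and 55555 would do as well. The 64-core total mirrors the cluster log above.

```python
def init_kwargs_for(job_index, n_jobs=2, base_port=54321, total_cores=64):
    """Pick a distinct port and an even share of the cores for one job (hypothetical helper)."""
    if not 0 <= job_index < n_jobs:
        raise ValueError("job_index must be in [0, n_jobs)")
    return {
        "port": base_port + 2 * job_index,      # spacing of 2 is an assumption
        "nthreads": total_cores // n_jobs,      # e.g. 64 cores / 2 jobs = 32 threads
    }


def start_instance(job_index, n_jobs=2):
    """Start (or connect to) this job's own H2O instance and return the h2o module."""
    # Imported lazily so init_kwargs_for() works without h2o installed.
    import h2o

    h2o.init(**init_kwargs_for(job_index, n_jobs))
    return h2o
```

Each process has to import its own copy of the training data into its instance, which is the RAM cost mentioned above.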

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7257
Assignee: UNASSIGNED
Reporter: Michael Jules
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A