h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML kills h2o for long jobs #12520

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We have a benchmark that runs many long jobs with default settings. It has now failed three times on the pc_krkopt dataset. The traceback looks like this:


```
  File "/opt/benchmarks/H2OAIBenchmark.py", line 697, in <module>
    do_benchmark(config_file, git_sha, build_number, h2oai_git_sha, runtime_id)
  File "/opt/benchmarks/H2OAIBenchmark.py", line 671, in do_benchmark
    run(config_file, git_sha, build_number, h2oai_git_sha, runtime_id)  # Config file path, git-sha, build-number
  File "/opt/benchmarks/H2OAIBenchmark.py", line 665, in run
    detect_time_series=detect_time_series, test_file_path=test_file_path)
  File "/opt/benchmarks/H2OAIBenchmark.py", line 366, in run_benchmark
    stopping_metric=stopping_metric)
  File "/opt/benchmarks/H2OAIBenchmark.py", line 57, in do_automl
    leaderboard_frame=leaderboard_frame)
  File "/h2oai_env/lib/python3.6/site-packages/h2o/automl/autoh2o.py", line 363, in train
    self._job.poll()
  File "/h2oai_env/lib/python3.6/site-packages/h2o/job.py", line 58, in poll
    pb.execute(self._refresh_job_status)
  File "/h2oai_env/lib/python3.6/site-packages/h2o/utils/progressbar.py", line 169, in execute
    res = progress_fn()  # may raise StopIteration
  File "/h2oai_env/lib/python3.6/site-packages/h2o/job.py", line 93, in _refresh_job_status
    jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
  File "/h2oai_env/lib/python3.6/site-packages/h2o/h2o.py", line 103, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
  File "/h2oai_env/lib/python3.6/site-packages/h2o/backend/connection.py", line 402, in request
    return self._process_response(resp, save_to)
  File "/h2oai_env/lib/python3.6/site-packages/h2o/backend/connection.py", line 730, in _process_response
    raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
h2o.exceptions.H2OServerError: HTTP 500 Server Error:
'Error: 500'
```
It seems to kill H2O.
http://mr-0xc1:8080/job/h2oai-benchmark-many-defaults/9/
http://mr-0xc1:8080/job/h2oai-benchmark-many-defaults/7/console
http://mr-0xc1:8080/job/h2oai-benchmark-many-defaults/5/
exalate-issue-sync[bot] commented 1 year ago

Navdeep commented: How can we reproduce this? It doesn't seem straightforward.

exalate-issue-sync[bot] commented 1 year ago

Magnus Stensmo commented: Have you tried running it with the same dataset and the same settings? It happens consistently with the setup above.

exalate-issue-sync[bot] commented 1 year ago

Magnus Stensmo commented: This problem keeps recurring: http://mr-0xc1:8080/blue/organizations/jenkins/h2oai-benchmark-many-defaults/detail/h2oai-benchmark-many-defaults/14/pipeline

exalate-issue-sync[bot] commented 1 year ago

Magnus Stensmo commented: I've added more try/except handling so that this no longer kills the benchmark jobs. When it happens now, we simply get no result from AutoML.
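
For readers hitting the same failure, the workaround described above might look roughly like the sketch below. It is not the actual benchmark code: `run_guarded`, `run_all`, and the `catch` parameter are hypothetical names, and the sketch assumes each AutoML run can be wrapped in a zero-argument callable.

```python
import logging
from typing import Any, Callable, Iterable, Optional, Tuple, Type

logger = logging.getLogger("benchmark")


def run_guarded(
    train_fn: Callable[[], Any],
    dataset_name: str,
    catch: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[Any]:
    """Run one training call; return None instead of raising if it fails."""
    try:
        return train_fn()
    except catch as exc:
        # Log the failure so the missing result can be traced back later.
        logger.error("AutoML failed on %s: %r", dataset_name, exc)
        return None


def run_all(jobs: Iterable[Tuple[str, Callable[[], Any]]]) -> dict:
    """Run every (dataset_name, train_fn) pair; one failure does not stop the rest."""
    return {name: run_guarded(fn, name) for name, fn in jobs}
```

In the benchmark, `train_fn` would wrap the `do_automl` call from the traceback, and `catch` could be narrowed to `h2o.exceptions.H2OServerError`. A `None` entry then corresponds to the "no result from AutoML" case. If the H2O server itself has died, later jobs will presumably fail the same way until it is restarted, so this only keeps the benchmark driver alive; it does not fix the underlying crash.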

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5661
Assignee: UNASSIGNED
Reporter: Magnus Stensmo
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A