h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Better error message when user specifies Cartesian grid & max_runtime_secs #11317

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If you try to set max_runtime_secs in a Cartesian grid search, it produces a confusing error:

{code}
H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Unknown parameter: max_runtime_secs
  Request: POST /99/Grid/gbm
    data: {'seed': '1234', 'ignored_columns': '["ActualElapsedTime","CRSElapsedTime","DepDelay","CRSArrTime","CarrierDelay","CancellationCode","ArrDelay","LateAircraftDelay","DayofMonth","Diverted","CRSDepTime","Cancelled","SecurityDelay","DepTime","TailNum","TaxiIn","IsArrDelayed","NASDelay","TaxiOut","AirTime","WeatherDelay","ArrTime"]', 'stopping_rounds': '5', 'stopping_tolerance': '0.0001', 'search_criteria': "{'max_runtime_secs': 30, 'strategy': 'Cartesian'}", 'response_column': 'IsDepDelayed', 'validation_frame': 'py_30_sid_a2e6', 'hyper_parameters': "{'col_sample_rate': [0.3, 0.7, 0.8, 1]}", 'training_frame': 'py_31_sid_a2e6', 'stopping_metric': 'AUC'}
{code}

This should be updated to say something like: max_runtime_secs can only be used when strategy = "RandomDiscrete"
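To make the proposal concrete, here is a minimal sketch of the kind of check that could produce the clearer message. It is only an illustration: `validate_search_criteria` and `RANDOM_DISCRETE_ONLY` are hypothetical names, not part of the h2o package, and the real fix would presumably live in the server-side grid handler. The sketch also assumes that a missing `strategy` means Cartesian.

```python
# Hypothetical client-side check; NOT part of the h2o package.
# Only max_runtime_secs is listed because it is the only key named in this issue;
# other RandomDiscrete-only keys could be added to the tuple.
RANDOM_DISCRETE_ONLY = ("max_runtime_secs",)


def validate_search_criteria(search_criteria):
    """Raise a readable error if a RandomDiscrete-only key is used with another strategy."""
    # Assumption: an unspecified strategy is treated as Cartesian.
    strategy = search_criteria.get("strategy", "Cartesian")
    if strategy == "RandomDiscrete":
        return
    for key in RANDOM_DISCRETE_ONLY:
        if key in search_criteria:
            raise ValueError(
                '{} can only be used when strategy = "RandomDiscrete" '
                '(got strategy = "{}")'.format(key, strategy)
            )
```

Calling `validate_search_criteria({'strategy': 'Cartesian', 'max_runtime_secs': 30})` raises a `ValueError` carrying the suggested wording, while the same criteria with `'strategy': 'RandomDiscrete'` pass silently.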

Reproducible Python example below:

{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(strict_version_check = False)

# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()

# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)

# try using the col_sample_rate parameter:
# initialize your estimator
airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate = .7, seed = 1234)

# then train your model
airlines_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(airlines_gbm.auc(valid=True))

# Example of values to grid over for col_sample_rate
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

# select the values for col_sample_rate to grid over
hyper_params = {'col_sample_rate': [.3, .7, .8, 1]}

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space use
# random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the GBM estimator
# use early stopping once the validation AUC doesn't improve by at least 0.01% for
# 5 consecutive scoring events
airlines_gbm_2 = H2OGradientBoostingEstimator(seed = 1234, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 1e-4)

# build grid search with previously made GBM and hyper parameters
grid = H2OGridSearch(model = airlines_gbm_2, hyper_params = hyper_params, search_criteria = {'strategy': "Cartesian", "max_runtime_secs": 30})

# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# sort the grid models by decreasing AUC
sorted_grid = grid.get_grid(sort_by = 'auc', decreasing = True)
print(sorted_grid)
{code}
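
Until the message is improved, the suggested wording implies a workaround: keep the time budget and switch the strategy to RandomDiscrete. The sketch below shows what that might look like for the example above. `build_budgeted_grid` is a hypothetical helper, not an h2o function, and the `seed` entry is an addition for reproducibility that is not in the original script.

```python
# Search criteria that keep the 30-second budget by switching to random search.
# "seed" is an addition for reproducibility; it is not in the original example.
search_criteria = {"strategy": "RandomDiscrete", "max_runtime_secs": 30, "seed": 1234}


def build_budgeted_grid(model, hyper_params):
    """Hypothetical helper: build an H2OGridSearch that uses the criteria above."""
    # Imported lazily so the dictionary above can be inspected without h2o installed.
    from h2o.grid.grid_search import H2OGridSearch
    return H2OGridSearch(model = model, hyper_params = hyper_params,
                         search_criteria = search_criteria)
```

With the objects from the script above, this would be used as `grid = build_budgeted_grid(airlines_gbm_2, hyper_params)`, followed by the same `grid.train(...)` call. Random search may stop before every combination has been tried if the time budget runs out.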

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4432
Assignee: New H2O Bugs
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A