If you try to set max_runtime_secs in the search_criteria of a Cartesian grid search, the request fails with a confusing error:
{code} H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException: Error: Unknown parameter: max_runtime_secs Request: POST /99/Grid/gbm data: {'seed': '1234', 'ignored_columns': '["ActualElapsedTime","CRSElapsedTime","DepDelay","CRSArrTime","CarrierDelay","CancellationCode","ArrDelay","LateAircraftDelay","DayofMonth","Diverted","CRSDepTime","Cancelled","SecurityDelay","DepTime","TailNum","TaxiIn","IsArrDelayed","NASDelay","TaxiOut","AirTime","WeatherDelay","ArrTime"]', 'stopping_rounds': '5', 'stopping_tolerance': '0.0001', 'search_criteria': "{'max_runtime_secs': 30, 'strategy': 'Cartesian'}", 'response_column': 'IsDepDelayed', 'validation_frame': 'py_30_sid_a2e6', 'hyper_parameters': "{'col_sample_rate': [0.3, 0.7, 0.8, 1]}", 'training_frame': 'py_31_sid_a2e6', 'stopping_metric': 'AUC'} {code}
The error message should be updated to say something like: max_runtime_secs can only be used when strategy = "RandomDiscrete".
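For reference, a minimal sketch of the two search_criteria dictionaries involved (the variable names here are just for illustration); only the strategy value differs:

{code}
# accepted: max_runtime_secs together with the RandomDiscrete strategy
search_criteria_ok = {'strategy': "RandomDiscrete", 'max_runtime_secs': 30}

# currently rejected with the unclear "Unknown parameter" error: max_runtime_secs with Cartesian
search_criteria_bad = {'strategy': "Cartesian", 'max_runtime_secs': 30}
{code}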
Reproducible Python example below:
{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(strict_version_check = False)

# import the airlines dataset:
# this dataset is used to classify whether a flight will be delayed ('YES') or not ('NO')
# original data can be found at http://www.transtats.bts.gov/
airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] = airlines["Year"].asfactor()
airlines["Month"] = airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines["FlightNum"] = airlines["FlightNum"].asfactor()

# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)

# try using the col_sample_rate parameter: initialize your estimator
airlines_gbm = H2OGradientBoostingEstimator(col_sample_rate = .7, seed = 1234)

# then train your model
airlines_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the AUC for the validation data
print(airlines_gbm.auc(valid = True))

# example of values to grid over for col_sample_rate
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

# select the values for col_sample_rate to grid over
hyper_params = {'col_sample_rate': [.3, .7, .8, 1]}

# this example uses Cartesian grid search because the search space is small
# and we want to see the performance of all models; for a larger search space,
# use random grid search instead: {'strategy': "RandomDiscrete"}

# initialize the GBM estimator and use early stopping once the validation AUC
# doesn't improve by at least 0.01% for 5 consecutive scoring events
airlines_gbm_2 = H2OGradientBoostingEstimator(seed = 1234, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 1e-4)

# build the grid search with the previously made GBM and hyper parameters;
# passing max_runtime_secs here with the Cartesian strategy triggers the error above
grid = H2OGridSearch(model = airlines_gbm_2, hyper_params = hyper_params, search_criteria = {'strategy': "Cartesian", "max_runtime_secs": 30})

# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# sort the grid models by decreasing AUC
sorted_grid = grid.get_grid(sort_by = 'auc', decreasing = True)
print(sorted_grid)
{code}
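For comparison, a sketch of two variants that avoid the error, reusing the frames and estimator from the example above. The first keeps the time budget in search_criteria but switches to random search; the second is an assumed workaround that moves max_runtime_secs onto the estimator as a per-model limit so the Cartesian strategy can be kept.

{code}
# variant 1: keep the time budget in search_criteria, but switch to random search
grid_random = H2OGridSearch(model = airlines_gbm_2,
                            hyper_params = hyper_params,
                            search_criteria = {'strategy': "RandomDiscrete", 'max_runtime_secs': 30})
grid_random.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# variant 2 (assumed workaround): set max_runtime_secs on the estimator itself so each
# grid model gets its own per-model time limit, and keep the Cartesian strategy
airlines_gbm_3 = H2OGradientBoostingEstimator(seed = 1234, stopping_rounds = 5,
                                              stopping_metric = "AUC", stopping_tolerance = 1e-4,
                                              max_runtime_secs = 30)
grid_cartesian = H2OGridSearch(model = airlines_gbm_3,
                               hyper_params = hyper_params,
                               search_criteria = {'strategy': "Cartesian"})
grid_cartesian.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
{code}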