h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

get_params() not working with XGBoost and gridsearch #16042

Open wendycwong opened 7 months ago

wendycwong commented 7 months ago

Follow-up to support ticket: https://support.h2o.ai/a/tickets/107319

Here is what Gen has run into:

import h2o
from h2o.estimators import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")

# convert the CAPSULE column to a factor
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
response = "CAPSULE"
seed = 1234

import random

# GBM hyperparameters
gbm_params2 = {'learn_rate': [0.01],
               'max_depth': [2],
               'sample_rate': [0.1],
               'col_sample_rate': [0.1],
               'seed': random.sample(range(1, 1000), 100)  # generating a sample of different seed values
               }

# Monotone_constraints
monotone_constraints = {"x3": 1, "x5": 1}

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 2, 'seed': 1}  # this will sample 36 different seed values from the options above

# Train and validate a random grid of GBMs
gbm_grid = H2OGridSearch(model=H2OXGBoostEstimator(monotone_constraints={"AGE": 1}),
                         grid_id='xgboostt_cap',
                         hyper_params=gbm_params2,
                         search_criteria=search_criteria)
gbm_grid.train(y=response, ignored_columns=["ID"], training_frame=prostate)

# Get the grid results, sorted by validation AUC
gbm_gridperf2 = gbm_grid.get_grid(sort_by='auc', decreasing=True)
gbm_gridperf2

best_gbm2 = gbm_gridperf2.models[0]
best_gbm2.get_params()

[image attachment]

wendycwong commented 7 months ago

I cobbled together the following, and I can see that the monotone constraint is set in the model:

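from h2o.estimators.xgboost import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch
from tests import pyunit_utils  # helper module from the h2o-3 Python test suite

assert H2OXGBoostEstimator.available() is True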

# CPU Backend is forced for the results to be comparable
h2oParamsS = {"tree_method": "exact", "seed": 123, "backend": "cpu", "ntrees": 5}

trainFile = pyunit_utils.genTrainFrame(100, 10, enumCols=0, randseed=17)
print(trainFile)
myX = trainFile.names
y='response'
myX.remove(y)

h2oParamsS["monotone_constraints"] = {
    "C1": -1,
    "C3": 1,
    "C7": 1
}

gbm_params2 = {'learn_rate':[0.01, 0.02]}

gridM = H2OGridSearch(H2OXGBoostEstimator(**h2oParamsS), hyper_params=gbm_params2)
gridM.train(x=myX, y=y, training_frame=trainFile)
gridS = gridM.get_grid(sort_by="auc", decreasing=True)
best_gmb2 = gridS.models[0]
native_params2 = best_gmb2._model_json["output"]["native_parameters"].as_data_frame()
constraints2 = (native_params2[native_params2['name'] == "monotone_constraints"])['value'].values[0]
params = best_gmb2.get_params(deep=True)

h2oModelS = H2OXGBoostEstimator(**h2oParamsS)
h2oModelS.train(x=myX, y=y, training_frame=trainFile)

native_params = h2oModelS._model_json["output"]["native_parameters"].as_data_frame()
print(native_params)

constraints = (native_params[native_params['name'] == "monotone_constraints"])['value'].values[0]

assert constraints == u'(-1,0,1,0,0,0,1,0,0,0)'

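constraints2 (read from the best model in the grid) is the same as constraints (read from the standalone model), so the monotone constraints do reach the models built by the grid search.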
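To make the native value easier to compare with the dictionary passed to the estimator, here is a minimal sketch that decodes a string such as '(-1,0,1,0,0,0,1,0,0,0)' back into column names. The helper name decode_native_constraints is hypothetical and not part of H2O, and the sketch assumes the predictors are named C1 through C10 and appear in the same order as the entries in the string.

```python
def decode_native_constraints(value, feature_names):
    """Map a native constraint string such as '(-1,0,1)' back to column names."""
    # Strip the surrounding parentheses and split into one integer per feature.
    directions = [int(tok) for tok in value.strip().strip("()").split(",")]
    if len(directions) != len(feature_names):
        raise ValueError(
            "expected %d entries, got %d" % (len(feature_names), len(directions))
        )
    # Keep only the constrained columns (0 means unconstrained).
    return {name: d for name, d in zip(feature_names, directions) if d != 0}


# Assumed predictor names, in the order used for training (C1 ... C10).
features = ["C%d" % i for i in range(1, 11)]
decoded = decode_native_constraints("(-1,0,1,0,0,0,1,0,0,0)", features)
print(decoded)
```

Under those assumptions the decoded dictionary matches the monotone_constraints argument ({"C1": -1, "C3": 1, "C7": 1}), which is a more readable check than comparing the raw strings.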

wendycwong commented 7 months ago

Something is wrong with get_params()...
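One way to pin down which parameters get lost is to compare the arguments that were passed to the estimator with the dictionary that get_params() reports. The following is only a sketch on plain dictionaries: find_dropped_params is a hypothetical helper, and the reported values are made up for illustration, not actual H2O output.

```python
def find_dropped_params(requested, reported):
    """Return requested parameters that are missing from, or differ in, `reported`."""
    dropped = {}
    for name, wanted in requested.items():
        if name not in reported or reported[name] != wanted:
            dropped[name] = {"requested": wanted, "reported": reported.get(name)}
    return dropped


# Hypothetical values: what was passed to the estimator versus what a
# get_params()-style dictionary might report back.
requested = {"monotone_constraints": {"AGE": 1}, "ntrees": 5}
reported = {"ntrees": 5, "monotone_constraints": None}
print(find_dropped_params(requested, reported))
```

With a running cluster, reported could be replaced by best_gbm2.get_params() from the first snippet, and requested by the keyword arguments given to H2OXGBoostEstimator.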