h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

.model_performance object has auc() + other member functions that fail b/c _metric_json is missing keys #8304

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Short version:

{code:python}
/usr/local/lib/python3.6/dist-packages/h2o/model/metrics_base.py in auc(self)
    191     def auc(self):
    192         """The AUC for this set of metrics."""
--> 193         return self._metric_json['AUC']

KeyError: 'AUC'
{code}

But if you look at the keys of _metric_json, you can see that many of the keys expected by the corresponding member functions are missing:

{code:python}
._metric_json.keys()

dict_keys(['__meta', 'model', 'model_checksum', 'frame', 'frame_checksum', 'description', 'model_category', 'scoring_time', 'predictions', 'MSE', 'RMSE', 'nobs', 'custom_metric_name', 'custom_metric_value', 'r2', 'hit_ratio_table', 'cm', 'logloss', 'mean_per_class_error'])
{code}

These member functions fail for the same reason: aucpr, aic, gini, null_deviance. There might be others.

Either the data needs to be added back to the JSON, or the broken member functions need to go away (or be marked deprecated).
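In the meantime, a defensive lookup avoids the crash. This is only a minimal workaround sketch, not an official H2O API: safe_metric is a hypothetical helper, perf stands for whatever model_performance() returned, and it relies on _metric_json behaving like a plain dict (which the .keys() output above suggests it does).

{code:python}
# Minimal workaround sketch (not an official H2O API).
# 'perf' stands for whatever model_performance() returned; safe_metric is a
# hypothetical helper that guards the dict lookup instead of raising KeyError.
def safe_metric(perf, key):
    """Return the metric value if the backend sent it, else None."""
    return perf._metric_json.get(key)

# Usage: safe_metric(perf, 'AUC') returns None when 'AUC' is missing.
{code}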

In case you want reproducible code:

{code:python}
import h2o
from h2o.estimators import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
iris["class"] = iris["class"].asfactor()

xgboost_params = {
    "ntrees": [3, 4],                        # ,5,6,7,8]
    "max_depth": [3, 4],                     # ,5,6,7,8,9,10,11]
    "learn_rate": [0.01, 0.02],              # ,0.04]
    "sample_rate": [0.5, 0.6, 0.7],          # ,0.8]
    "col_sample_rate_per_tree": [0.5, 0.6],  # ,0.7,0.8,0.9]
    "min_rows": [3, 4],                      # ,6,7]
    "seed": [42]
}

search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 20, 'seed': 1}

xgboost_grid1 = H2OGridSearch(model=H2OXGBoostEstimator,
                              grid_id='xgboost_grid_cartesian',
                              hyper_params=xgboost_params,
                              search_criteria=search_criteria)

x = [c for c in iris.columns if c != "class"]  # predictor columns
xgboost_grid1.train(x=x, y='class', training_frame=iris, validation_frame=iris, seed=42)

# Get the grid results, sorted by validation AUC
xgboost_grid1_perf = xgboost_grid1.get_grid(sort_by='logloss', decreasing=True)  # would be nice to use 'auc'
print(xgboost_grid1_perf)

# Grab the top XGB model, chosen by validation AUC
best_xgb1 = xgboost_grid1_perf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_xgb1_perf1 = best_xgb1.model_performance(iris)

print(best_xgb1_perf1._metric_json)
best_xgb1_perf1.auc()  # raises KeyError: 'AUC'
{code}
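If it helps triage, here is a small probe sketch that runs against the metrics object produced above and reports which accessors raise KeyError; the accessor list is just the ones named in this report, so there may be more.

{code:python}
# Probe sketch: check which of the accessors named in this report fail on the
# metrics object from the reproduction above.
for name in ['auc', 'aucpr', 'aic', 'gini', 'null_deviance', 'logloss']:
    accessor = getattr(best_xgb1_perf1, name, None)
    if accessor is None:
        print(name, '-> no such accessor on this metrics class')
        continue
    try:
        print(name, '->', accessor())
    except KeyError as err:
        print(name, '-> KeyError:', err)
{code}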

exalate-issue-sync[bot] commented 1 year ago

Clem Wang commented: logloss is another function that fails:

{noformat}
/usr/local/lib/python3.6/dist-packages/h2o/model/metrics_base.py in logloss(self)
    176     def logloss(self):
    177         """Log loss."""
--> 178         return self._metric_json["logloss"]
    179
    180

KeyError: 'logloss'
{noformat}

This is too bad, because logloss is an option for GridSearch:

{code:python}
xgboost_grid2_perf = xgboost_grid2.get_grid(sort_by='logloss',   # rmse
                                            decreasing=False)    # Lower is better for Logloss
{code}
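As a stopgap (just a sketch, and it does reach into a private attribute), the raw value can be read straight from _metric_json when the key happens to be present, as 'logloss' is in the key list shown earlier; perf again stands for whatever model_performance() returned.

{code:python}
# Stopgap sketch: read the raw value from the private _metric_json dict.
# Returns None instead of raising KeyError when the key is absent.
raw_logloss = perf._metric_json.get('logloss')  # 'perf' = a model_performance() result
print('logloss from _metric_json:', raw_logloss)
{code}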


h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7333
Assignee: New H2O Bugs
Reporter: Clem Wang
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A