h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Random Forest metrics inconsistency/error #12110

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

[^prostate.csv]

For random forest, the training metrics seem to be wrong, for two reasons: 1) the training errors are much higher than the validation errors for many trees, and 2) the metrics reported by the model do not match the metrics calculated manually:

{code:python}
import h2o
h2o.init()
prostate_df = h2o.import_file("prostate.csv")
prostate_df['CAPSULE'] = prostate_df['CAPSULE'].asfactor()
train_df, valid_df = prostate_df.split_frame(ratios=[0.7], seed=1001)

from h2o.estimators.random_forest import H2ORandomForestEstimator

rf = H2ORandomForestEstimator(model_id="rf", ntrees=20, seed=1001)
rf.train(y='CAPSULE', training_frame=train_df, validation_frame=valid_df)
rf.plot()

# metrics from the model
print('Train logloss: ' + str(rf.logloss()))
print('Valid logloss: ' + str(rf.logloss(valid=True)))
print('Train AUC: ' + str(rf.auc()))
print('Valid AUC: ' + str(rf.auc(valid=True)))

# compare to metrics calculated directly
print("Train logloss: " + str(h2o.make_metrics(rf.predict(train_df)['p1'], train_df['CAPSULE'], domain=['0','1']).logloss()))
print("Valid logloss: " + str(h2o.make_metrics(rf.predict(valid_df)['p1'], valid_df['CAPSULE'], domain=['0','1']).logloss()))
print("Train AUC: " + str(h2o.make_metrics(rf.predict(train_df)['p1'], train_df['CAPSULE'], domain=['0','1']).auc()))
print("Valid AUC: " + str(h2o.make_metrics(rf.predict(valid_df)['p1'], valid_df['CAPSULE'], domain=['0','1']).auc()))
{code}
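To rule out the metric code itself as the source of the mismatch, the two metrics can also be recomputed from exported labels and predicted probabilities without H2O. The following is a minimal sketch in plain Python; the helper names `logloss` and `auc` and the toy data are hypothetical and not part of the H2O API, and the results may differ slightly from H2O's values if H2O approximates AUC internally.

```python
import math

def logloss(y_true, p1, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, p1 the predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p1):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def auc(y_true, p1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, p1) if y == 1]
    neg = [p for y, p in zip(y_true, p1) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy data for illustration only (not taken from prostate.csv)
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print("logloss:", logloss(y, p))
print("AUC:", auc(y, p))
```

The pairwise AUC is quadratic in the number of rows, which is fine for a data set of this size but would need a rank-based formulation for much larger frames.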

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: [~accountid:5c37d510ff324728a1da9017] this is because the default sample rate used while building the model is ~0.6
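One way to read this explanation, assuming the training metrics reported by the model are computed only from out-of-bag (OOB) rows (an assumption here, not something stated in this thread), is that each training row is scored by only the trees that did not see it. The sketch below estimates how many such trees a row gets. It further assumes each tree keeps a row independently with probability `sample_rate`; the function name `oob_summary` is hypothetical.

```python
def oob_summary(ntrees, sample_rate):
    """Expected out-of-bag coverage for a single training row.

    Assumes every tree independently keeps the row with probability
    `sample_rate`; this is a modelling assumption for the sketch.
    """
    p_oob = 1.0 - sample_rate            # chance a given tree did not see the row
    expected_oob_trees = ntrees * p_oob  # trees available to score the row out-of-bag
    p_never_oob = sample_rate ** ntrees  # row was in-bag for every tree
    return expected_oob_trees, p_never_oob

# (ntrees, sample_rate) pairs: a rate near 0.6 with 20 trees, and 0.99 with 100 trees
for ntrees, rate in [(20, 0.632), (100, 0.99)]:
    trees, never = oob_summary(ntrees, rate)
    print(f"ntrees={ntrees}, sample_rate={rate}: "
          f"~{trees:.2f} OOB trees per row, P(row never OOB)={never:.4f}")
```

Under these assumptions, training rows are scored by far fewer trees than validation rows, which see the full forest, so a gap between the model's training metrics and metrics from `rf.predict(train_df)` would be expected rather than a bug.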

exalate-issue-sync[bot] commented 1 year ago

Ben Campbell commented: I can replicate the issue with a higher sampling rate and with larger data sets, e.g. the credit card data (22,000 rows):

{code:python}
import h2o
h2o.init()
credit_df = h2o.import_file("credit_card_train.csv")
credit_df['DEFAULT_PAYMENT_NEXT_MONTH'] = credit_df['DEFAULT_PAYMENT_NEXT_MONTH'].asfactor()
train_df, valid_df = credit_df.split_frame(ratios=[0.7], seed=1001)

from h2o.estimators.random_forest import H2ORandomForestEstimator

rf = H2ORandomForestEstimator(model_id="rf", ntrees=100, seed=1001, sample_rate=0.99, col_sample_rate_per_tree=0.9)
rf.train(y='DEFAULT_PAYMENT_NEXT_MONTH', training_frame=train_df, validation_frame=valid_df)
rf.plot()

# metrics from the model
print('Train logloss: ' + str(rf.logloss()))
print('Valid logloss: ' + str(rf.logloss(valid=True)))
print('Train AUC: ' + str(rf.auc()))
print('Valid AUC: ' + str(rf.auc(valid=True)))

# compare to metrics calculated directly
print("Train logloss: " + str(h2o.make_metrics(rf.predict(train_df)['Yes'], train_df['DEFAULT_PAYMENT_NEXT_MONTH'], domain=['No','Yes']).logloss()))
print("Valid logloss: " + str(h2o.make_metrics(rf.predict(valid_df)['Yes'], valid_df['DEFAULT_PAYMENT_NEXT_MONTH'], domain=['No','Yes']).logloss()))
print("Train AUC: " + str(h2o.make_metrics(rf.predict(train_df)['Yes'], train_df['DEFAULT_PAYMENT_NEXT_MONTH'], domain=['No','Yes']).auc()))
print("Valid AUC: " + str(h2o.make_metrics(rf.predict(valid_df)['Yes'], valid_df['DEFAULT_PAYMENT_NEXT_MONTH'], domain=['No','Yes']).auc()))
{code}

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5238
Assignee: Michal Kurka
Reporter: Ben Campbell
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: prostate.csv
Attached By: Ben Campbell
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5238/prostate.csv