h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Return Actual Train Metrics for Random Forest #12649

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We should change RF so that when you ask for training metrics, you get actual training metrics instead of OOB metrics. Right now, if you extract “train” metrics, they are OOB metrics, and there’s no way to get the train metrics other than to manually recreate them using h2o.performance() on the training set. This is confusing to users, inconsistent with our definition of “training metrics”, and possibly has adverse effects on our ability to do early stopping.
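For background on why OOB and training metrics are computed over different rows: each tree trains on a bootstrap sample drawn with replacement, so each row is absent from a given tree's sample with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368. A minimal pure-Python sketch (illustrative only, not h2o code):

```python
# Illustrative sketch (not h2o code): why out-of-bag rows exist at all.
# Each tree in a random forest trains on a bootstrap sample drawn with
# replacement; rows never drawn are "out-of-bag" (OOB) for that tree.
import random

random.seed(1234)
n = 10_000                                        # rows in the training frame
sample = [random.randrange(n) for _ in range(n)]  # one bootstrap sample
oob = n - len(set(sample))                        # rows this tree never saw
frac = oob / n

# Expected OOB fraction is (1 - 1/n)^n, which approaches 1/e ~ 0.368.
print(round(frac, 3))
```

So roughly a third of the training frame is scored out-of-bag per tree, which is why OOB metrics and true training metrics generally disagree.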

Note: when making the change, make sure that checkpointing still works as expected.

More details: Currently, if you look up the evaluation metrics for Random Forest:

(using: 3.20.0.3 - DRF + iris - default settings)

OOB evaluation metrics are returned when you ask for the training metrics (passing in a validation frame or specifying nfolds does not appear to change this):

In Flow it will state OUTPUT - Training_Metrics

In R (same for Python), h2o.mse(fit, train = TRUE) returns the OOB metrics instead of the train metrics. To get the actual train metrics, you would need to do h2o.mse(h2o.performance(fit, newdata = as.h2o(iris)))

Code Snippet for testing:

{code:r}
library(h2o)
h2o.init()

# When only the training frame is supplied:
# h2o.mse(fit, train = TRUE) != h2o.mse(h2o.performance(fit, newdata = as.h2o(iris)))
fit <- h2o.randomForest(x = 1:4, y = 5, training_frame = as.h2o(iris), seed = 1234)
h2o.mse(fit, train = TRUE)
h2o.mse(h2o.performance(fit, newdata = as.h2o(iris)))

# It doesn't seem that providing nfolds changes anything:
fit2 <- h2o.randomForest(x = 1:4, y = 5, training_frame = as.h2o(iris), nfolds = 3, seed = 1234)
h2o.mse(fit2, train = TRUE)
h2o.mse(fit2, xval = TRUE)
h2o.mse(h2o.performance(fit2, newdata = as.h2o(iris)))

# Split the data to see if passing a validation frame changes anything:
iris.split <- h2o.splitFrame(as.h2o(iris), ratios = c(0.2, 0.5))
train1 <- iris.split[[1]]
valid1 <- iris.split[[2]]

# Check when only a training frame is supplied:
fit2a <- h2o.randomForest(x = 1:4, y = 5, training_frame = train1, seed = 1234)
h2o.mse(fit2a, train = TRUE)
h2o.mse(h2o.performance(fit2a, newdata = train1))

# When supplying a validation frame, should h2o.mse(fit3, train = TRUE) be
# equal to h2o.mse(h2o.performance(fit3, newdata = train1))?
fit3 <- h2o.randomForest(x = 1:4, y = 5, training_frame = train1, validation_frame = valid1, seed = 1234)
h2o.mse(fit3, train = TRUE)
h2o.mse(fit3, valid = TRUE)
h2o.mse(h2o.performance(fit3, newdata = train1))
{code}
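The gap the snippet above surfaces can also be shown with a toy, pure-Python ensemble (hypothetical 1-nearest-neighbour "trees", not h2o's implementation): scoring every row with all trees (true training metrics) generally yields a lower error than scoring each row only with the trees that never saw it (OOB metrics), which is what h2o.mse(fit, train = TRUE) currently reports.

```python
# Toy illustration (plain Python, not h2o): training metrics computed over
# all trees vs. OOB metrics computed only from trees that never saw a row.
import random

random.seed(1)
n, n_trees = 200, 25
X = [random.uniform(0, 1) for _ in range(n)]
y = [3 * x + random.gauss(0, 0.1) for x in X]

# Each "tree" is a 1-nearest-neighbour memorizer of its bootstrap sample.
trees = []
for _ in range(n_trees):
    idx = [random.randrange(n) for _ in range(n)]
    trees.append((set(idx), [(X[i], y[i]) for i in idx]))

def predict(tree, x):
    _, mem = tree
    return min(mem, key=lambda p: abs(p[0] - x))[1]

def mse(pairs):
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

# Training metrics: every row scored by every tree.
train_pred = [(sum(predict(t, X[i]) for t in trees) / n_trees, y[i])
              for i in range(n)]

# OOB metrics: each row scored only by trees whose bootstrap sample missed it.
oob_pred = []
for i in range(n):
    preds = [predict(t, X[i]) for t in trees if i not in t[0]]
    if preds:
        oob_pred.append((sum(preds) / len(preds), y[i]))

train_mse, oob_mse = mse(train_pred), mse(oob_pred)
print(train_mse, oob_mse)  # OOB MSE is typically the larger of the two
```

This mirrors the issue: the two numbers come from genuinely different computations, so labelling the OOB one as "training metrics" is misleading.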

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5795
Assignee: Michal Raška
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A