h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Cross-validation metrics are internally inconsistent & should always report mean values #11851

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Currently, H2O stores two different versions of the cross-validation metrics. One version is stored in the "Cross-validation metrics summary" table (these are the true CV metrics, averaged across folds), and a second version is each metric computed once across all of the pooled CV predictions. The second version is what you get from the h2o.performance() function or from the individual accessors like h2o.auc(), so most users will be using the "incorrect" one. It is bad to have two different point estimates of a single parameter stored in a model.
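To make the discrepancy concrete, here is a minimal base-R sketch (illustration only, not H2O code) of why a non-decomposable metric like AUC gives different answers when averaged per fold versus computed once on the pooled predictions. The `auc()` helper and the synthetic folds are invented for this example:

```r
# Minimal sketch (not H2O code): mean of per-fold AUCs vs AUC of the
# pooled predictions. The auc() helper and synthetic data are hypothetical.
auc <- function(score, label) {
  # Mann-Whitney formulation of AUC: P(score of a positive > score of a negative)
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(42)
folds <- lapply(1:3, function(i) {
  label <- rbinom(100, 1, 0.5)
  # Fold-specific noise level mimics per-fold calibration differences
  data.frame(score = label + rnorm(100, sd = 1 + i / 2), label = label)
})

mean(sapply(folds, function(f) auc(f$score, f$label)))  # mean of per-fold AUCs
pooled <- do.call(rbind, folds)
auc(pooled$score, pooled$label)                         # one AUC over pooled preds
```

Because AUC depends on the joint ranking of all predictions, the two numbers generally differ whenever the folds are not identically calibrated, which is exactly the inconsistency described above.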

The reason for computing the metric once across the aggregated CV predictions was the need to plot the ROC curve easily as one curve (instead of several ROC curves, one per fold). However, that is not a good enough reason to report two sets of metrics, where the set people use the most (via the h2o.performance() function) is technically & statistically "incorrect".

For plotting the ROC curve of cross-validated models, we can still use the aggregated metrics, but the values returned by the h2o.performance() function should pull from the true mean values of those metrics across the folds (and it should print the correct ones too).
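As a sketch of that split, the pooled CV predictions can still back a single aggregated ROC curve; this assumes the `fit` object trained in the code example below and the `plot` method the h2o R package provides for binomial metrics:

```r
# Sketch, reusing `fit` from the code example below: the pooled CV
# predictions still produce one aggregated ROC curve for plotting.
perf <- h2o.performance(fit, xval = TRUE)  # metrics over the pooled CV predictions
plot(perf, type = "roc")                   # a single ROC curve, not one per fold
```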

Code example:

```r
library(h2o)
h2o.init()

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
y <- "response"
x <- setdiff(names(train), y)

# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])

fit <- h2o.gbm(x = x, y = y, training_frame = train, nfolds = 3)
fit@model$cross_validation_metrics_summary  # one set of metrics (correct)

# Comparisons
h2o.auc(h2o.performance(fit, xval = TRUE))
# 0.7763038 vs 0.776584 from the xval table

h2o.mse(h2o.performance(fit, xval = TRUE))
# 0.1921713 vs 0.19217622 from the xval table

h2o.rmse(h2o.performance(fit, xval = TRUE))
# 0.4383734 vs 0.43837336 from the xval table
```
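Until the reported behavior changes, a workaround is to read the per-fold means directly from the summary table rather than from h2o.performance(). This is a hedged sketch reusing `fit` from above; the row names ("auc", "rmse") and column names ("mean", "cv_1_valid", ...) are assumed from the table's printed layout, and values may come back as character, hence the as.numeric() calls:

```r
# Workaround sketch: pull the "true" CV metrics (per-fold means) straight
# from the summary table. Row/column names assumed from the printed layout.
cv <- fit@model$cross_validation_metrics_summary
as.numeric(cv["auc", "mean"])   # mean AUC across the 3 folds
as.numeric(cv["rmse", "mean"])  # mean RMSE across the 3 folds

# Cross-check: averaging the per-fold columns by hand should reproduce the
# "mean" column, not the pooled h2o.performance() value.
fold_cols <- grep("^cv_", colnames(cv), value = TRUE)
mean(as.numeric(unlist(cv["auc", fold_cols])))
```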

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Would love to get this fixed in 3.34.0.1! I got another question about this on Stack Overflow the other day and had to explain that we have two "versions" of CV AUC (the correct one and the not-correct one): https://stackoverflow.com/questions/64032018/retrieve-cross-validation-performance-auc-on-h2o-automl-for-holdout-dataset/64057390#64057390

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4975
Assignee: Michal Kurka
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A