h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML Models Overfitting #7924

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Top AutoML models show a huge disparity between training metrics and test or CV metrics. This is especially the case when one variable is found to have much greater importance than the rest.
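To make the reported disparity concrete, here is a minimal sketch (assuming a hypothetical dataset `train.csv` with a hypothetical binary target column `response`) that compares training vs. cross-validation AUC for every model on the AutoML leaderboard:

```python
# Minimal sketch: surface the train-vs-CV metric disparity described above.
# The file path and target column name are hypothetical.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")  # hypothetical dataset
y = "response"                        # hypothetical binary target
train[y] = train[y].asfactor()

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y=y, training_frame=train)

# Compare training AUC to cross-validation AUC for each leaderboard model.
for model_id in aml.leaderboard["model_id"].as_data_frame()["model_id"]:
    m = h2o.get_model(model_id)
    print(model_id, "train AUC:", m.auc(train=True), "CV AUC:", m.auc(xval=True))
```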

exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: This paper might be helpful regarding overfitting on the leaderboard: https://arxiv.org/abs/1506.02629

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Right now the ticket is for "AutoML" overfitting; we can leave it that way for now, but the underlying problem is H2O-wide (we might need to break this into a few tickets).

First comment: We already know there is bias (i.e., overfitting) on the default CV leaderboard, which uses CV metrics for efficiency. The CV metrics are reported, yet H2O models also use CV for early stopping when CV is turned on; this currently can't be turned off by the user, even when the user provides a validation_frame. Documented in [PUBDEV-5049](https://0xdata.atlassian.net/browse/PUBDEV-5049). This is a known issue that we should fix.
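For illustration, a hedged sketch of the behavior described above (not a fix), reusing the assumed `train` frame and target `y` from the earlier snippet plus an assumed `valid` frame: both CV and a validation_frame are supplied, but per this ticket the CV metrics drive early stopping.

```python
# Sketch of the behavior described above: CV and a validation frame are
# both supplied, but per this ticket the early-stopping decision is driven
# by the CV metrics, with no user-facing way to change that.
from h2o.estimators import H2OGradientBoostingEstimator

gbm = H2OGradientBoostingEstimator(
    nfolds=5,                  # CV turned on
    stopping_rounds=3,
    stopping_metric="AUC",
    stopping_tolerance=1e-3,
    seed=1,
)
# `train`, `valid`, and `y` are assumed to be defined as in the earlier sketch.
gbm.train(y=y, training_frame=train, validation_frame=valid)
```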

Second comment: AutoML users are encouraged to pass a leaderboard frame to get actual test metrics.
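A minimal sketch of that advice, reusing the hypothetical `train` frame and target `y` from above plus an assumed held-out `test` frame:

```python
# Pass a leaderboard_frame so the leaderboard is scored on held-out data
# rather than CV metrics. `test` is an assumed held-out H2OFrame.
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y=y, training_frame=train, leaderboard_frame=test)
print(aml.leaderboard.head())
```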

There are a few issues that this ticket should address:

We first need to decide how we are measuring "overfitting" here: it could be a difference between CV and test metrics, and/or a difference between the rankings on the CV and test leaderboards. How should we distinguish "overfitting" from "not overfitting" so we can define what completion/success looks like for this task?
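One hypothetical way to operationalize both notions (per-model metric gap and ranking agreement), assuming `cv_lb` and `test_lb` are pandas DataFrames from `leaderboard.as_data_frame()` for the same models, scored without and with a leaderboard_frame; Spearman correlation is an assumption here, not an H2O API:

```python
# Hypothetical measurement of "overfitting": per-model gap between CV and
# test AUC, plus rank agreement between the two leaderboards.
from scipy.stats import spearmanr

merged = cv_lb.merge(test_lb, on="model_id", suffixes=("_cv", "_test"))
merged["auc_gap"] = merged["auc_cv"] - merged["auc_test"]

rho, _ = spearmanr(merged["auc_cv"], merged["auc_test"])
print(merged[["model_id", "auc_cv", "auc_test", "auc_gap"]])
print("Spearman correlation of CV vs. test rankings:", rho)
```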

We can solve [PUBDEV-5049](https://0xdata.atlassian.net/browse/PUBDEV-5049) (described above) and see if that satisfactorily resolves the issue.

Lastly, we can consider implementing the reusable holdout and using those metrics on the leaderboard instead of CV metrics (or providing a way for them to be used instead of CV metrics). Or we can try this first, instead of resolving PUBDEV-5049.
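For reference, a simplified sketch of the reusable-holdout idea (Thresholdout, from the paper linked earlier); this is not an H2O API, and the threshold/noise parameters are illustrative:

```python
# Simplified Thresholdout sketch (see arXiv:1506.02629): answer leaderboard
# queries from the training metric when train and holdout agree, and only
# spend holdout information (with noise) when they disagree. Parameter
# values are illustrative, not from H2O.
import numpy as np

rng = np.random.default_rng(1)

def thresholdout(train_metric, holdout_metric, threshold=0.01, sigma=0.005):
    """Return a reusable-holdout estimate of one model's metric."""
    eta = rng.laplace(scale=4 * sigma)
    if abs(train_metric - holdout_metric) > threshold + eta:
        # Disagreement: reveal a noised holdout value.
        return holdout_metric + rng.laplace(scale=sigma)
    # Agreement: answer from the training metric, leaking nothing new
    # about the holdout set.
    return train_metric
```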

What is the connection to one feature having very high importance? Is this only an issue when there is some target leakage from that highly predictive column?

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7715
Assignee: UNASSIGNED
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A