Open exalate-issue-sync[bot] opened 1 year ago
Megan Kurka commented: This paper might be helpful regarding overfitting on the leaderboard: [https://arxiv.org/abs/1506.02629|https://arxiv.org/abs/1506.02629]
Erin LeDell commented: Right now the ticket is for “AutoML” overfitting – we can leave it that way for now, but the real issue is an H2O-wide issue (we might need to break this into a few tickets).
First comment: We already know that the default CV leaderboard is slightly biased (aka overfitting); we use CV metrics there for efficiency. The CV metrics are what get reported, yet H2O models also use CV for early stopping whenever CV is turned on. The user currently can't turn this off, even when providing a validation_frame. This is documented [here|https://0xdata.atlassian.net/browse/PUBDEV-5049] and is a known issue that we should fix.
Second comment: AutoML users are encouraged to pass a leaderboard frame to get actual test metrics.
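To illustrate why ranking models on the same metrics that are later reported gives optimistic numbers, here is a minimal toy simulation. It is not H2O code: the function name, the parameters, and the assumption that every candidate model has the same true accuracy are all hypothetical, and it covers only the selection effect, not the early-stopping effect described above. The top-ranked score still comes out higher than the same model's score on fresh rows, which is the role a separate leaderboard frame plays.

```python
import random
import statistics


def simulate_selection_bias(n_models=50, n_rows=200, true_accuracy=0.70,
                            n_trials=200, seed=0):
    """Toy model of leaderboard selection bias (all values are assumptions).

    Every candidate has the same true accuracy. In each trial we rank the
    candidates by a noisy score, keep the winner's score, and then score
    that winner again on fresh, unseen rows.
    """
    rng = random.Random(seed)

    def observed_accuracy():
        # Accuracy measured on n_rows independent rows.
        hits = sum(rng.random() < true_accuracy for _ in range(n_rows))
        return hits / n_rows

    ranking_scores, holdout_scores = [], []
    for _ in range(n_trials):
        scores = [observed_accuracy() for _ in range(n_models)]
        # Score of the model that tops the leaderboard.
        ranking_scores.append(max(scores))
        # Same model evaluated on held-out rows: an independent draw,
        # because the winner is no better than the others in truth.
        holdout_scores.append(observed_accuracy())
    return statistics.mean(ranking_scores), statistics.mean(holdout_scores)


selected, holdout = simulate_selection_bias()
print(f"mean score used for ranking: {selected:.3f}")
print(f"mean score on fresh holdout: {holdout:.3f}")
```

The gap between the two printed values comes purely from picking the maximum of many noisy scores; it grows with the number of candidates and shrinks as the number of rows increases.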
There are a few issues that this ticket should address:
JIRA Issue Migration Info
Jira Issue: PUBDEV-7715
Assignee: UNASSIGNED
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
Top AutoML models show a large disparity between training metrics and test or CV metrics. This is especially the case when one variable is found to have much greater importance than the rest.