h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Unforce modulo folds in the Target Encoding enabled AutoML runs #7841

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I am not sure whether we need to do this, but I am opening a ticket to revisit the conversation we had on the TE integration PR (https://github.com/h2oai/h2o-3/pull/4927#discussion_r490449726):

Regarding the modulo folds (used so that we can ensure the same folds in CV and modeling): another option, rather than forcing modulo folds, is to raise an error if the user does not set the seed arg when TE is turned on. If the user does set the seed, then we can ensure the same folds across TE and modeling, right?
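To make the two options concrete, here is a minimal sketch in plain Python. It is not H2O's fold-assignment code; `modulo_folds` and `random_folds` are hypothetical helpers that only illustrate why modulo folds need no seed while random folds line up across TE and modeling only when both steps share one.

```python
import random


def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds; no randomness is involved."""
    return [i % nfolds for i in range(n_rows)]


def random_folds(n_rows, nfolds, seed):
    """Assign each row a fold drawn from a seeded generator (illustrative only)."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


# Modulo folds are identical on every call, so the TE step and the
# modeling step agree without any seed being set.
te_folds = modulo_folds(10, 3)
model_folds = modulo_folds(10, 3)

# Random folds agree between the two steps only if both use the same seed.
same_seed_match = random_folds(10, 3, seed=42) == random_folds(10, 3, seed=42)
```

With an explicit seed, both approaches give reproducible folds; the difference is that modulo folds stay the same even when no seed is supplied.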

Maybe I am not thinking about this correctly, but if a user explicitly sets the seed and also turns on TE, they will get different folds than if they had set the seed without turning on TE, right? That makes runs with TE on and off difficult to compare, which is why I am wondering whether it is better to force the user to set a seed (or provide a fold_column) when using TE. This would force the user to be reproducible when TE is turned on.

The one drawback is that seed is not set by default in AutoML, so users will immediately get an error message if they turn on TE without setting a seed, setting fold_assignment = "modulo", or specifying fold_column. I don't know whether there are drawbacks to setting modulo folds (let's discuss).
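The proposed check could look roughly like the sketch below. This is only an illustration of the rule described above: `check_te_fold_settings` and its `te_enabled` flag are hypothetical names, not part of the H2O API, and the sketch assumes an unset seed is represented as `None`.

```python
def check_te_fold_settings(te_enabled, seed=None, fold_assignment="auto", fold_column=None):
    """Raise if TE is on and nothing guarantees reproducible folds.

    Hypothetical helper: the parameter names mirror the AutoML arguments
    mentioned in this ticket, but the function itself is an assumption.
    """
    if not te_enabled:
        return  # nothing to enforce when TE is off

    reproducible = (
        seed is not None
        or fold_column is not None
        or str(fold_assignment).lower() == "modulo"
    )
    if not reproducible:
        raise ValueError(
            "Target Encoding requires a seed, fold_assignment='modulo', "
            "or a fold_column so that TE and modeling use the same folds."
        )


# Any one of the three settings satisfies the check.
check_te_fold_settings(True, seed=1234)
check_te_fold_settings(True, fold_assignment="Modulo")
check_te_fold_settings(True, fold_column="fold")
```

Calling `check_te_fold_settings(True)` with the defaults raises, which is the immediate error the drawback above refers to.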

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7801
Assignee: Tomas Fryda
Reporter: Erin LeDell
State: Open
Fix Version: 3.42.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/6373