H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
I am not sure whether we need to do this, but I am opening a ticket to revisit the conversation we had on the [TE integration PR|https://github.com/h2oai/h2o-3/pull/4927#discussion_r490449726]:
Regarding the modulo folds (so that we can ensure the same folds are used in CV and modeling): another option, rather than forcing modulo folds, is to raise an error if the user does not set the seed argument when TE is turned on. If the user does set the seed, then we can ensure the same folds across TE and modeling, right?
Maybe I am not thinking about this correctly, but if a user explicitly sets the seed and also turns on TE, they will get different folds than if they had set the seed without turning on TE, right? That makes runs with TE on and off difficult to compare, which is why I am wondering whether it is better to require the user to set a seed (or provide a fold_column) when using TE. This would force the user to be reproducible when TE is turned on.
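To make the comparison concern concrete, here is a minimal sketch in plain Python. It is not H2O's actual fold-assignment code: `modulo_folds` and `seeded_random_folds` are hypothetical helpers, and the random scheme is only a stand-in for whatever a seeded random assignment does internally. It shows that both schemes are reproducible, but they generally put the same row into different folds.

```python
import random


def modulo_folds(n_rows, nfolds):
    """Assign row i to fold i % nfolds (deterministic, no seed involved)."""
    return [i % nfolds for i in range(n_rows)]


def seeded_random_folds(n_rows, nfolds, seed):
    """Assign each row to a pseudo-random fold, reproducible for a given seed.

    Assumption: this is an illustrative stand-in, not H2O's algorithm.
    """
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


# Same seed, same folds: a seeded run is reproducible on its own.
run_a = seeded_random_folds(100, 5, seed=42)
run_b = seeded_random_folds(100, 5, seed=42)

# Modulo folds are also reproducible, but they are a different partition,
# so a "TE on" run with modulo folds is not directly comparable to a
# "TE off" run that used seeded random folds.
te_on = modulo_folds(100, 5)
```

If this is a fair picture, then requiring a seed (or a fold_column) keeps the TE on and TE off runs on the same partition, whereas forcing modulo folds only for TE does not.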
The one drawback is that seed is not set by default in AutoML, so users will immediately get an error message if they try to turn on TE without setting a seed, setting fold_assignment = “modulo”, or specifying fold_column. I don’t know whether there are drawbacks to setting modulo folds (let’s discuss).
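The check being proposed could look roughly like the sketch below. This is only an illustration in plain Python: `check_te_fold_settings` and its parameter names are hypothetical, and it assumes an unset seed is represented as `None`. The real validation would live in the AutoML code and may represent these settings differently.

```python
def check_te_fold_settings(te_enabled, seed=None, fold_assignment="auto",
                           fold_column=None):
    """Raise if TE is on but the folds are not guaranteed to be reproducible.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    if not te_enabled:
        return  # nothing to validate when TE is off

    reproducible = (
        seed is not None                        # user fixed the seed
        or fold_assignment.lower() == "modulo"  # deterministic assignment
        or fold_column is not None              # user supplied the folds
    )
    if not reproducible:
        raise ValueError(
            "TE requires reproducible folds: set seed, "
            "set fold_assignment='modulo', or provide fold_column."
        )


# Any one of the three settings is enough to pass the check.
check_te_fold_settings(True, seed=1234)
check_te_fold_settings(True, fold_assignment="Modulo")
check_te_fold_settings(True, fold_column="fold")
```

With AutoML's defaults (no seed), calling this with only `te_enabled=True` raises, which is exactly the drawback described above.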