H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Motivation: We need to be able to compete more strongly in competitions; however, some of the techniques we’d use when competing on the basis of model accuracy are not what we normally do when we first approach a problem and/or build models for production. So we are offering a new parameter, {{mode}}, to switch between modes, where the default is the current AutoML algorithm. This is the “presets” idea we have been planning for a while.
Initial options: {{["explore", "compete"]}}
{{explore}} (Default). This is the current AutoML, though we will continue to evolve the default mode over time too. We will leave Deep Learning in here by default, but in a future release we will make an effort to improve the DNN grids. The name is chosen to reflect what this mode is good at: exploring a wide variety of algorithms, including stacked ensembles, efficiently, and it is meant to perform well across a wide variety of tasks. The expectation is that there will always be MOJOs for all the models returned here.
{{compete}} This will start off as a slight tweak to the default mode, and will evolve over time to focus more exclusively on model performance. 3.32.1.1 will feature two additional Stacked Ensembles (logit transforms). For classification, we will turn off the DNN grids because they are not pulling their weight on most datasets compared to the tree-based ensembles like XGBoost and GBM, but we will keep the single, default DNN for stacking purposes. For regression, we will keep the DNN searches intact (unless manually turned off using {{exclude_algos}}), because Deep Learning seems to perform a bit better on regression tasks.
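The classification/regression split above can be sketched in plain Python. This is a hypothetical helper, not the actual AutoML internals; the algorithm names follow the {{exclude_algos}} conventions:

```python
def compete_algos(task, exclude_algos=()):
    """Rough sketch of the compete-mode selection described above.

    Hypothetical helper, not H2O's implementation: `algos` are the single
    default models, `grids` are the hyperparameter searches.
    """
    algos = {"GLM", "DRF", "GBM", "XGBoost", "DeepLearning"}
    grids = {"GBM", "XGBoost"}
    if task == "regression":
        # Keep the DNN searches intact for regression.
        grids |= {"DeepLearning"}
    # For classification the DNN grids stay off, but the single default
    # DeepLearning model remains in `algos` for stacking purposes.
    algos -= set(exclude_algos)
    grids -= set(exclude_algos)
    return algos, grids
```

Note that {{exclude_algos}} removes an algorithm from both the single models and the grids, which is how a user would manually turn off the regression DNN search.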
We plan to add one or more other options in the future, but for right now, this will suffice.
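As a rough sketch of how the {{mode}} presets could be validated and resolved (the preset table and setting names below are assumptions from this proposal, not the shipped API):

```python
# Hypothetical preset table for the proposed `mode` parameter; the keys and
# settings are assumptions based on this proposal, not the actual H2O API.
AUTOML_PRESETS = {
    "explore": {
        "dnn_grids": True,               # current default AutoML behavior
        "extra_logit_ensembles": False,
    },
    "compete": {
        "dnn_grids": "regression_only",  # see classification/regression notes above
        "extra_logit_ensembles": True,   # two extra SEs with logit transforms
    },
}

def resolve_mode(mode="explore"):
    """Validate `mode` and return its preset settings (sketch only)."""
    if mode not in AUTOML_PRESETS:
        raise ValueError(f"mode must be one of {sorted(AUTOML_PRESETS)}, got {mode!r}")
    return AUTOML_PRESETS[mode]
```

Unknown modes raise an error, which leaves room for the additional options planned later.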
Technical notes:
In compete mode, the extra stacked ensembles can use model ids that get indexed like the other algos. We have some options:
We could keep the StackedEnsemble_AllModels and StackedEnsemble_BestOfFamily model names from explore mode, and then start the two new ones at 1, like StackedEnsemble_AllModels_1? Or at 2?
We could index all the SE models starting at 1, including the two normal ones from default mode, so we have StackedEnsemble_AllModels_1 and StackedEnsemble_AllModels_2 (for now).
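The two naming options can be made concrete with a small sketch (hypothetical helper; the exact names are the open question above):

```python
def se_model_ids(n_extra=2, scheme="index_all"):
    """Sketch of the two SE naming options discussed above (hypothetical).

    scheme="keep_default": explore-mode names stay unsuffixed; only the new
                           ensembles are indexed, starting at 1.
    scheme="index_all":    every AllModels SE gets an index starting at 1,
                           including the one from default mode.
    """
    if scheme == "keep_default":
        names = ["StackedEnsemble_AllModels", "StackedEnsemble_BestOfFamily"]
        names += [f"StackedEnsemble_AllModels_{i}" for i in range(1, n_extra + 1)]
    else:
        names = ["StackedEnsemble_AllModels_1", "StackedEnsemble_BestOfFamily_1"]
        names += [f"StackedEnsemble_AllModels_{i}" for i in range(2, n_extra + 2)]
    return names
```

With the two extra compete-mode ensembles, the first scheme yields a mix of unsuffixed and suffixed IDs, while the second yields uniformly indexed IDs, which may matter for the model_id shortening done by {{h2o.explain()}}.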
Check that compete mode works with {{h2o.explain()}}; double-check that the way we shorten model_ids for the plots doesn’t break with the new SE model_ids in compete mode. ([~accountid:5e43370f5a495e0c91a74ebe])
Updates to AutoML User Guide ([~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c])
Add {{metalearner_transform}} to Stacked Ensemble User Guide ([~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c])
New parameter for AutoML: {{mode}}