h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Increase L1-regularization in the default Stacked Ensemble GLM metalearner #7653

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We will need to test this out / benchmark it; however, I think we would benefit from adding more L1-regularization to the default Stacked Ensemble GLM metalearner, because the “bad” base models are not getting zeroed out enough in some cases.

Right now we use the GLM default, which is alpha = 0.5. I think we should try 1 (full Lasso) and also a few more values closer to 1. This would also produce a more efficient ensemble (with fewer active base models). We could also consider making alpha dynamic, based on the number of base learners (more learners → higher alpha).
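
As a rough illustration of those two options, here is a minimal Java sketch built around the same `GLMParameters._alpha` field referenced later in this thread; the mapping from ensemble size to alpha is purely hypothetical, not an existing H2O default:

```java
import hex.glm.GLMModel.GLMParameters;

public class MetalearnerAlphaSketch {
    // Hypothetical heuristic: start at the current 0.5 default and push alpha
    // toward 1 (full Lasso) as the number of base models grows, capping at 1.
    static double dynamicAlpha(int nBaseModels) {
        return Math.min(1.0, 0.5 + 0.05 * nBaseModels);
    }

    public static void main(String[] args) {
        GLMParameters parms = new GLMParameters();
        // Option A: a single, stronger L1 setting (full Lasso).
        parms._alpha = new double[] {1.0};
        // Option B: scale alpha with the ensemble size (20 base models here).
        parms._alpha = new double[] {dynamicAlpha(20)};
        System.out.println("metalearner alpha = " + parms._alpha[0]);
    }
}
```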

Alternatively, we can make the default metalearner do a grid search over alpha (or we can just do that grid search only in AutoML Stacked Ensembles…)

alpha:
Distribution of regularization between the L1 (Lasso) and L2 (Ridge) penalties. A value of 1 for alpha represents Lasso regression, a value of 0 produces Ridge regression, and anything in between specifies the amount of mixing between the two. Default value of alpha is 0 when SOLVER = 'L-BFGS'; 0.5 otherwise.
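
For context, alpha mixes the two penalties in the usual elastic net form (λ is the overall regularization strength):

$$
P(\alpha, \beta) = \lambda \left( \alpha \, \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \, \lVert \beta \rVert_2^2 \right)
$$

With alpha = 1 only the L1 term remains, which is the mechanism that can drive the coefficients of weak base models exactly to zero.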

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Now that https://h2oai.atlassian.net/browse/PUBDEV-7481 is resolved, we should be able to enable multiple alphas in the default metalearner and uncomment this:

```java
//parms._alpha = new double[] {0.0, 0.25, 0.5, 0.75, 1.0};
```

I have some concerns regarding the training duration of the SEs with this, though: we already know that SEs take a long time on some datasets, and adding more alphas will slow them down even further, so we need to decide if/how to include SEs in the global runtime constraint.

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Though it would produce better models, I don’t know that we need to do an alpha search by default (but we should try to find a better single alpha to use instead of 0.5, which I don’t think is strong enough). Once we enable the different presets/modes in AutoML, an alpha search could be used in the ‘compete’ mode. Or if it’s really working well and doesn’t take too much time, we can make alpha search a default in regular mode too.

Regarding the SE runtime topic: one option is to find a way to estimate SE time (so we can include it in the global runtime constraint), either as a single estimate used across all datasets (e.g. 7% of the global runtime reserved for SE), or as a formula that is dataset/resource dependent.
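
A minimal sketch of the first option, assuming a fixed fraction of the AutoML `max_runtime_secs` budget is reserved for the SEs (the 7% figure is just the example above; the class and variable names are illustrative, not existing H2O code):

```java
public class SeBudgetSketch {
    // Single, dataset-independent estimate: reserve a fixed share of the
    // global AutoML runtime for the Stacked Ensembles.
    static double seBudgetSecs(double maxRuntimeSecs, double seFraction) {
        return seFraction * maxRuntimeSecs;
    }

    public static void main(String[] args) {
        double maxRuntimeSecs = 3600;                        // assumed global budget (1 hour)
        double seSecs = seBudgetSecs(maxRuntimeSecs, 0.07);  // e.g. 7% reserved for SEs
        double baseSecs = maxRuntimeSecs - seSecs;           // remainder for the base models
        System.out.printf("SE budget: %.0f s, base models: %.0f s%n", seSecs, baseSecs);
    }
}
```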

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-7991
Assignee: UNASSIGNED
Reporter: Erin LeDell
State: Open
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A