h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Implement Blending for Stacked Ensembles #11559

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

This is the version of stacking where you don't use the cross-validated predictions (cv-preds) to train the metalearner; instead, you score the base models on a holdout set and use those predicted values.

I'm not sure yet whether this should go into the existing Stacked Ensemble class or whether we should create a new one specifically for this case. The resulting model is the same either way, so it should probably use Stacked Ensemble (with relaxed restrictions on the input models).

There are two main motivations here:

Once we add this, we can add support for this in AutoML as well.
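
For concreteness, here is a minimal end-to-end sketch of the blending flow described above, using generic scikit-learn-style fit/predict models rather than H2O's actual API; all names are illustrative.

```python
import numpy as np

def blend(base_learners, metalearner, train_X, train_y, holdout_X, holdout_y):
    # 1. Train each base model on the training set (no cross-validation needed).
    for m in base_learners:
        m.fit(train_X, train_y)
    # 2. Score the base models on the holdout set; these predictions become
    #    the metalearner's features (the "level-one" data).
    level_one = np.column_stack([m.predict(holdout_X) for m in base_learners])
    # 3. Train the metalearner on the holdout predictions vs. the holdout target.
    metalearner.fit(level_one, holdout_y)
    return base_learners, metalearner
```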

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: ref: https://support.h2o.ai/helpdesk/tickets/90970

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Had a first look at this: we should be able to keep StackedEnsemble for both the stacking and blending strategies, as the main difference is in how the level-one frame is built. Will delegate this logic to some StackingStrategy class.
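
As a rough illustration of that delegation (H2O's actual implementation lives in the Java backend; the class and method names below are hypothetical), the two strategies would differ only in how they assemble the level-one frame:

```python
from functools import reduce

def cbind(frames):
    # Column-bind a list of H2OFrame-like objects (H2OFrame.cbind is a real method).
    return reduce(lambda a, b: a.cbind(b), frames)

class CVStackingStrategy:
    def level_one_frame(self, base_models, training_frame, y):
        # Reuse each base model's cross-validated holdout predictions.
        preds = [m.cross_validation_holdout_predictions() for m in base_models]
        return cbind(preds + [training_frame[y]])

class BlendingStrategy:
    def level_one_frame(self, base_models, blending_frame, y):
        # Score each base model on the held-out blending frame instead.
        preds = [m.predict(blending_frame) for m in base_models]
        return cbind(preds + [blending_frame[y]])
```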

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Sounds good. Let's discuss the API... I was thinking we could add a new arg called `holdout_frame` that would be used to train the metalearner.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [Erin LeDell] I was first thinking about reusing the validation frame, but now I see that would be wrong, since the holdout frame is then used for training the metalearner, so yeah... we need another one. Not much a fan of `holdout_frame`, to be honest, because validation and test sets are also holdout frames, and the word is overloaded in ML in general, meaning various sets depending on the context. I'd be more explicit: `blending_training_frame`, `second_level_training_frame`, or something along those lines.

Also, I wanted to ask: I've seen two main approaches to blending:

one uses only the model predictions on the holdout set as features for the blender;

the other adds those predictions to the existing features of the holdout set before using the result as the training set for the blender.

Which one do we want? The first approach seems more in the spirit of stacking, and I'm wondering whether the second even makes sense for datasets with a large number of predictors. wdyt?
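
To make the two options concrete, here is a hedged sketch of the two candidate level-one frames, assuming H2OFrame-like objects with a cbind method; `base_models`, `holdout`, feature names `x`, and target `y` are assumed, and real code would typically keep only the class-probability columns of each prediction:

```python
from functools import reduce

preds = [m.predict(holdout) for m in base_models]

# Approach #1: predictions only (plus the target), the traditional blending setup.
level_one_v1 = reduce(lambda a, b: a.cbind(b), preds).cbind(holdout[y])

# Approach #2: the original holdout features plus those same predictions.
level_one_v2 = holdout[x].cbind(level_one_v1)
```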

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: [Sebastien Poirier] Yeah, good point, the name "holdout" is ambiguous. Either `blending_training_frame` or `blending_frame` seems like a good option.

Approach #1 is what we want; that's the traditional approach. I've talked to [~accountid:557058:391327fd-0326-4a45-8dcd-7a42c5142fca] and others about this, and they have not seen much value in approach #2 in practice.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [Erin LeDell] After implementing this with the new `blending_frame` parameter added to `StackedEnsembleModel`, I'm wondering why we don't simply use `training_frame` as usual plus a `stacking_mode` param. The fact is that when a `blending_frame` is provided, we make no use of the `training_frame` itself.

The only API that would benefit from an additional frame is AutoML... and even there I'm not sure; most likely we would just split the `training_frame` internally and keep one split for the SE. Any thoughts?

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: The only use I can see for providing both `training_frame` and `blending_frame` to the SE model is to ensure that all base models have been trained on a similar frame, and I'm not sure that should even be a requirement. Currently, with CV-stacking, we don't require this; we only require that the base models have all been trained on frames of the same length, which makes sense because the `level_one_frame` is built from the cv-predictions on that frame. For blending, though, I don't see why we should require this.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: OK, I think I found the reason we need to keep `training_frame`: it is necessary for computing SE model `training_metrics` that are comparable to the base models' `training_metrics`. In that case, I'm leaving things as they are now, with the `blending_frame` also acting as a trigger for "blending mode". I'll also ensure that the `training_frame` passed to the SE has the same length as the one used to train the base models, as we currently do with CV stacking.
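
For reference, a minimal sketch of how the finished API is meant to be used from Python, assuming the `blending_frame` parameter described above and H2O's standard estimator names; the dataset handling is purely illustrative:

```python
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2ORandomForestEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()

# Illustrative dataset; any binomial classification frame works.
df = h2o.import_file("train.csv")  # hypothetical path
y = "label"                        # hypothetical target column
x = [c for c in df.columns if c != y]
df[y] = df[y].asfactor()
train, blend = df.split_frame(ratios=[0.8], seed=1)

# Base models trained WITHOUT cross-validation (not needed for blending).
gbm = H2OGradientBoostingEstimator(seed=1)
gbm.train(x=x, y=y, training_frame=train)
drf = H2ORandomForestEstimator(seed=1)
drf.train(x=x, y=y, training_frame=train)

# Passing blending_frame switches the ensemble into blending mode;
# training_frame is still supplied so training_metrics stay comparable.
se = H2OStackedEnsembleEstimator(base_models=[gbm, drf], blending_frame=blend)
se.train(x=x, y=y, training_frame=train)
```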

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-4680
Assignee: Sebastien Poirier
Reporter: Erin LeDell
State: Closed
Fix Version: 3.24.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/3199