h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

AutoML: terrible predictions from SE on MNIST #8761

Closed by exalate-issue-sync[bot] 1 year ago

exalate-issue-sync[bot] commented 1 year ago

From [~accountid:5bdad16c3abe092e841f782e]'s Kaggle kernel: https://www.kaggle.com/tunguz/mnist-with-h2o-automl?scriptVersionId=20451566

Reproducing requires the SE to be the leading model, built on top of 3 XGBoost, 1 DRF, and 1 (bad) GLM: 8h of training on MNIST.

The leading SE's predictions on the test set contain only 2 distinct values.
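
For reference, a minimal sketch of that kind of run with the Python API (file paths and the exact runtime are placeholders, not the kernel's settings):

{code}
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Placeholder paths; the original run used the Kaggle MNIST CSVs.
train = h2o.import_file("mnist_train.csv")
test = h2o.import_file("mnist_test.csv")

y = "label"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()  # multinomial classification

aml = H2OAutoML(max_runtime_secs=8 * 3600, seed=1)  # ~8h, as in the kernel
aml.train(x=x, y=y, training_frame=train)

preds = aml.leader.predict(test)  # the leader here was the (broken) SE
print(preds["predict"].table())   # distinct predicted classes and their counts
{code}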

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:5bdad16c3abe092e841f782e] FYI

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Was able to reproduce and analyze the problem.

I strongly suspect that the origin of the issue is the GLM model built during the AutoML training phase (not the metalearner GLM). It appears that whenever we obtain an SE that gives terrible predictions on the test data, the GLM model itself was also terrible. Looking at the logs, we can see that this model was trained very quickly as the AutoML run was reaching its time limit, and the xval + final GLMs don't even predict {{class 9}}, for example (a quick check for this is sketched below).

  1. I'm just surprised that the SE can still be so strongly influenced by a single very bad model: shouldn't we simply remove a model from the stack if its score metric is an outlier compared with the other models?
  2. The interruption of the training of the xval GLM models seems to happen a bit too quickly: I'm afraid I need to review those time constraints again in the case of GLM...
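
A generic way to spot such a degenerate model is to look at the distribution of its predicted classes (a sketch, not taken from the original logs):

{code}
# Assuming `glm` is the suspect model and `test` a held-out H2OFrame.
preds = glm.predict(test)

# Counts per predicted class: a healthy multinomial model on MNIST should
# cover all 10 digits; here the bad GLM never predicts class 9.
print(preds["predict"].table())
{code}
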
exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: Was able to reproduce on a sample of the original dataset after forging an SE made of 5 decent GBMs + 1 very bad GLM (produced by training it for only a few seconds). The resulting SE was itself terrible, whereas an SE with only the GBMs is at least as good as the best GBM…
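
A sketch of how such an ensemble can be forged with the Python API (dataset path, model parameters, and the few-second budget are assumptions, not the original setup):

{code}
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2OGeneralizedLinearEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()
train = h2o.import_file("mnist_sample.csv")  # placeholder: sample of MNIST
y = "label"
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()

# Base models must share folds and keep their xval predictions for stacking.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

gbms = []
for i in range(5):  # 5 decent GBMs
    gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=3 + i, **common)
    gbm.train(x=x, y=y, training_frame=train)
    gbms.append(gbm)

# 1 very bad GLM: interrupt its training after a few seconds.
bad_glm = H2OGeneralizedLinearEstimator(family="multinomial",
                                        max_runtime_secs=3, **common)
bad_glm.train(x=x, y=y, training_frame=train)

base = [m.model_id for m in gbms]

se_all = H2OStackedEnsembleEstimator(base_models=base + [bad_glm.model_id])
se_all.train(x=x, y=y, training_frame=train)

se_ctrl = H2OStackedEnsembleEstimator(base_models=base)  # control: GBMs only
se_ctrl.train(x=x, y=y, training_frame=train)
{code}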

This shows that, by default, our SE gives too much weight to bad models.

Things to test:

Suggestion for now is to add a feature to SE that will automatically remove from the stack the outlier models that can negatively impact the SE.

We should consider whether we want to expose this parameter in the SE API on clients (e.g. {{prune_outliers}}). In any case, it will be enabled for {{AutoML}}.

Outliers will be identified as follows:
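
As a purely hypothetical illustration of such a rule (an IQR fence on the base models' cross-validation errors; neither {{prune_outliers}} nor this helper is an existing H2O API):

{code}
import numpy as np

def prune_outlier_models(model_ids, errors, k=1.5):
    """Hypothetical helper: drop models whose xval error is an IQR outlier."""
    q1, q3 = np.percentile(errors, [25, 75])
    cutoff = q3 + k * (q3 - q1)  # classic Tukey fence on the error metric
    return [m for m, e in zip(model_ids, errors) if e <= cutoff]

# Example: the bad GLM's error (0.9) sits far above the GBMs' (~0.1),
# so it would be pruned from the stack.
kept = prune_outlier_models(
    ["gbm1", "gbm2", "gbm3", "gbm4", "gbm5", "bad_glm"],
    [0.10, 0.11, 0.09, 0.10, 0.12, 0.90])
assert "bad_glm" not in kept
{code}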

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] FYI

Made some tests with a toy problem:

From this state, train various Stacked Ensemble models. Each SE is trained once with all models (GBMs + bad GLM) and once without the bad model (control model): see the {{incl. bad model}} column. Each variant is first trained with default params (passing only the training frame), and then with one param changed:
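
The variations in the table can be reproduced along these lines (a sketch reusing {{base}}, {{bad_glm}}, {{x}}, {{y}} and {{train}} from the forging sketch above; {{valid}} and {{blend}} are assumed extra holdout frames):

{code}
stack = base + [bad_glm.model_id]  # switch to `base` for the control SE

# Default config: GLM metalearner, stacking mode, training frame only.
se = H2OStackedEnsembleEstimator(base_models=stack)
se.train(x=x, y=y, training_frame=train)

# With a validation frame (the "validation" column below).
se_val = H2OStackedEnsembleEstimator(base_models=stack)
se_val.train(x=x, y=y, training_frame=train, validation_frame=valid)

# Blending mode: the metalearner is fit on a holdout frame instead of
# the cross-validated predictions.
se_blend = H2OStackedEnsembleEstimator(base_models=stack, blending_frame=blend)
se_blend.train(x=x, y=y, training_frame=train)

# DRF metalearner instead of the default GLM.
se_drf = H2OStackedEnsembleEstimator(base_models=stack,
                                     metalearner_algorithm="drf")
se_drf.train(x=x, y=y, training_frame=train)
{code}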

Here are the results:

{code}
| algo | mode     | incl. bad model | validation | training score | test score | predicted classes              |
| GLM  | stacking | True            | False      | 0.9            | 0.9        | [9]                            |
| GLM  | stacking | False           | False      | 0.00209        | 0.10563    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | True            | True       | 0.00235        | 0.10721    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | False           | True       | 0.00209        | 0.10667    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | True            | False      | 0.00943        | 0.09421    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | False           | False      | 0.00371        | 0.09944    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | True            | False      | 0.00387        | 0.1092     | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | False           | False      | 0.0026         | 0.10268    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
{code}

Only the default config seems to cause an issue.

Please also note that with the default config, the SE does worse than its worst model (the bad GLM).

Given those results, I'm now reluctant to remove the "outlier models" from the stack as previously planned.

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: After further investigation, it appears that disabling GLM {{standardization}} for the metalearner fixes the issue. Also, analyzing {{predict}} behaviour showed that it works as expected, even with the broken SE.

Comparing beta coefficients on the GLM metalearner for the good SE and the broken one on a toy problem:

Good SE (standardization disabled): intercept + betas associated with the features from the bad model (rows are feature indices, one column per target class):

{code}
|  0 | 0      | 0       | 0.26224 | 0       | 0.289347 | 0        | 0       | 0       | 0.358579 | 0.275356 |
| ... |
| 51 | 3.0863 | 0       | 0       | 0       | 0        | 0        | 0       | 0       | 0        | 0.270053 |
| 52 | 0      | 1.59484 | 0       | 0       | 0        | 0        | 0       | 0       | 0        | 0        |
| 53 | 0      | 0       | 3.02257 | 0       | 0        | 0        | 0       | 0       | 0        | 0        |
| 54 | 0      | 0       | 0       | 3.507   | 0        | 0        | 0       | 0       | 0        | 0        |
| 55 | 0      | 0       | 0       | 0       | 1.33707  | 0        | 0       | 0       | 0        | 1.00134  |
| 56 | 0      | 0       | 0       | 0       | 0        | 0.322885 | 0       | 0       | 0.165709 | 0        |
| 57 | 0      | 0       | 0       | 0       | 0        | 0        | 3.62586 | 0       | 0        | 0        |
| 58 | 0      | 0       | 0       | 0       | 0        | 0        | 0       | 4.01486 | 0        | 0        |
| 59 | 0      | 0       | 0       | 0       | 0        | 0        | 0       | 0       | 1.34281  | 0        |
| 60 | 0      | 0       | 0       | 0       | 0        | 0        | 0       | 0       | 0        | 0        |
{code}

Bad SE (standardization enabled): intercept + betas associated with the features from the bad model:

{code}
|  0 | -7.64345 | -5.85679   | -8.30032 | -10.0426 | -15.1905 | -9.02275    | -7.20688 | -8.52417 | -9.23724 | -6.68242 |
| ... |
| 51 | 7.04674  | 0          | 0        | 0        | 0        | 0           | 0        | 0        | 0        | 2.3978   |
| 52 | 0        | 16.7166    | 7.36956  | 8.81905  | 9.16294  | 0           | 1.43496  | 1.76902  | 0.284465 | 0        |
| 53 | 0        | 0.0409441  | 25.4871  | 1.58966  | 0        | 17.9711     | 1.91301  | 4.89572  | 7.9763   | 0        |
| 54 | 0        | 0.00103425 | 8.586    | 45.0089  | 0        | 8.8245      | 0        | 3.4896   | 0        | 0        |
| 55 | 0        | 0.000485   | 0        | 0        | 51.9333  | 0           | 1.46407  | 10.538   | 2.44376  | 11.6284  |
| 56 | 31.9851  | 4.06781    | 0        | 6.85689  | 28.8442  | 37.0869     | 0        | 0        | 23.0725  | 0        |
| 57 | 4.03292  | 0          | 3.83829  | 2.95218  | 18.9302  | 0           | 32.0586  | 0        | 0        | 0        |
| 58 | 0        | 0          | 0        | 2.35899  | 0        | 0           | 0        | 27.642   | 1.03843  | 0        |
| 59 | 0        | 0.360441   | 12.35    | 0        | 2.51519  | 0.000824895 | 0        | 0        | 31.8062  | 8.06343  |
| 60 | 105.623  | 0          | 0        | 0        | 0        | 0           | 0        | 370.862  | 0        | 1317.29  |
{code}

As we can see, when standardization is enabled, the betas associated with the bad model can take extreme values (all other betas satisfy 0 < beta < 2), so the metalearner training is responsible for the difference.
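
A sketch of the corresponding workaround with the Python API, assuming the {{metalearner_params}} option and reusing the frames and {{base}} models from the earlier sketches:

{code}
# Rebuild the SE, explicitly disabling standardization in the GLM metalearner.
se_fixed = H2OStackedEnsembleEstimator(
    base_models=base + [bad_glm.model_id],
    metalearner_algorithm="glm",
    metalearner_params={"standardize": False})  # GLM's `standardize` flag
se_fixed.train(x=x, y=y, training_frame=train)
{code}

On builds that include the fix (3.26.0.8, per the migration info below), this workaround should no longer be necessary.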

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: !legacy_sefix_binary.png|width=631,height=278!

!legacy_sefix_multiclass.png|width=652,height=262!

Ran a short AutoML benchmark, legacy vs. SE fix, to ensure that the fix doesn't degrade performance. Used these benchmark definitions:

https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/small-8c1h.yaml
https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/medium-8c1h.yaml

All tasks, but only 1 fold, with the 1h constraint.

Compared with reference h2oautoml results (used only fold 0 for comparison):

https://github.com/openml/automlbenchmark/blob/v1.0/reports/results_small-8c1h.csv

https://github.com/openml/automlbenchmark/blob/v1.0/reports/results_medium-8c1h.csv

The fix doesn't seem to have any significant impact on performance (see the charts above).

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: After a second run of {{jungle-chess}} against the fix branch, the result was much closer to the h2oautoml reference this time (still slightly below, though): funny how seeds can be significant sometimes.

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-6874
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.26.0.8
Attachments: Available (Count: 2)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/3960

Attachments From Jira

Attachment Name: legacy_sefix_binary.png
Attached By: Sebastien Poirier
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_binary.png

Attachment Name: legacy_sefix_multiclass.png
Attached By: Sebastien Poirier
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_multiclass.png