Sebastien Poirier commented: [~accountid:5bdad16c3abe092e841f782e] FYI
Sebastien Poirier commented: Was able to reproduce and analyze the problem.
I strongly suspect that the origin of the issue is the GLM model built during the AutoML training phase (not the one used as metalearner). It appears that when we obtain an SE that gives terrible predictions on the test data, the GLM model itself was also terrible. Looking at the logs, we can see that this model was trained very quickly as the AutoML run was reaching its time limit, and the xval + final GLMs don't even predict any {{class 9}}, for example.
Sebastien Poirier commented: Was able to reproduce on a sample of the original dataset after forging an SE made of 5 decent GBMs + 1 very bad GLM (produced by training it for only a few seconds). The resulting SE was itself terrible, whereas an SE with only the GBMs is at least as good as the best GBM…
This shows that, by default, our SE assigns too much weight to bad models.
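For reference, a minimal sketch of such a forged SE from the Python client, assuming an MNIST-like multiclass frame; the file name, model counts and hyperparameters here are illustrative, not the exact repro:
{code}
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2OGeneralizedLinearEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()
train = h2o.import_file("mnist_sample.csv")   # placeholder: sample of the original dataset
y = "label"
train[y] = train[y].asfactor()
x = [c for c in train.columns if c != y]

# base models must share folds and keep xval predictions for stacking
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

# 5 decent GBMs (varying depth, as an example)
gbms = []
for depth in [3, 5, 7, 9, 11]:
    gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=depth, **common)
    gbm.train(x=x, y=y, training_frame=train)
    gbms.append(gbm)

# 1 very bad GLM: deliberately starved of training time
bad_glm = H2OGeneralizedLinearEstimator(family="multinomial",
                                        max_runtime_secs=2, **common)
bad_glm.train(x=x, y=y, training_frame=train)

# SE over all base models vs. SE over the GBMs only (control)
se_all = H2OStackedEnsembleEstimator(base_models=gbms + [bad_glm])
se_all.train(x=x, y=y, training_frame=train)
se_ctrl = H2OStackedEnsembleEstimator(base_models=gbms)
se_ctrl.train(x=x, y=y, training_frame=train)
{code}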
Things to test:
The suggestion for now is to add a feature to SE that will automatically remove from the stack the outlier models that can impact the SE negatively.
We should consider whether we want to expose this parameter in the SE API on clients (e.g. {{prune_outliers}}). In any case, it will be enabled for {{AutoML}}.
Outliers will be identified as follows:
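(The actual identification criteria were still open at this point; purely as an illustration, a hypothetical {{prune_outliers}} pass could drop base models whose cross-validated metric is an extreme outlier among the candidates, e.g.:)
{code}
import statistics

def prune_outlier_models(models, z_thresh=2.0):
    # Hypothetical helper, not the shipped implementation: score each
    # candidate by its cross-validated logloss and drop the models that
    # are extreme outliers (high z-score = much worse than the group).
    scores = [m.logloss(xval=True) for m in models]
    mean, sd = statistics.mean(scores), statistics.stdev(scores)
    if sd == 0:
        return models
    return [m for m, s in zip(models, scores) if (s - mean) / sd <= z_thresh]
{code}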
Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] FYI
Ran some tests with a toy problem:
From this state, train various Stacked Ensemble models. Each SE is trained once with all models (GBMs + bad GLM) and once without the bad model (control model): see the {{incl. bad model}} column. Each variant is first trained with default params (passing only the training frame), and then one param is changed at a time, as sketched below:
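(A sketch of the variants trained here, reusing the names from the repro sketch above; {{valid}} and {{blend}} are assumed holdout splits of the same data:)
{code}
for models in (gbms + [bad_glm], gbms):      # with / without the bad model
    # default config: GLM metalearner, stacking on xval predictions
    se = H2OStackedEnsembleEstimator(base_models=models)
    se.train(x=x, y=y, training_frame=train)

    # with a validation frame
    se_val = H2OStackedEnsembleEstimator(base_models=models)
    se_val.train(x=x, y=y, training_frame=train, validation_frame=valid)

    # DRF metalearner instead of the default GLM
    se_drf = H2OStackedEnsembleEstimator(base_models=models,
                                         metalearner_algorithm="drf")
    se_drf.train(x=x, y=y, training_frame=train)

    # blending mode: metalearner fit on holdout predictions
    se_bld = H2OStackedEnsembleEstimator(base_models=models)
    se_bld.train(x=x, y=y, training_frame=train, blending_frame=blend)
{code}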
Here are the results:
{code}
| algo | mode     | incl. bad model | validation | training score | test score | predictions classes            |
|------|----------|-----------------|------------|----------------|------------|--------------------------------|
| GLM  | stacking | True            | False      | 0.9            | 0.9        | [9]                            |
| GLM  | stacking | False           | False      | 0.00209        | 0.10563    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | True            | True       | 0.00235        | 0.10721    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | False           | True       | 0.00209        | 0.10667    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | True            | False      | 0.00943        | 0.09421    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | False           | False      | 0.00371        | 0.09944    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | True            | False      | 0.00387        | 0.1092     | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | False           | False      | 0.0026         | 0.10268    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
{code}
Only the default config seems to cause an issue:
Please also note that with the default config, the SE does worse than its worst model (the bad GLM):
Given those results, I’m now reluctant to remove the “outlier models” from the stack as previously planned:
Sebastien Poirier commented: After further investigation, it appears that disabling GLM {{standardization}} for the metalearner fixes the issue. Also, analyzing {{predict}} behaviour showed that it works as expected, even with the broken SE.
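(A minimal sketch of that workaround from the Python client, passing {{standardize}} through {{metalearner_params}}; base models and frames as in the sketches above:)
{code}
se_fixed = H2OStackedEnsembleEstimator(
    base_models=gbms + [bad_glm],
    metalearner_algorithm="glm",
    metalearner_params={"standardize": False},  # disable GLM standardization
)
se_fixed.train(x=x, y=y, training_frame=train)
{code}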
Comparing beta coefficients of the GLM metalearner for the good SE and the broken one on a toy problem:
Good SE (standardization disabled): intercept + betas associated with features from the bad model
{code}
|  0 | 0 | 0 | 0.26224 | 0 | 0.289347 | 0 | 0 | 0 | 0.358579 | 0.275356 |
...
| 51 | 3.0863 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.270053 |
| 52 | 0 | 1.59484 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53 | 0 | 0 | 3.02257 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54 | 0 | 0 | 0 | 3.507 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55 | 0 | 0 | 0 | 0 | 1.33707 | 0 | 0 | 0 | 0 | 1.00134 |
| 56 | 0 | 0 | 0 | 0 | 0 | 0.322885 | 0 | 0 | 0.165709 | 0 |
| 57 | 0 | 0 | 0 | 0 | 0 | 0 | 3.62586 | 0 | 0 | 0 |
| 58 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.01486 | 0 | 0 |
| 59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.34281 | 0 |
| 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
{code}
Bad SE (standardization enabled): intercept + betas associated with features from the bad model
{code}
|  0 | -7.64345 | -5.85679 | -8.30032 | -10.0426 | -15.1905 | -9.02275 | -7.20688 | -8.52417 | -9.23724 | -6.68242 |
...
| 51 | 7.04674 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.3978 |
| 52 | 0 | 16.7166 | 7.36956 | 8.81905 | 9.16294 | 0 | 1.43496 | 1.76902 | 0.284465 | 0 |
| 53 | 0 | 0.0409441 | 25.4871 | 1.58966 | 0 | 17.9711 | 1.91301 | 4.89572 | 7.9763 | 0 |
| 54 | 0 | 0.00103425 | 8.586 | 45.0089 | 0 | 8.8245 | 0 | 3.4896 | 0 | 0 |
| 55 | 0 | 0.000485 | 0 | 0 | 51.9333 | 0 | 1.46407 | 10.538 | 2.44376 | 11.6284 |
| 56 | 31.9851 | 4.06781 | 0 | 6.85689 | 28.8442 | 37.0869 | 0 | 0 | 23.0725 | 0 |
| 57 | 4.03292 | 0 | 3.83829 | 2.95218 | 18.9302 | 0 | 32.0586 | 0 | 0 | 0 |
| 58 | 0 | 0 | 0 | 2.35899 | 0 | 0 | 0 | 27.642 | 1.03843 | 0 |
| 59 | 0 | 0.360441 | 12.35 | 0 | 2.51519 | 0.000824895 | 0 | 0 | 31.8062 | 8.06343 |
| 60 | 105.623 | 0 | 0 | 0 | 0 | 0 | 0 | 370.862 | 0 | 1317.29 |
{code}
As we can see, when standardization is enabled, the betas associated with the bad model can take extreme values (all the other betas lie between 0 and 2), so the metalearner training is responsible for the difference.
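(The tables above can be reproduced from the Python client along these lines; the exact shape of the multinomial coefficient output may differ between versions:)
{code}
meta = se_all.metalearner()      # the GLM metalearner of the ensemble
coefs = meta.coef()              # per-class betas, incl. the intercept
norm_coefs = meta.coef_norm()    # standardized betas, when standardization is on
{code}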
Sebastien Poirier commented: !legacy_sefix_binary.png|width=631,height=278!
!legacy_sefix_multiclass.png|width=652,height=262!
Ran a short AutoML benchmark, legacy vs. SE fix, to ensure that this fix doesn't degrade performance. Used benchmark definitions: https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/small-8c1h.yaml and https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/medium-8c1h.yaml
All tasks, but only 1 fold, for 1h each.
Compared with reference h2oautoml results (used only fold 0 for comparison):
The fix doesn’t seem to have any significant impact on perf:
Sebastien Poirier commented: After a second run of {{jungle-chess}} against the fix branch, the result was much closer to the h2oautoml reference this time (still slightly below, though): funny how seeds can be significant sometimes.
JIRA Issue Migration Info
Jira Issue: PUBDEV-6874
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.26.0.8
Attachments: Available (Count: 2)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/3960
Attachments From Jira
Attachment Name: legacy_sefix_binary.png Attached By: Sebastien Poirier File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_binary.png
Attachment Name: legacy_sefix_multiclass.png Attached By: Sebastien Poirier File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_multiclass.png
From [~accountid:5bdad16c3abe092e841f782e]'s Kaggle Kernel: https://www.kaggle.com/tunguz/mnist-with-h2o-automl?scriptVersionId=20451566
Requires SE to be the leading model on top of 3 XGB, 1 DRF, and 1 (bad) GLM: 8h of training on MNIST.
The predictions of the leading SE on the prediction set contain only 2 distinct values.
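(The symptom is easy to confirm from the Python client, assuming {{se}} is the leading model and {{test}} the prediction frame:)
{code}
preds = se.predict(test)
# count the distinct labels the SE actually emits; the broken
# ensemble collapses to only 2 of the 10 digit classes
print(preds["predict"].table())
{code}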