Sebastien Poirier commented: [~accountid:5bdad16c3abe092e841f782e] FYI
Sebastien Poirier commented: Was able to reproduce and analyze the problem.
I strongly suspect that the origin of the issue is the GLM model built during the AutoML training phase (not the one used as metalearner). It appears that when we obtain an SE that gives terrible predictions on the test data, the GLM model itself was also terrible. Looking at the logs, we can see that this model was trained very quickly as the AutoML run was reaching its time limit, and the xval + final GLMs don't even predict any {{class 9}}, for example.
Sebastien Poirier commented: Was able to reproduce on a sample of the original dataset after forging an SE made of 5 decent GBMs + 1 very bad GLM (produced by training it for only a few seconds). The resulting SE was itself terrible, whereas an SE with only the GBMs is at least as good as the best GBM…
This shows that, by default, our SE assigns too much weight to bad models.
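For reference, a minimal sketch of such a forged SE from the Python client, assuming an MNIST-like multiclass frame; the file name, model counts and hyperparameters here are illustrative, not the exact repro:
{code}
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2OGeneralizedLinearEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()
train = h2o.import_file("mnist_sample.csv")   # placeholder: sample of the original dataset
y = "label"
train[y] = train[y].asfactor()
x = [c for c in train.columns if c != y]

# base models must share folds and keep xval predictions for stacking
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

# 5 decent GBMs (varying depth, as an example)
gbms = []
for depth in [3, 5, 7, 9, 11]:
    gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=depth, **common)
    gbm.train(x=x, y=y, training_frame=train)
    gbms.append(gbm)

# 1 very bad GLM: deliberately starved of training time
bad_glm = H2OGeneralizedLinearEstimator(family="multinomial",
                                        max_runtime_secs=2, **common)
bad_glm.train(x=x, y=y, training_frame=train)

# SE over all base models vs. SE over the GBMs only (control)
se_all = H2OStackedEnsembleEstimator(base_models=gbms + [bad_glm])
se_all.train(x=x, y=y, training_frame=train)
se_ctrl = H2OStackedEnsembleEstimator(base_models=gbms)
se_ctrl.train(x=x, y=y, training_frame=train)
{code}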
Things to test:
The suggestion for now is to add a feature to SE that will automatically remove from the stack the outlier models that can impact the SE negatively.
We should consider whether we want to expose this parameter in the SE API on clients (e.g. {{prune_outliers}}). In any case, it will be enabled for {{AutoML}}.
Outliers will be identified as follows:
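(The actual identification criteria were still open at this point; purely as an illustration, a hypothetical {{prune_outliers}} pass could drop base models whose cross-validated metric is an extreme outlier among the candidates, e.g.:)
{code}
import statistics

def prune_outlier_models(models, z_thresh=2.0):
    # Hypothetical helper, not the shipped implementation: score each
    # candidate by its cross-validated logloss and drop the models that
    # are extreme outliers (high z-score = much worse than the group).
    scores = [m.logloss(xval=True) for m in models]
    mean, sd = statistics.mean(scores), statistics.stdev(scores)
    if sd == 0:
        return models
    return [m for m, s in zip(models, scores) if (s - mean) / sd <= z_thresh]
{code}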
Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] FYI
Ran some tests with a toy problem:
From this state, train various Stacked Ensemble models. Each SE is trained once with all models (GBMs + bad GLM) and once without the bad model (control model): see the {{incl. bad model}} column. Each variant is first trained with default params (passing only the training frame), and then one param is changed at a time, as sketched below:
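(A sketch of the variants trained here, reusing the names from the repro sketch above; {{valid}} and {{blend}} are assumed holdout splits of the same data:)
{code}
for models in (gbms + [bad_glm], gbms):      # with / without the bad model
    # default config: GLM metalearner, stacking on xval predictions
    se = H2OStackedEnsembleEstimator(base_models=models)
    se.train(x=x, y=y, training_frame=train)

    # with a validation frame
    se_val = H2OStackedEnsembleEstimator(base_models=models)
    se_val.train(x=x, y=y, training_frame=train, validation_frame=valid)

    # DRF metalearner instead of the default GLM
    se_drf = H2OStackedEnsembleEstimator(base_models=models,
                                         metalearner_algorithm="drf")
    se_drf.train(x=x, y=y, training_frame=train)

    # blending mode: metalearner fit on holdout predictions
    se_bld = H2OStackedEnsembleEstimator(base_models=models)
    se_bld.train(x=x, y=y, training_frame=train, blending_frame=blend)
{code}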
Here are the results:
{code}
| algo | mode     | incl. bad model | validation | training score | test score | predictions classes            |
|------|----------|-----------------|------------|----------------|------------|--------------------------------|
| GLM  | stacking | True            | False      | 0.9            | 0.9        | [9]                            |
| GLM  | stacking | False           | False      | 0.00209        | 0.10563    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | True            | True       | 0.00235        | 0.10721    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | stacking | False           | True       | 0.00209        | 0.10667    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | True            | False      | 0.00943        | 0.09421    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| GLM  | blending | False           | False      | 0.00371        | 0.09944    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | True            | False      | 0.00387        | 0.1092     | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
| DRF  | stacking | False           | False      | 0.0026         | 0.10268    | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] |
{code}
Only the default config seems to cause an issue:
Please also note that with the default config, the SE does worse than its worst model (the bad GLM):
Given those results, I’m now reluctant to remove the “outlier models” from the stack as previously planned:
Sebastien Poirier commented: After further investigation, it appears that disabling GLM {{standardization}} for the metalearner fixes the issue. Also, analyzing {{predict}} behaviour showed that it works as expected, even with the broken SE.
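(A minimal sketch of that workaround from the Python client, passing {{standardize}} through {{metalearner_params}}; base models and frames as in the sketches above:)
{code}
se_fixed = H2OStackedEnsembleEstimator(
    base_models=gbms + [bad_glm],
    metalearner_algorithm="glm",
    metalearner_params={"standardize": False},  # disable GLM standardization
)
se_fixed.train(x=x, y=y, training_frame=train)
{code}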
Comparing beta coefficients of the GLM metalearner for the good SE and the broken one on a toy problem:
Good SE (standardization disabled): intercept + betas associated with features from the bad model
{code}
|  0 | 0 | 0 | 0.26224 | 0 | 0.289347 | 0 | 0 | 0 | 0.358579 | 0.275356 |
...
| 51 | 3.0863 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.270053 |
| 52 | 0 | 1.59484 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53 | 0 | 0 | 3.02257 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54 | 0 | 0 | 0 | 3.507 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55 | 0 | 0 | 0 | 0 | 1.33707 | 0 | 0 | 0 | 0 | 1.00134 |
| 56 | 0 | 0 | 0 | 0 | 0 | 0.322885 | 0 | 0 | 0.165709 | 0 |
| 57 | 0 | 0 | 0 | 0 | 0 | 0 | 3.62586 | 0 | 0 | 0 |
| 58 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4.01486 | 0 | 0 |
| 59 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.34281 | 0 |
| 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
{code}
Bad SE (standardization enabled): intercept + betas associated with features from the bad model
{code}
|  0 | -7.64345 | -5.85679 | -8.30032 | -10.0426 | -15.1905 | -9.02275 | -7.20688 | -8.52417 | -9.23724 | -6.68242 |
...
| 51 | 7.04674 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2.3978 |
| 52 | 0 | 16.7166 | 7.36956 | 8.81905 | 9.16294 | 0 | 1.43496 | 1.76902 | 0.284465 | 0 |
| 53 | 0 | 0.0409441 | 25.4871 | 1.58966 | 0 | 17.9711 | 1.91301 | 4.89572 | 7.9763 | 0 |
| 54 | 0 | 0.00103425 | 8.586 | 45.0089 | 0 | 8.8245 | 0 | 3.4896 | 0 | 0 |
| 55 | 0 | 0.000485 | 0 | 0 | 51.9333 | 0 | 1.46407 | 10.538 | 2.44376 | 11.6284 |
| 56 | 31.9851 | 4.06781 | 0 | 6.85689 | 28.8442 | 37.0869 | 0 | 0 | 23.0725 | 0 |
| 57 | 4.03292 | 0 | 3.83829 | 2.95218 | 18.9302 | 0 | 32.0586 | 0 | 0 | 0 |
| 58 | 0 | 0 | 0 | 2.35899 | 0 | 0 | 0 | 27.642 | 1.03843 | 0 |
| 59 | 0 | 0.360441 | 12.35 | 0 | 2.51519 | 0.000824895 | 0 | 0 | 31.8062 | 8.06343 |
| 60 | 105.623 | 0 | 0 | 0 | 0 | 0 | 0 | 370.862 | 0 | 1317.29 |
{code}
As we can see, when standardization is enabled, the betas associated with the bad model can take extreme values (all the other betas lie between 0 and 2), so the metalearner training is responsible for the difference.
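(The tables above can be reproduced from the Python client along these lines; the exact shape of the multinomial coefficient output may differ between versions:)
{code}
meta = se_all.metalearner()      # the GLM metalearner of the ensemble
coefs = meta.coef()              # per-class betas, incl. the intercept
norm_coefs = meta.coef_norm()    # standardized betas, when standardization is on
{code}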
Sebastien Poirier commented: !legacy_sefix_binary.png|width=631,height=278!
!legacy_sefix_multiclass.png|width=652,height=262!
Ran a short AutoML benchmark, legacy vs. SE fix, to ensure that this fix doesn't degrade performance. Used benchmark definitions: https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/small-8c1h.yaml and https://github.com/openml/automlbenchmark/blob/v1.0/resources/benchmarks/medium-8c1h.yaml
All tasks, but only 1 fold, for 1h each.
Compared with reference h2oautoml results (used only fold 0 for comparison):
The fix doesn’t seem to have any significant impact on perf:
Sebastien Poirier commented: After a second run of {{jungle-chess}} against the fix branch, the result was much closer to the h2oautoml reference this time (still slightly below, though): funny how seeds can be significant sometimes.
JIRA Issue Migration Info
Jira Issue: PUBDEV-6874
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.26.0.8
Attachments: Available (Count: 2)
Development PRs: Available
Linked PRs from JIRA
https://github.com/h2oai/h2o-3/pull/3960
Attachments From Jira
Attachment Name: legacy_sefix_binary.png Attached By: Sebastien Poirier File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_binary.png
Attachment Name: legacy_sefix_multiclass.png Attached By: Sebastien Poirier File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-6874/legacy_sefix_multiclass.png
From [~accountid:5bdad16c3abe092e841f782e]'s Kaggle Kernel: https://www.kaggle.com/tunguz/mnist-with-h2o-automl?scriptVersionId=20451566
Requires SE to be the leading model on top of 3 XGB, 1 DRF, and 1 (bad) GLM: 8h of training on MNIST.
The predictions of the leading SE on the prediction set contain only 2 distinct values.
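(The symptom is easy to confirm from the Python client, assuming {{se}} is the leading model and {{test}} the prediction frame:)
{code}
preds = se.predict(test)
# count the distinct labels the SE actually emits; the broken
# ensemble collapses to only 2 of the 10 digit classes
print(preds["predict"].table())
{code}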