exalate-issue-sync[bot] commented 1 year ago

Hey there,

Some production code used for training models broke when no Stacked Ensembles were trained during several AutoML runs using the latest R version of H2O. Upon further inspection, we were able to reproduce the issue with the following code… it appears that specifying {{max_models}} creates a situation that contradicts documentation indicating Stacked Ensembles are always trained as part of AutoML, so reporting that with a reprex here:

{code:r}library(tidyverse) library(h2o) data(iris)

h2o.init()

iris_df <- iris %>% as_tibble() iris_df_h2o <- iris_df %>% as.h2o()

Stacked Ensemble does generate

aml <- h2o.automl( y = 'Species', training_frame = iris_df_h2o, max_runtime_secs = 60 )

Stacked Ensemble does not generate

aml2 <- h2o.automl( y = 'Species', training_frame = iris_df_h2o, max_runtime_secs = 60, max_models = 50, seed = 1, exploitation_ratio = .05 )

Stacked Ensemble does not generate

aml3 <- h2o.automl( y = 'Species', training_frame = iris_df_h2o, max_runtime_secs = 60, max_models = 50, seed = 1#,

exploitation_ratio = .05

Stacked Ensemble does not generate

aml4 <- h2o.automl( y = 'Species', training_frame = iris_df_h2o, max_runtime_secs = 60, max_models = 50#,

seed = 1,

    # exploitation_ratio = .05
)

Stacked Ensemble DOES generate

aml5 <- h2o.automl( y = 'Species', training_frame = iris_df_h2o, max_runtime_secs = 60,

max_models = 50,

    seed = 1,
    exploitation_ratio = .05
){code}

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: [~accountid:5cc0b0886fbf5a10040d2945] can you please tell which version you’re using as we did some changes in the last couple of major releases related to this.

Note that documentation from [https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#required-stopping-parameters|https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html#required-stopping-parameters|smart-link] also says:

{noformat}When both options are set, then the AutoML run will stop as soon as it hits either of these limits.{noformat}

this is what happens in your examples above, you’re setting both {{max_runtime_secs}} and {{max_models}}.

To give you a better insights, I’ll explain roughly how this works:

if providing only {{max_runtime_secs}}, then we try to distribute this training time accross various algos and SE, which allows the training of at least one SE as early as possible.
if providing only {{max_models}}, as we don’t have any time budget, we simply distribute the models to be trained accross various algos and then finally train the SEs.
if providing both, for consistency and reproducibility, the {{max_models}} actually takes precedence to understand how the training works and the {{max_runtime_secs}} is purely interpreted as a cap (individual models are not budgeted): the training will then stop after the limit, if there is at least 2 models, some SEs will also be trained, but if the time limit was too short to even fully train a model, then there’s no possibility to train any SE.

{quote}Some production code used for training models broke when no Stacked Ensembles were trained{quote}

You should not expect that you will always have SEs. The training of SEs could raise some error (for one reason or another) even if several base models were trained, and then the trained AutoML will still behave normally. Also, as mentioned, if we were not able to train at least 2 base models in the given time budget, then we can’t train any SE.

exalate-issue-sync[bot] commented 1 year ago

Kunal Mishra commented: Hey Sebastien,

As noted on the ticket, the version we were using was 3.36.1.2. I believe that’s also the latest version available in R via CRAN
Acknowledging this:

{quote}Some production code used for training models broke when no Stacked Ensembles were trained{quote}

We have some assertions in place which alerted us to it not being the case that any SEs had trained. Though the reprex includes 60 second training times, in reality the models have ~24 hours to train on a large server, so we would expect > 2 base models to train 100% of the time, and it became clear that regardless of the number of models (50) and max training time (86,400) specified via arguments, both being specified at all seemed to block the training of SEs at the end. It seems unlikely there was an issue assembling the SEs since they built with no issues as soon as I commented out {{max_models}}, and feels more likely that attempting to build them just didn’t occur at all.
The training of multiple SEs throughout the process seems to be a relatively new change, so the new and slightly more complicated logic/strategies put in place make sense given that, but in the past I believe H2O used the allotted time to train base models and then at the conclusion would always trained 2 SEs (best of family and all models), and I would generally expect that to occur in nearly all scenarios (as long as like you said, there’s >= 2 base models and there’s no error assembling the SEs)

As for next steps, I think it’d probably make sense to test a dev version on your end with the reprexes above to see if the issue persists (the last bullet explains what I’ve come to expect from H2O’s past versions and is behavior that I think would make sense even if both args are specified). If the issue can be reproduced, there might be some edge case logic to build that ensures building 2 SEs is at least attempted following the expiration of the max_runtime_secs budget (in H2O’s unit testing, I’d then {{assert}} that all {{aml}} objects in the reprex should have an SE as long as they have > 2 base models for future releases).

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: OK, I think I get it: you expect the combined {{max_models}}+{{max_runtimes_secs}} to work as it used to until {{3.32.1.x}} included, in the sense that even if the {{max_runtime_secs}} expired before {{max_models}} models have been trained, then AutoML should train 2 SEs on top, regardless how long it takes, and therefore making the {{max_runtime_secs}} harder to understand for common user.

The problem here is that the old behaviour regarding SEs, although partly consistent with old documentation, was inconsistent regarding the {{max_runtime_secs}} semantic/expectations, and you usually don’t fix an inconsistency by keeping it only in some cases.

[~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] what do you think of this use-case?

I see 4 ways to handle this:

leave it as is: {{max_runtime_secs}} behaves consistently, nothing is trained anymore once the time budget is exhausted, regardless of whether {{max_models}} is specified or not. When {{max_models}} is specified, it also means that the final SEs are not systematically trained.

ensure that SEs are always trained as soon as {{max_models}} is specified, potentially ignoring the {{max_runtime_secs}} if it is also specified and expires before having trained {{max_models}}. In this case, we would update doc saying that when combined with {{max_models}}, {{max_runtime_secs}} specifies the max amount of time to train the base models only, but the AutoML run itself can run significantly longer.

when both are specified, we could cheat regarding the {{max_runtime_secs}} time limit: as {{max_models}} is also specified in this case, then the training workflow is also simpler because with {{max_models}} we want to focus on reproducibility; which means that we could measure the time taken to train the first model, and based on this, reevaluate the time the max training time of base models (e.g. approximately save twice this amount of time for SEs): it’s extremely approximative and imperfect given that we don’t know how next models and SEs really need, and with some tricky logic, we can probably reach a final duration around {{max_runtime_secs}} +/- 30%…

same as #1 + add an option to force SE training, no matter what.

Personally, I’m not a fan at all of #2: it falls back to past confusions and reinforces the distinction between “base models” and SEs, defining H2O-AutoML as a tool whose main goal is to produce SE models (instead of producing the most accurate, interpretable, fair… depending on the use-case). I also find that #3 adds unnecessary complexity for almost no benefit.

4, although not very elegant (except if we decide to use a better way to communicate this than an additional parameter, like a global behaviour set through env variable for all AutoML runs), is a practical solution. It would also add complexity to handle some edge cases, but much less than in #3 for example.

To be honest, I still struggle to understand users expectations when using both parameters. If reproducibility is important, {{max_models}} should be used, otherwise I’d avoid using it and just specify a time budget. If on top of this, the total runtime must be capped to avoid waste of resources, then I don’t see why by default AutoML should keep training after this cap because we still have some SEs to train (for how long? no one knows!).

Thoughts?

exalate-issue-sync[bot] commented 1 year ago

Kunal Mishra commented: So yes, I agree that our expectations given the previous version of H2O (3.32.x) didn’t quite fulfill here. However, the reason they did not, regardless of the rest of this debate, still feels like a bug where it’s unclear why the SEs are not being trained at any point during the reprex. Even when ample time and max models are specified, they’re still not being trained, which feels like a flaw with the new logic somewhere worth investigating.

I also agree for the most part with your evaluation of the options. An option to force SE training (enabled by default?) is probably the easiest and most commonsense option moving forward. I think one possible complication that doesn’t affect my use case and thinking is the possibility of specifying non GLM metalearners in the AutoML call (otherwise deciding how much time to “save” for training SEs at the end would be relatively simple… right? GLMs train nearly instantaneously in non-resource-starvation scenarios).

And then user expectation when using both parameters on our end at least is to use the budgeted time to train 50 “deeper” models rather than potentially hundreds of “shallower” models with the significant resources being thrown at the problem, at least when the AutoML was initially being specified for this problem with a 3.32.x version. We always use the Best of Family Stacked Ensembles outputted as MOJOs in production (using an SE with all models on the leaderboard was too inefficient at time of prediction so we use the lighter SE). Then the incremental benefit of a suite of individually better models with more time and resources poured into each of them was higher this way then training a huge variety of lighter, shallower models and taking the best ~6 of them. And the reason we specified a max runtime seconds (again at that time – I haven’t verified this assumption recently) was that training didn’t seem to complete ever, which I know is probably modifiable behavior via early stopping behavior or metrics but max runtime secs is by far the easier to use, manipulate, talk about, and document for future use.

exalate-issue-sync[bot] commented 1 year ago

Kunal Mishra commented: It… looks like this was fixed in more recent version of H2O? When specifying max_models of 5, for example, in a few minutes it looks like 5 base models were trained and then 2 stacked ensembles (an all & best of family) before the completion of the automl call, which is the behavior we had expected.

Increasing max_runtime_secs and keeping max_models the same led to the desired behavior of “deeper” individual models with more compute time per model while still retaining the SEs, which we are happy about.

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-8844 Assignee: Sebastien Poirier Reporter: Kunal Mishra State: Open Fix Version: N/A Attachments: N/A Development PRs: N/A

h2oai / h2o-3

AutoML does not generated Stacked Ensembles when max_models is specified #6589

Stacked Ensemble does generate

Stacked Ensemble does not generate

Stacked Ensemble does not generate

exploitation_ratio = .05

Stacked Ensemble does not generate

seed = 1,

Stacked Ensemble DOES generate

max_models = 50,

leave it as is: {{max_runtime_secs}} behaves consistently, nothing is trained anymore once the time budget is exhausted, regardless of whether {{max_models}} is specified or not. When {{max_models}} is specified, it also means that the final SEs are not systematically trained.

same as #1 + add an option to force SE training, no matter what.