Open exalate-issue-sync[bot] opened 1 year ago
Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] maybe we could also progressively save models to disk instead of keeping them in memory. This could be based on either of several conditions:
Erin LeDell commented: [~accountid:5b153fb1b0d76456f36daced] Yeah we should have the option of writing them to disk vs just deleting them.
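The save-to-disk idea from the comments above could be sketched as a simple capacity-based eviction policy: when the in-memory budget is exceeded, the worst-scoring model is written to disk instead of being deleted, and can be loaded back on demand. Everything here (the ModelStore class, pickle-based storage, the evict-the-worst rule) is a hypothetical illustration, not the H2O API (H2O itself would use something like h2o.save_model / h2o.load_model):

```python
import os
import pickle
import tempfile

class ModelStore:
    """Hypothetical sketch: keep at most `capacity` models in memory;
    evict the rest to disk rather than deleting them."""

    def __init__(self, capacity, directory):
        self.capacity = capacity
        self.directory = directory
        self.in_memory = {}   # model_id -> model object
        self.on_disk = {}     # model_id -> file path
        self.scores = {}      # model_id -> performance score (higher = better)

    def add(self, model_id, model, score):
        self.in_memory[model_id] = model
        self.scores[model_id] = score
        if len(self.in_memory) > self.capacity:
            # Evict the worst-scoring in-memory model to disk.
            worst = min(self.in_memory, key=self.scores.get)
            path = os.path.join(self.directory, worst + ".pkl")
            with open(path, "wb") as f:
                pickle.dump(self.in_memory.pop(worst), f)
            self.on_disk[worst] = path

    def get(self, model_id):
        # Serve from memory if present, otherwise reload from disk.
        if model_id in self.in_memory:
            return self.in_memory[model_id]
        with open(self.on_disk[model_id], "rb") as f:
            return pickle.load(f)

# Usage: keep at most 2 models in memory; adding a third evicts the worst.
with tempfile.TemporaryDirectory() as d:
    store = ModelStore(capacity=2, directory=d)
    store.add("gbm_1", {"kind": "gbm"}, score=0.91)
    store.add("drf_1", {"kind": "drf"}, score=0.88)
    store.add("glm_1", {"kind": "glm"}, score=0.95)  # evicts drf_1 (worst score)
    print(sorted(store.in_memory))   # ['gbm_1', 'glm_1']
    print(store.get("drf_1"))        # reloaded from disk: {'kind': 'drf'}
```

The eviction condition here is just "memory count exceeded"; the comment above suggests it could also be triggered by other conditions (e.g. time- or size-based).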
JIRA Issue Migration Info
Jira Issue: PUBDEV-5146 Assignee: UNASSIGNED Reporter: Erin LeDell State: Open Fix Version: N/A Attachments: N/A Development PRs: N/A
If we added the ability to keep the "best" group of k models, then we could recycle memory in the H2O cluster and run the AutoML training process for much longer and train more models. After a certain point, having so many models will use all the memory of the H2O cluster and force us to stop training.
Since ensembles are trained in AutoML most of the time (and by default), the "best k" models to keep may not be the "top k" models ranked by individual performance (though that would be the naive approach). Instead, it would be the collection that contains some of the best-performing models but also enough diversity that the resulting ensemble is the best possible ensemble of k models. We need a mechanism for choosing this group of k models without enumerating all possible ensembles of size k.
The approach will be based on minimizing the correlation between the errors of the base learners.
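A minimal sketch of that idea: given a matrix of per-row validation errors for each trained model, greedily grow the kept set by always adding the model whose errors are least correlated with those already chosen. The greedy strategy, the seeding with the lowest-error model, and the use of mean absolute error are assumptions for illustration, not the final design:

```python
import numpy as np

def select_diverse_k(errors, k):
    # errors: (n_models, n_rows) array of per-row validation errors.
    # Greedy sketch (an assumption, not H2O's final algorithm):
    # seed with the lowest mean-absolute-error model, then repeatedly
    # add the model least correlated with the models already chosen.
    corr = np.abs(np.corrcoef(errors))                    # pairwise |correlation|
    chosen = [int(np.argmin(np.abs(errors).mean(axis=1)))]
    while len(chosen) < k:
        remaining = [m for m in range(errors.shape[0]) if m not in chosen]
        chosen.append(min(remaining, key=lambda m: corr[m, chosen].mean()))
    return chosen

# Toy example: models 0 and 1 make (nearly) identical errors,
# while model 2's errors are uncorrelated with both.
t = np.linspace(0, 2 * np.pi, 100)
errors = np.vstack([
    0.5 * np.sin(t),    # model 0: best individual model
    np.sin(t) + 0.01,   # model 1: redundant (perfectly correlated with model 0)
    np.cos(t),          # model 2: worse alone, but diverse errors
])
print(select_diverse_k(errors, k=2))  # → [0, 2], not the top-2 [0, 1]
```

This avoids enumerating all k-subsets (O(n choose k)) at the cost of greediness: each step is a single pass over the remaining models against the precomputed correlation matrix.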
This is related to: https://0xdata.atlassian.net/browse/PUBDEV-4053