unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[BUG] RegressionEnsembleModel.fit() does not have ability to use validation data to fit either base learners or ensemble learner #1785

Closed mg10011 closed 1 year ago

mg10011 commented 1 year ago

Describe the bug
The RegressionEnsembleModel does not allow a val_series or val_past_covariates to be passed when training the base models that make up the ensemble. As a result, it currently seems impossible to use val_loss or other validation-set metrics as EarlyStopping criteria.

Even if I fit the individual base models separately, the fit() member function appears to restart training for each of them.


madtoinou commented 1 year ago

Hi @mg10011,

For RegressionEnsembleModel, the definition of a validation set is a bit ambiguous: the model already sets aside the last regression_train_n_points values to train the regression model. Would you use those same timestamps as the validation set for the forecasting models of the ensemble? That feels a little like data leakage to me...
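The holdout described above can be sketched in plain Python (a simplified illustration of the documented regression_train_n_points behavior; the real darts implementation operates on TimeSeries objects):

```python
def split_for_ensemble(series, regression_train_n_points):
    """Mimic how RegressionEnsembleModel partitions its training data.

    Base forecasting models are trained on everything except the tail;
    the last `regression_train_n_points` values are held out so that the
    base models' forecasts over that window can serve as features for
    the ensemble's regression model.
    """
    base_train = series[:-regression_train_n_points]
    regression_train = series[-regression_train_n_points:]
    return base_train, regression_train


values = list(range(100))
base_train, reg_train = split_for_ensemble(values, regression_train_n_points=10)
print(len(base_train), len(reg_train))  # 90 10
```

This is why reusing that tail as a validation set for the base models would be leaky: the same timestamps would steer early stopping and train the combiner.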

For the base class EnsembleModel, a validation set at training time makes sense for the deep learning models but not for models such as ARIMA or ExponentialSmoothing; the feature was probably not implemented for this reason. EnsembleModel.fit() could eventually accept the argument and forward it only to the models that support it. Alternatively, EnsembleModel could accept "pre-trained" models which would then be "fine-tuned" when fit() is called (without a validation set, but if the model has already converged, this should not be much of a problem).
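The "forward it only to the models that support it" idea could be sketched like this (all class and attribute names here are hypothetical stand-ins, not darts API):

```python
class LocalStatModel:
    """Stand-in for a statistical model (ARIMA-like): no validation set."""
    supports_validation = False

    def fit(self, series):
        self.level_ = sum(series) / len(series)
        return self


class TorchLikeModel:
    """Stand-in for a deep learning model that can consume a validation set."""
    supports_validation = True

    def fit(self, series, val_series=None):
        self.saw_validation_ = val_series is not None
        return self


def fit_all(models, series, val_series=None):
    """Forward val_series only to models that can actually use it."""
    for model in models:
        if val_series is not None and getattr(model, "supports_validation", False):
            model.fit(series, val_series=val_series)
        else:
            model.fit(series)
    return models


stat, torchy = LocalStatModel(), TorchLikeModel()
fit_all([stat, torchy], series=[1.0, 2.0, 3.0], val_series=[4.0, 5.0])
print(torchy.saw_validation_)  # True
```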

WDYT @dennisbader?

mg10011 commented 1 year ago

This feels like a bug. RegressionEnsembleModel does not allow building an ensemble from already-fitted models, and for neural-network-based models, using the validation set (rather than the training data alone) as the early-stopping criterion is essential. As a result, these ensembles effectively cannot include NN-based models.

Perhaps it would be better to declare the RegressionEnsembleModel object first and then fit each base model individually with the appropriate stopping criteria. Thoughts?
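That workflow might look like the following sketch: base models are fitted separately (each with its own stopping criteria), and the ensemble then trains only the combiner on the held-out tail. The interface and names are hypothetical, not current darts API:

```python
class ConstantModel:
    """Toy pre-trained stand-in: forecasts its fitted mean."""

    def fit(self, series):
        self.mean_ = sum(series) / len(series)
        return self

    def predict(self, n):
        return [self.mean_] * n


def fit_ensemble_on_pretrained(fitted_models, series, regression_train_n_points):
    """Train only the combiner; the base models stay untouched, so any
    early-stopping criteria used to fit them are preserved.

    The real RegressionEnsembleModel fits a linear regression over the
    base models' forecasts of the held-out tail; this sketch just checks
    forecast shapes and returns equal weights.
    """
    tail = series[-regression_train_n_points:]
    forecasts = [m.predict(regression_train_n_points) for m in fitted_models]
    assert all(len(f) == len(tail) for f in forecasts)
    return [1.0 / len(fitted_models)] * len(fitted_models)


# Fit each base model separately, then train only the combining step:
series = [float(i) for i in range(20)]
models = [ConstantModel().fit(series[:-5]), ConstantModel().fit(series[:-5])]
weights = fit_ensemble_on_pretrained(models, series, regression_train_n_points=5)
print(weights)  # [0.5, 0.5]
```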

madtoinou commented 1 year ago

Accepting pre-trained NN-based models in EnsembleModel is on darts' roadmap, but it will require some work to limit the risk of data leakage and to decide, for example, whether mixing untrained regression-based models with pre-trained NN-based models should be allowed.

If you optimize the model parameters (architecture, learning rate, and number of training epochs) before instantiating the EnsembleModel, early stopping should not be necessary to avoid over-fitting. This approach is suboptimal if model training is very time-consuming, but hyperparameter optimization is probably already part of your pipeline anyway...