unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[Question] Ensemble model on a mix of local and global models using historical_forecasts on a univariate series #2456

Closed stamm1989 closed 1 week ago

stamm1989 commented 1 month ago

Hi all,

I have some questions about the ensemble models. I've been training a LightGBM and an ARIMA model on the same univariate time series. Now I want to use these fitted models inside the RegressionEnsembleModel. As I understand ensemble models, I'm training a meta-model where the out-of-sample predictions from the base models are the input features for the meta-model. The RegressionEnsembleModel seems to provide this via train_using_historical_forecasts=True, but my setup does not seem to be supported. And setting train_using_historical_forecasts=False would trigger a single prediction too far into the future for me to build up a sufficiently large training set for the meta-model.

So my questions:

What would be the way forward for me? I'm thinking about subclassing EnsembleModel, or perhaps doing the historical_forecasts manually and training a model based on that (roughly like the sketch below). I did not find any similar issues/questions in here, only a few that are a bit related.
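An untested sketch of that manual route (arima, lgbm and series are placeholders for the fitted base models and the univariate target; pass future_covariates here too if the base models use them):

from darts.models import RegressionModel

# out-of-sample predictions from each base model over the last 20% of the series
hf_arima = arima.historical_forecasts(series, start=0.8, forecast_horizon=1, retrain=True)
hf_lgbm = lgbm.historical_forecasts(series, start=0.8, forecast_horizon=1, retrain=False)

# stack them into one multivariate series and use it as features for the meta-model
meta_features = hf_arima.stack(hf_lgbm)
meta_model = RegressionModel(lags=None, lags_future_covariates=[0])
meta_model.fit(series.slice_intersect(meta_features), future_covariates=meta_features)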

madtoinou commented 1 month ago

Hi @stamm1989,

The "meta-model" you are referring to is the RegressionModel that is consuming the ensembled models as covariates to generate the final forecast, right? In order to do out-of-sample forecasting, you do not need to use the train_using_historical_forecasts feature. The idea of using historical_forecasts() in fit()/predict() for the EnsembleModel is to improve the quality of the covariates generated by the ensembled models (instead of predicting them in one go with a single predict); they are built point by point, leveraging the latest information without causing data-leak.

To answer your questions point by point:

In order to get out-of-sample predictions, you just need to specify the series to forecast when you call predict() (while making sure that the ensembled models are all transferable).
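For example, with a hypothetical new_series that the ensemble was never trained on:

ensemble_model.fit(train_series)
# out-of-sample: forecast a different series without re-fitting anything
forecast = ensemble_model.predict(n=14, series=new_series)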

Let me know if this clarifies things.

stamm1989 commented 1 month ago

Hi,

Thank you for your response, it already clarifies a lot. Indeed, the meta-model/ensemble-model/RegressionModel needs to consume covariates, where the covariates come from the predictions of child models, e.g. an ARIMA, a LightGBM and/or potentially others. So to build up this training set for the RegressionModel, I need predictions from each child model. These predictions need to be out-of-sample so as not to trigger data leakage.

I think I misinterpreted the code before. Am I correct in saying: if today is time t, RegressionEnsemble first fits the base models on timestamps 0:(t - regression_train_n_points), then predicts the remaining regression_train_n_points, either with predict() in one go or based on historical_forecasts(), depending on train_using_historical_forecasts? I agree that these would be out-of-sample predictions. I initially thought the base models would be trained on 0:t and that historical_forecasts() would then be called to generate t-regression_train_n_points:t, but that would indeed generate in-sample predictions/data leakage, hence my initial question about allowing retrain=True. I'd like to use train_using_historical_forecasts to build up a sufficiently large training set for the RegressionModel.
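In rough pseudo-code, my mental model is now (illustrative names only, not the actual darts internals):

train_n = regression_train_n_points
base_train = series[:-train_n]            # 0 : (t - regression_train_n_points)

for m in forecasting_models:              # base models never see the last train_n points
    m.fit(base_train)

if train_using_historical_forecasts:
    # point-by-point, out-of-sample 1-step forecasts over the held-out range
    covs = [m.historical_forecasts(series, start=len(series) - train_n,
                                   forecast_horizon=1, retrain=False)
            for m in forecasting_models]
else:
    # one multi-step prediction covering the whole held-out range
    covs = [m.predict(n=train_n) for m in forecasting_models]

# stack the per-model predictions into one multivariate covariates series
covariates = covs[0]
for c in covs[1:]:
    covariates = covariates.stack(c)
regression_model.fit(series[-train_n:], future_covariates=covariates)  # the meta-model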

I'm using an ARIMA (not AutoARIMA); #2281 mentions that it is also transferable, and I see that it accepts a series parameter in predict(), so the code seems to confirm that it is transferable.

Still some questions remain, so I've added some code below and put my questions in comments:

from darts.models import RegressionEnsembleModel, ARIMA, LightGBMModel
import pandas as pd
import numpy as np
import darts
from darts import TimeSeries

# Generate training dataset
data_set = pd.DataFrame({'calendar_date': pd.date_range("2021-01-01", "2024-01-01")})
data_set['month_num'] = data_set['calendar_date'].dt.month
data_set['week_day'] = data_set['calendar_date'].dt.weekday
data_set = pd.concat(
    [
        data_set,
        pd.get_dummies(data_set['week_day'], prefix='weekday_dummy').astype('int')
    ],
    axis=1
)
data_set['series'] = np.sin((data_set['calendar_date'] - data_set['calendar_date'].min()).dt.days / 7)
print('pandas_dataset')
print(data_set)

# Convert to darts timeseries + take some slices
series = TimeSeries.from_dataframe(data_set, "calendar_date")
series_4w = series[0:-28]
series_2w = series[0:-14]
print('darts timeseries')
series['series'].plot()

# feature subset
all_week_dummies = [f'weekday_dummy_{i}' for i in range(7)]
subset_week_dummies = all_week_dummies[0:3]

# arima fit/predict
arima = ARIMA(p=1, d=0, q=1)
arima.fit(series=series_4w['series'], future_covariates=series_4w[subset_week_dummies])
forecasts_4w_arima = arima.predict(n=14, series=series_4w['series'], future_covariates=series[subset_week_dummies])
forecasts_2w_arima = arima.predict(n=14, series=series_2w['series'], future_covariates=series[subset_week_dummies])

# lgbm fit/predict
lgbm = LightGBMModel(lags=7, lags_future_covariates=[0], categorical_future_covariates=['week_day'])
lgbm.fit(series=series_4w['series'], future_covariates=series_4w[['week_day']])
forecasts_4w_lgbm = lgbm.predict(n=14, series=series_4w['series'], future_covariates=series['week_day'])
forecasts_2w_lgbm = lgbm.predict(n=14, series=series_2w['series'], future_covariates=series['week_day'])

# Plot some results
series['series'].plot(label='full-series')
series_2w['series'].plot(label='2w-later')
series_4w['series'].plot(label='training')

forecasts_2w_arima.plot(label='arima_preds_2w')
forecasts_4w_arima.plot(label='arima_preds_4w')
forecasts_2w_lgbm.plot(label='lgbm_preds_2w')
forecasts_4w_lgbm.plot(label='lgbm_preds_4w')
# Does not work
# Question: why the error `Cannot instantiate EnsembleModel with a mixture of unfitted and fitted `forecasting_models``? Both models are fitted.
model = RegressionEnsembleModel(
    forecasting_models = [arima,lgbm],
    train_forecasting_models=True,
    train_using_historical_forecasts=True,
    regression_train_n_points=21
)
# Does not work
# Question:
# ERROR:darts.models.forecasting.regression_ensemble_model:ValueError: `train_using_historical_forecasts=True` is only available when all `forecasting_models` are global models.
# Why is this not permitted? arima/lgbm are both capable of generating historical_forecasts, i.e.
# arima.historical_forecasts(series=series['series'], future_covariates=series[subset_week_dummies], retrain=False, start=0.9)
# lgbm.historical_forecasts(series=series['series'], future_covariates=series[['week_day']], retrain=False, start=0.9)
model = RegressionEnsembleModel(
    forecasting_models = [arima.untrained_model(),lgbm.untrained_model()],
    train_forecasting_models=True,
    train_using_historical_forecasts=True,
    regression_train_n_points=21
)
# Works, but...
# Question: the forecasting models are re-trained, but the features of the base models are not respected?
ensemble_model = RegressionEnsembleModel(
    forecasting_models = [arima.untrained_model(),lgbm.untrained_model()],
    train_forecasting_models=True,
    train_using_historical_forecasts=False,
    regression_train_n_points=21
)
ensemble_model.fit(
    series_4w['series'],
    future_covariates=series_4w[all_week_dummies + ['week_day']]
)

ensemble_preds = ensemble_model.predict(
    n=14,
    future_covariates=series[all_week_dummies + ['week_day']]
)

# arima
print('base arima nr params\n', len(arima.model.params))
print('ensemble arima nr params\n', len(ensemble_model.forecasting_models[0].model.params))
print('')
# lgbm
print('base lgbm features\n', sorted(lgbm.lagged_feature_names))
print('ensemble lgbm features\n', sorted(ensemble_model.forecasting_models[1].lagged_feature_names))
madtoinou commented 1 month ago

Am I correct in saying: if today is time t, RegressionEnsemble first fits the base models on timestamps 0:(t - regression_train_n_points), then predicts the remaining regression_train_n_points, either with predict() in one go or based on historical_forecasts(), depending on train_using_historical_forecasts?

This is correct, assuming that the forecasting models were not fitted/pre-trained beforehand.

I agree that these would be out-of-sample predictions. I initially thought the base models would be trained on 0:t and that historical_forecasts() would then be called to generate t-regression_train_n_points:t, but that would indeed generate in-sample predictions/data leakage, hence my initial question about allowing retrain=True.

I'm not sure I understand why t-regression_train_n_points:t would be in-sample? Those forecasts only rely on points from before today t. Furthermore, the ensembling regression model does need access to some values of the training series.

I'd like to use train_using_historical_forecasts to build up a sufficiently large training set for the RegressionModel.

Using train_using_historical_forecasts will not increase the size of the training set for the RegressionModel.

Question: why the error `Cannot instantiate EnsembleModel with a mixture of unfitted and fitted

The error is triggered because you use a mixture of local and global models. The error message is confusing; the part relevant to your situation is "using only trained GlobalForecastingModels together with retrain_forecasting_models=False".

ERROR:darts.models.forecasting.regression_ensemble_model:ValueError: train_using_historical_forecasts=True is only available when all forecasting_models are global models. Why is this not permitted? arima/lgbm are both capable of generating historical_forecasts, i.e.

It's not a question of being able to generate historical forecasts; it's just that ARIMA is transferable but not global. Since EnsembleModels are defined as global models, this constraint is enforced for all the "child" models (including the forecasting ones). You can try to comment out the check here and see if the model behaves as expected. The TransferableLocalModels are kind of a special category that comes with very specific constraints compared to the rest of the models.
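You can see where each model sits in the hierarchy like this (the import path is taken from darts' internals, so it may change between versions):

from darts.models.forecasting.forecasting_model import (
    GlobalForecastingModel,
    TransferableFutureCovariatesLocalForecastingModel,
)

print(isinstance(lgbm, GlobalForecastingModel))                              # True
print(isinstance(arima, GlobalForecastingModel))                             # False
print(isinstance(arima, TransferableFutureCovariatesLocalForecastingModel))  # True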

Question: the forecasting models are re-trained, but the features of the base models are not respected?

What do you mean by "the features are not respected"?

stamm1989 commented 1 month ago

Thanks again for your quick reply.

I'm not sure I understand why t-regression_train_n_points:t would be in-sample? Those forecasts only rely on points from before today t. Furthermore, the ensembling regression model does need access to some values of the training series.

This was mainly my confusion/incorrect understanding: if the base models were trained on 0:t, then without any refitting, historical_forecasts over t-regression_train_n_points:t would cover timestamps already seen by the base models. But as I understand it now, that isn't what happens.

Using train_using_historical_forecasts will not increase the size of the training set for the RegressionModel.

Thanks, I meant to say: when increasing regression_train_n_points to a relatively large number, I would prefer the behaviour of using historical_forecasts.

It's not a question of being able to generate historical forecasts; it's just that ARIMA is transferable but not global. Since EnsembleModels are defined as global models, this constraint is enforced for all the "child" models (including the forecasting ones). You can try to comment out the check here and see if the model behaves as expected. The TransferableLocalModels are kind of a special category that comes with very specific constraints compared to the rest of the models.

Alright, I'll give this a go.

What do you mean by "the features are not respected"?

Base models can be fitted on different subsets of features. In the example I fitted the arima/lgbm on different feature sets. But if I refit the arima/lgbm inside the RegressionEnsemble, it ignores the features they were initially fitted on, and all features are used.

E.g. in my code example, lgbm was fitted on the categorical variable week_day, and not on the weekday dummies, yet these dummies end up in the lgbm model of the RegressionEnsemble, judging by ensemble_model.forecasting_models[1].lagged_feature_names. For the arima, I trained the base model on a subset of the weekday dummies, but all weekday dummies seem to be present inside the arima model of the ensemble_model.

madtoinou commented 1 month ago

This was mainly my confusion/incorrect understanding: if the base models were trained on 0:t, then without any refitting, historical_forecasts over t-regression_train_n_points:t would cover timestamps already seen by the base models. But as I understand it now, that isn't what happens.

The forecasting models will be fitted up to t-regression_train_n_points and then used to generate covariates for the range t-regression_train_n_points:t, which are passed to the "ensembling" regression model. So no data leakage should occur.

Base models can be fitted on different subsets of features. In the example I fitted the arima/lgbm on different feature sets. But if I refit the arima/lgbm inside the RegressionEnsemble, it ignores the features they were initially fitted on, and all features are used.

Oh yes, indeed: if you refit the forecasting models, they will just consume the series that is passed and completely discard what they were trained on previously. You would have to implement your own custom code to preprocess the series and make sure only the desired features are seen by these models.
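One possible shape for that custom code, purely as a sketch (note that this plain wrapper will not plug into RegressionEnsembleModel as-is, since darts type-checks its forecasting models; it only illustrates the column-slicing idea):

class ColumnSubsetWrapper:
    """Slices future_covariates down to fixed columns before delegating."""

    def __init__(self, model, covariate_columns):
        self.model = model
        self.cols = covariate_columns

    def fit(self, series, future_covariates=None, **kwargs):
        if future_covariates is not None:
            future_covariates = future_covariates[self.cols]
        return self.model.fit(series, future_covariates=future_covariates, **kwargs)

    def predict(self, n, series=None, future_covariates=None, **kwargs):
        if future_covariates is not None:
            future_covariates = future_covariates[self.cols]
        return self.model.predict(n, series=series, future_covariates=future_covariates, **kwargs)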

madtoinou commented 1 week ago

Hi @stamm1989,

I experimented a bit to see how many changes would be required to achieve what you described: using a local and a global model as forecasting models, each pre-trained with different features/series. It appears that it would require a lot of work, and this feature is not one of our priorities at the moment.

If you manage to get something functional, we would be more than happy to review your PR and integrate it into the library.

In the meantime, I am closing this issue, feel free to reopen it :)