Historical Backtest With Updating Covariates (Tree/ Regression Models)

ETTAN93 commented 3 months ago

As a follow-up to the discussion in #2421, I want to clarify the methodology for doing a historical backtest when the covariates are being updated over time in reality. Assume that in production I have hourly data and I would like to produce a 7-day forecast every day. Hence, my model would be defined like this in production:

from darts.models import LightGBMModel

lgbm_model = LightGBMModel(
    lags = list(range(-24, 0)),
    lags_past_covariates = None,  # no past covariates are used
    lags_future_covariates = list(range(0, 168)),
    output_chunk_length = 24,
    n_jobs=-1,
    random_state=42,
    multi_models=True,
    verbose=0,
    force_col_wise=True
)

lgbm_model.fit(
    series = target_series_train,
    future_covariates = future_cov_train
)

lgbm_model.predict(n=168, series=target_series, future_covariates=future_cov_series)

For every day's forecast, I want the model to use the future covariates for that day, e.g. to forecast day 3's values, the model should use its own forecast for day 2 and the future covariates for day 3. Hence, I set the parameters as below:

target lags = [-24,....,-1]
lags_future_covariates = list(range(0, 168)),
output_chunk_length=24
n in predict = 168

To do an equivalent backtest of that, I am slightly unsure how to structure the covariates correctly. Assuming I am retraining every day, this is how I have constructed the historical backtest:

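import pandas as pd
from datetime import timedelta
from dateutil.relativedelta import relativedelta
from darts import TimeSeries
from darts.models import LightGBMModel

lgbm_model = LightGBMModel(
    lags=list(range(-24, 0)),
    lags_past_covariates = None,  # no past covariates are used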
    lags_future_covariates=list(range(0, 168)),
    output_chunk_length=24,
    n_jobs=-1,
    random_state=42,
    multi_models=True,
    verbose=0,
    force_col_wise=True
)

start_date = pd.Timestamp("2024-01-08 00:00:00") 
split_date = pd.Timestamp("2024-04-30 23:00:00")
end_date = pd.Timestamp("2024-05-22 23:00:00")
backtest_results_dict = {}

for date in pd.date_range(split_date, end_date - timedelta(days = 7)):
    target_series_train = target_series[start_date: date]
    future_cov_train = future_cov_series[start_date: date]

    lgbm_model.fit(
        series = target_series_train,
        future_covariates = future_cov_train
    )

    test_start_date = date + relativedelta(hours = 1)
    target_series_lb = test_start_date - timedelta(days = 1)
    target_series_ub = date
    future_cov_series_lb = test_start_date
    future_cov_series_ub = date + timedelta(days = 1)
    forecast_results_7_day = pd.DataFrame()

    for day in range(1, 8):
        data_df = raw_data_df[raw_data_df['forecast_horizon'] == day]
        target_series_test = TimeSeries.from_dataframe(data_df[target_col])[target_series_lb: target_series_ub]
        future_cov_series_test = TimeSeries.from_dataframe(data_df[future_cov])[future_cov_series_lb: future_cov_series_ub]
        forecast_results = lgbm_model.predict(24, series=target_series_test, future_covariates=future_cov_series_test).pd_dataframe()
        forecast_results_7_day = pd.concat([forecast_results_7_day, forecast_results])

        #update dates
        target_series_lb += timedelta(days=1)
        target_series_ub += timedelta(days=1)
        future_cov_series_lb += timedelta(days=1)
        future_cov_series_ub += timedelta(days=1)
    forecast_results_7_day.rename(columns = {'act_price': 'new_best_guess'}, inplace=True)
    backtest_results_dict[str(date)] = forecast_results_7_day

I have constructed a raw_data_df dataframe that contains different forecast horizons corresponding to the future covariates that are being updated, e.g. to forecast day 1, I would use the rows with forecast_horizon = 1 in raw_data_df. However, in my implementation, the future covariates that I am passing are different from the ones in the "production" example above, since I am only passing the future covariates for that specific day, i.e. future covariates for day 3 when forecasting day 3, then day 4 when forecasting day 4.

This doesn't seem correct to me because the production example takes in all 7 days of future covariates and forecasts the next 7 days in one go. How do I set my future covariates in the backtest to be the same as that?

ETTAN93 commented 3 months ago

Hi @dennisbader, based on what we discussed, shouldn't this implementation be correct?

lgbm_model =  LightGBMModel(
    lags=list(range(-24, 0)),
    lags_future_covariates=list(range(0, 24)),
    output_chunk_length=24,
    n_jobs=-1,
    random_state=42,
    multi_models=True,
    verbose=0,
    force_col_wise=True
)

start_date = pd.Timestamp("2024-01-08 00:00:00") 
split_date = pd.Timestamp("2024-04-30 23:00:00")
end_date = pd.Timestamp("2024-05-22 23:00:00")

for date in pd.date_range(split_date, end_date - timedelta(days = 7)):
    target_series_train = target_series[start_date: date]
    future_cov_train = future_cov_series[start_date: date]
    print(f'Predicting from : {date + relativedelta(hours = 1)} to {date+relativedelta(days = 7)}')

    lgbm_model.fit(
        series = target_series_train,
        future_covariates = future_cov_train
    )

    data_df = raw_data_df[raw_data_df['as_of'].dt.date == date.date()]
    target_series_test = TimeSeries.from_dataframe(data_df[target_col])
    future_cov_series_test = TimeSeries.from_dataframe(data_df[future_cov])
    forecast_results = lgbm_model.predict(n=168, series=target_series_test, future_covariates=future_cov_series_test).pd_dataframe()

Why do I get the error that the future covariates are not long enough? Based on the error message below, it would mean the model is using 7 days of future covariates to predict each day rather than day 1 future covariate to predict day 1 etc.

dennisbader commented 3 months ago

The error message tells you how long your future covariates need to be to generate the entire forecast horizon n (7 days). Could it be that your target_series_test at the first iteration ends at 2024-05-07 23:00:00?
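
To make the length requirement concrete, here is a minimal sketch of the arithmetic. It assumes (this is a reading of the behaviour discussed in this thread, not a quote from the Darts source) that for every chunk the future-covariate lags are counted from that chunk's first forecast step, and that n is a multiple of output_chunk_length. The helper name `required_future_cov_steps` is hypothetical.

```python
import math


def required_future_cov_steps(n: int, output_chunk_length: int, max_future_lag: int) -> int:
    """Number of steps after the end of the target series that the future
    covariates must cover (assumes n is a multiple of output_chunk_length)."""
    n_chunks = math.ceil(n / output_chunk_length)
    # 0-based offset of the first forecast step of the last chunk
    last_chunk_start = (n_chunks - 1) * output_chunk_length
    # the last chunk still needs covariates up to its own largest lag
    return last_chunk_start + max_future_lag + 1


# setup with lags_future_covariates = list(range(0, 24))
print(required_future_cov_steps(168, 24, 23))   # 168
# setup with lags_future_covariates = list(range(0, 168))
print(required_future_cov_steps(168, 24, 167))  # 312
```

Under these assumptions, lags 0..23 need exactly the 168 hours being forecast, while lags 0..167 would need 312 hours (13 days) of covariates past the end of the target series.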

ETTAN93 commented 3 months ago

Ah ok, I see my mistake now. With this setup, I am retraining/refitting the model daily with an expanding window. Hence, I do not need to pass a target series to the predict function, because the required target lags are already part of target_series_train. So for day 1 of the first iteration, with target lags [-24, ..., -1], the model will look at the 2024-04-30 data in target_series_train to make the prediction. For days 2-7, the target lags will be filled in an autoregressive manner.

So this would work:

for date in pd.date_range(split_date, end_date - timedelta(days = 7)):
    target_series_train = target_series[start_date: date]
    future_cov_train = future_cov_series[start_date: date]
    print(f'Predicting from : {date + relativedelta(hours = 1)} to {date+relativedelta(days = 7)}')

    lgbm_model.fit(
        series = target_series_train,
        future_covariates = future_cov_train
    )

    data_df = raw_data_df[raw_data_df['as_of'].dt.date == date.date()]
    future_cov_series_test = TimeSeries.from_dataframe(data_df[future_cov])
    forecast_results = lgbm_model.predict(n=168, future_covariates=future_cov_series_test).pd_dataframe()

However, if I take the fit out of the for loop and don't retrain, then I would need to pass a new target_series_test to the predict function in every iteration so that it starts predicting from the right time point.

Is that correct?

dennisbader commented 3 months ago

Yes, that's correct, and it works because you only train on 1 target series. In this case the training series is attached to the model. If you trained on multiple series, we don't attach them to the model, and you always have to provide a series to predict().

ETTAN93 commented 3 months ago

For the use case where I do not want to train the model daily, then the implementation above wouldn't work?

I will definitely have to pass a new target_series_test to the predict function every iteration and the only way to do that is to store the results from the previous day's forecast. The only way I can do that is to break the predictions each day via an inner loop within each iteration, e.g.

lgbm_model.fit(
    series = target_series_train,
    future_covariates = future_cov_train
)
target_series_test = TimeSeries.from_dataframe(data_df[target_col])[target_series_lb: target_series_ub]

for date in pd.date_range(split_date, end_date - timedelta(days = 7)):
    for day in range(1, 8):
        data_df = best_guess_df[best_guess_df['forecast_horizon'] == day]
        future_cov_series_test = TimeSeries.from_dataframe(data_df[future_cov])[future_cov_series_lb: future_cov_series_ub]
        forecast_results = lgbm_model.predict(24, series=target_series_test, future_covariates=future_cov_series_test)
        target_series_test = forecast_results #stores previous forecast results

Or is there another way you would recommend?

dennisbader commented 3 months ago

I don't understand how the target series is related to the previous day's forecast. Since you're doing a backtest (historical), you should have all actual values of your target series available.

I assume you have a target series covering the entire time range from (start of data, present moment). Then you just have to expand the target series slice (e.g. move the endpoint ahead) for each forecast.

e.g.:

# the training series is extracted from the long series
train_target_series = entire_target_series[:train_end]

# now let's say you want to predict 3 days after the end of training (3 days * 24 hours, at hourly target frequency)
new_end = train_end + 3 * 24 * train_target_series.freq
test_target_series = entire_target_series[:new_end]

# and the future covariates have to end n=168 time steps after `new_end`
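
The slicing idea above can be sketched with plain `datetime` arithmetic, using the dates from this thread. This only illustrates the window boundaries (no Darts objects are involved); `forecast_windows` is a hypothetical helper, and it assumes hourly data and future-covariate lags that do not reach beyond the forecast horizon.

```python
from datetime import datetime, timedelta


def forecast_windows(first_end, last_end, horizon_hours=168):
    """Yield (series_end, forecast_start, forecast_end) for one forecast per day."""
    series_end = first_end
    while series_end <= last_end:
        forecast_start = series_end + timedelta(hours=1)
        forecast_end = series_end + timedelta(hours=horizon_hours)
        yield series_end, forecast_start, forecast_end
        series_end += timedelta(days=1)  # expand the target slice by one day


split_date = datetime(2024, 4, 30, 23)
end_date = datetime(2024, 5, 22, 23)
windows = list(forecast_windows(split_date, end_date - timedelta(days=7)))

for series_end, start, end in windows[:2]:
    # the target slice passed to predict() would end at `series_end`;
    # the future covariates would have to reach at least `end`
    print(series_end, "->", start, "to", end)
```

Each tuple gives the end of the target slice for that forecast and the range the 168-step forecast would cover.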

ETTAN93 commented 3 months ago

Ah sorry, yes. I keep confusing myself by thinking about the predict with n = 24. Setting n = 168 would take care of the autoregressive component and the model would use day 1's forecast as day 2's target lag under the hood. So I would not need to store the forecast results for each day and just pass the expanded target series to the predict method for each forecast.
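
The "under the hood" part can be pictured with a toy rollout. This is a simplified sketch of the idea, not Darts' actual implementation: `toy_chunk_model` is a made-up stand-in for a fitted model with 24 target lags and output_chunk_length=24, and covariates are left out for brevity.

```python
def toy_chunk_model(last_24):
    """Stand-in for a fitted model: predicts the next 24 values from the last 24.
    (Hypothetical rule: repeat the previous day plus 1.)"""
    return [v + 1 for v in last_24]


def autoregressive_forecast(history, n, chunk=24):
    """Forecast n steps by repeatedly predicting `chunk` steps and feeding
    each chunk's output back in as the target lags of the next chunk."""
    values = list(history)          # observed target values
    forecast = []
    while len(forecast) < n:
        next_chunk = toy_chunk_model(values[-chunk:])
        forecast.extend(next_chunk)
        values.extend(next_chunk)   # day k's forecast becomes day k+1's lags
    return forecast[:n]


history = [float(h) for h in range(24)]   # one observed day: 0..23
forecast = autoregressive_forecast(history, n=168)
print(len(forecast))   # 168
print(forecast[0])     # 1.0 (day 1 is built from observed values)
print(forecast[24])    # 2.0 (day 2 is built from day 1's forecast)
```

Only the observed history is passed in; the intermediate daily forecasts never have to be stored or passed back by the caller.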

Another question @dennisbader on the model methodology: according to the documentation, I understood that when multi_models = True, 1 model will be created per time step, i.e. in my case 24 models will be trained, one for each time step of the output chunk. This would mean that each sub-model would learn the relationship between 1 specific time point and all the specified lags. For example,

Is it possible to implement this in Darts so that you train using only the future_covariate specific to the same timestamp as your target, without passing in the rest, i.e. you have only 1 model that is trained on this for day 1

then you retrain again on day 2 for:

This means that I am trying to predict the target at each timestamp with only the specific future covariate for the same hour.

ETTAN93 commented 1 month ago

hi @madtoinou/ @dennisbader, not sure if you had a chance to look at the use case above?

madtoinou commented 1 month ago

Hi @ETTAN93,

If you want each sub-model to look only at a specific timestamp of the covariate, and the shift/lag is always the same with respect to the predicted timestamps, you might as well use a single model (output_chunk_length=1) to predict all the timestamps (it's kind of what you are describing in your message) :) Possibly with lags on the target that are greater than the forecasted period (to avoid auto-regression), and manually looping to predict all the steps in the horizon.

One of the advantages of the multi-model approach is that the models forecast different timestamps while accessing the exact same information; this "sub-model with different future covariates lags" idea goes in the opposite direction.

Let me know if this is clear & helpful

ETTAN93 commented 1 month ago

Hi @madtoinou, I am not really sure I understand fully the point that you are making. Doesn't setting output_chunk_length = 1 just mean that the model predicts only 1 timestep ahead at each point in time? What would that have to do with which features are being fed into the model? Can you explain it in a different way?

What you said about it being a single model is true because you would just use a specific hour of a day as a feature to forecast the same hour on the next day and construct the feature set that way.

I understand the advantages of the multi-model approach but was just wondering if the example I suggested could make sense in the specific use case where, for example, the value at 3pm would not affect the value at 2pm and vice versa.

madtoinou commented 1 month ago

My point was that using multi_models=True while trying to "predict the target at each timestamp with only the specific future covariate for the same hour" is antinomic; the whole point of this approach is to allow all the sub-models to access exactly the same information to predict each step in output_chunk_length.

To phrase it differently, using output_chunk_length=1, lags_future_covariates=[0] (assuming that you want to rely on the value of the covariates at the same hour as the prediction) will yield pretty much the same result as multi_models=True, output_chunk_length=7, lags_future_covariates=[0, ..., 6] with fancy logic that slices the covariates so that each sub-model accesses only the covariate's value "aligned" with the step of output_chunk_length it is forecasting, because the tabularization of the series during training would be identical.
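
A small pure-Python sketch of that tabularization argument, under simplifying assumptions (no target lags, a single covariate, toy data; it mimics the idea and is not Darts' internal code). With output_chunk_length=1 and covariate lag 0, every training sample is a (covariate at t, target at t) pair. With 7 sub-models that are each restricted to the covariate aligned with their own step, the sub-models together see the same pairs.

```python
target = [100 + t for t in range(20)]   # toy target series
cov = [10 * t for t in range(20)]       # toy future covariate, same index

# single model: output_chunk_length=1, lags_future_covariates=[0]
single_model_samples = {(cov[t], target[t]) for t in range(len(target))}

# 7 sub-models, where sub-model k only sees the covariate aligned with step k
ocl = 7
sub_model_samples = {k: set() for k in range(ocl)}
for start in range(len(target) - ocl + 1):      # each training window
    for k in range(ocl):                        # each step in the output chunk
        sub_model_samples[k].add((cov[start + k], target[start + k]))

all_sub_model_samples = set().union(*sub_model_samples.values())
print(all_sub_model_samples == single_model_samples)          # True
print(len(sub_model_samples[0]), len(single_model_samples))   # 14 20
```

In this toy case each sub-model is fitted on a subset (14 samples) of what the single model sees (20 samples), so the sliced multi-model setup adds complexity without adding information.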

If the model does not rely only on covariates but also past values of the target/past covariates, then it could make sense to add some additional logic around the covariates to avoid the curse of dimensionality but it does not seem to be the case in what you are describing.

ETTAN93 commented 1 month ago

@madtoinou, I get what you are saying now and I completely agree. However, looking back at the question that I posted, I think the data example I was showing is slightly misleading. Let me try to rephrase in a better way using past covariates.

When saying "predicting the target at each timestamp with the future covariate of the same hour", I do not mean the same hour of the target for the same day but actually from the previous days (for past covariates) or future days (for future covariates). From the example that you gave, I am not sure if you understood it as the former?

For example, let's say we have 2 days of data below where feature_1 is a past covariate:

For backtesting I want to predict all 24 values on 1/2/2024 and I think that only the same hour on the previous day for the past covariate would have an impact on the same hour of the target but not any preceding or following hours, so the training set I want to construct would look like this:

This would mean that for each timestamp I am trying to predict, the past covariate feature used to predict each target will be the value from lag t-24 only.

How would this be implemented in Darts? Using output_chunk_length = 24, lags_future_covariates = [-24] does not seem to be right as that would only use the value at 1/1/2024 00:00:00 to predict all 24 values? And according to your explanation above, this is equivalent to setting multi_models = True, output_chunk_length = 24, lags_future_covariates = [0 .... -24]?

madtoinou commented 1 month ago

In this context, if you use output_chunk_length=1, lags_future/past_covariates=[-24], and set n=24 when calling predict() or horizon=24 when calling historical_forecasts(), you will get the intended behavior. Furthermore, auto-regression will not really happen because the model does not have lag on the target and each step of the target will be forecasted only with the desired covariate lags (same hour, previous day).

You do not need multi_models=True or output_chunk_length>1, because you don't need to avoid auto-regression since the model does not have lags on the target series and you want to have different covariates for each forecasted step.
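
What the suggested setup learns from can be written out by hand. The sketch below builds the (feature, label) rows for a single covariate lag of -24 on toy hourly data. It illustrates the resulting training table and is not a call into Darts; `build_lag_table` is a hypothetical helper.

```python
def build_lag_table(target, cov, lag=-24):
    """Pair each target value with the covariate value |lag| steps earlier."""
    rows = []
    for t in range(-lag, len(target)):
        rows.append((cov[t + lag], target[t]))   # (feature, label)
    return rows


# two days of toy hourly data
cov = [float(h) for h in range(48)]
target = [2.0 * c for c in cov]

rows = build_lag_table(target, cov)
print(len(rows))   # 24: every hour of day 2 paired with the same hour of day 1
print(rows[0])     # (0.0, 48.0)

# features needed to forecast the next 24 hours (steps 48..71) are cov[24..47],
# i.e. values that are already known, so no forecasted value is fed back in
needed = [cov[t - 24] for t in range(48, 72)]
```

Every forecast step in the next day gets its own feature (same hour, previous day), which is the behaviour asked for above.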

ETTAN93 commented 1 month ago

@madtoinou got it. If I were to include target lags, but also only for the same hour (previous day), can it be done the same way, i.e. output_chunk_length=1, lags_future/past_covariates=[-24] and target_lags = [-24]?

From what I understand, auto-regression still would not happen because the target_lag is set to -24, and to forecast t+1 the model will use the target lag at t-23.

madtoinou commented 1 month ago

Correct, the forecast quality might deteriorate when n > 24, as the model will start to consume its own forecasts to predict the next steps (t0, when predicting t+24). Note that this is not necessarily a problem, and it does not need to be avoided at all costs.
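
The point at which the model starts consuming its own forecasts can be checked with simple index arithmetic. A minimal sketch, assuming output_chunk_length=1 and counting forecast steps from 1 (`uses_own_forecast` is a hypothetical helper):

```python
def uses_own_forecast(step: int, target_lag: int = -24) -> bool:
    """True if predicting forecast step `step` (1-based) needs a target value
    that is itself a forecast rather than an observation."""
    lagged_step = step + target_lag   # <= 0 means an observed value
    return lagged_step >= 1


n = 48
first = next(s for s in range(1, n + 1) if uses_own_forecast(s))
print(first)  # 25: the 25th step needs the forecast made for step 1
```

So with a single target lag of -24, the first 24 steps rely only on observed values, and auto-regression starts at step 25.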

Since the topic of the discussion deviated quite a bit from the original topic, I am going to close the issue. Feel free to reopen it or open a new one if you still have questions.

unit8co / darts

Historical Backtest With Updating Covariates (Tree/ Regression Models) #2452