Closed by Allena101 8 months ago
Probably the simplest workaround would be to just forecast 2 weeks and then measure the RMSE (or whatever metric you decide to use) on the last week only.
Another thing that is related to this (and why I asked this in the first place) is that I am looking to forecast a period in a manner very similar to imputation/interpolation. I have a gap week and I want to fill in those values. In this instance, it is very likely that the gap will be towards the end of the series, so it makes sense to forecast that period and not just use some kind of Gaussian approach (although GaussianProcessFilter sounds like something I should explore as well).
So essentially I would like to train the model to encounter a gap (that needs to be imputed/forecasted), but it should still have future real values that inform the model where the predictions should end up. For example, let's say I want to impute the electricity usage during a period when (for whatever reason) measurements could not be taken while electricity was still being used, BUT I later learn the total usage for the missing period, so it would make sense to use that information when imputing (a bit similar to interpolation with direction=both). So the model should know the end value (the first value or more after the gap), and it also knows the cumulative value for all the data points that need to be imputed.
So to make the model the best at that particular situation, it sounds like a good idea to train it for just that scenario. That would mean training on, for example, 4 weeks of data, where the first 3 weeks are continuous (i.e. monotonically increasing), then there is a one-week gap, and then the 4th week, and the model tries to predict the gap week.
I hope what I am writing makes sense. How I would do it without Darts would be to put the first 3 weeks AND the week after the gap in X_train, and the gap itself in y_train. I would have to test whether it would work better to use some version of intermittent missing-data imputation (i.e. include the gap week in X_train as well but mark it as missing). Why I find this so difficult to test with Darts is that you don't divide between X_test and y_test when using Darts TimeSeries objects.
Hi @Allena101,
If you use any RegressionModel, you need to pick the lags so that they reflect the constraints, or post-process the forecasts to match your constraints. In your case, you can either:
- use lags strictly smaller than -8, so that when the model forecasts the timestep t0 (up to t6 if you want to forecast the whole week), it only accesses the data of "two weeks ago" (since lags -7 to -1 correspond to the previous week, with respect to the predicted week), or
- set output_chunk_length > 7 and then extract the last values of the forecasts (as mentioned in your 2nd message).
Filling gaps within a series is a slightly different problem: instead of a regression/deep learning model, you should probably use a statistical model that supports missing values and then perform in-sample forecasting to impute the missing values, or use any dedicated imputation method.
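To make the first option concrete, here is a minimal pure-Python sketch of how gapped lags translate into training rows. It has no Darts dependency: make_gapped_rows is a hypothetical helper that only mimics the idea behind Darts' tabularization when the lags skip the week immediately before the forecasted day.

```python
# Sketch: building a tabular training set with a one-week gap between the
# lags and the forecasted step, as when passing lags covering -14..-8 to a
# regression model. Illustrative only; not the Darts implementation.

def make_gapped_rows(series, lags):
    """For each forecastable index t, features are series[t + lag] for each lag."""
    rows = []
    min_lag = min(lags)  # most negative lag determines the first usable t
    for t in range(-min_lag, len(series)):
        features = [series[t + lag] for lag in lags]
        rows.append((features, series[t]))
    return rows

daily = list(range(28))          # 4 weeks of daily values: 0..27
lags = list(range(-14, -7))      # lags -14..-8: the week right before t is skipped
rows = make_gapped_rows(daily, lags)
features, target = rows[0]
print(features, target)          # first row: features from days 0..6, target is day 14
```

Note how the first predictable target is day 14, built only from days 0 to 6; days 7 to 13 (the "previous week") never appear among the features.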
If you really want to use a regression model, you will need to use the series as both target and future covariates, with a choice of series slicing, lags, and model parameters such that the model can predict the missing chunk. For your example with 3 weeks, you will probably need output_chunk_length = 7 and lags_future_covariates=[-7, -6, ..., -1, 7, 8, ..., 13] (7 values in the past, skip the 7 missing, 7 in the future). The target series should be the first 2 weeks of the target, and the future covariates should be the first 3 weeks of the target (without the missing values).
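Since the lags above are written with ellipses, here is one way that explicit list could be built in code (an illustration; Darts expects a plain list of integers for this parameter):

```python
# Gapped future-covariate lags: 7 lags in the past, skip the 7 missing
# steps of the gap, then 7 lags in the future.
lags_future_covariates = list(range(-7, 0)) + list(range(7, 14))
print(lags_future_covariates)
# [-7, -6, -5, -4, -3, -2, -1, 7, 8, 9, 10, 11, 12, 13]
```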
If your data is long enough (or the size of the gap is not constant), you can train the model on the longest slice without missing values and then infer the missing sections by slicing the series passed to predict() so that it ends just before the gap (since Darts predicts n values after the end of the provided series).
One of the points of Darts is precisely to generate those features for you under the hood, but you can manually create them by calling the methods from the tabularization module directly, or by adding breakpoints in the code. After working a bit with lags (and understanding how tabularization works, based on the examples in the regression model example notebook), it gets easier.
Thanks a lot for the answer madtoinou!
Using lags in the way you described makes a lot of sense! By the way, can I use slicing when providing negative lags, something like lags = [-8:-15]?
Do you have any suggestion for what would be a good standard model to use when imputing? Since including the seasonality in the imputation is pretty important in my scenario, do you think Seasonal-Trend decomposition using Loess (STL) would be a good starting point? I think it works by decomposing the time series into seasonal, trend, and residual components, interpolating each of these series separately, and then adding them back together.
Your suggestion of regression with future covariates used that way sounds promising and advanced. It would rely a lot on the model realizing that the future covariate (the target value after the gap) is where the target predictions should end up. Let's say it works really well, but the forecasted imputation does not add up to the value after the gap (the daily sales do not add up to the known weekly sales): is there an easy Darts method (or general method) to trim (or expand) the series so that it fits the target? Say the imputed values turn out to be slightly less than the next known value; then you could somehow add (or subtract) the difference so that the daily sales total up to that week of sales. I think I could figure this out with some messy Python function, but it would be awesome if there were a built-in way I did not know of.
Regarding future covariates, there is one part that I am having a hard time understanding. For example, using weather ground truth as past covariates and weather forecasts as future covariates makes total sense, but in practice you would not be able to provide weather forecasts for the past. Most weather APIs have historical weather data going back years and weather forecasts for about 2 weeks into the future, but historical weather forecasts are not saved. So if you provide weather forecasts as your future covariates, they would only be available for at most a couple of weeks (just 2 weeks into the future if you are predicting from the current day). In this case, are you supposed to use historical weather data as your future covariates as well (I hope you understand what I mean)? With your other example, time features (datetime_attribute_timeseries), you would always have access to them, since you can generate those features from the datetime itself.
For target and past covariates lags (respectively lags and lags_past_covariates), if you want a gap between the first lag and the forecasted period, you need to pass a list of the lags, for example lags=list(range(-15, -7)).
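To answer the earlier slicing question directly: lags = [-8:-15] is not valid Python syntax, but range() gives the intended effect. A quick check of what that call expands to:

```python
# range(-15, -7) produces the integers -15 up to (but not including) -7,
# i.e. lags -15..-8, leaving a 7-step gap before the forecasted period.
lags = list(range(-15, -7))
print(lags)  # [-15, -14, -13, -12, -11, -10, -9, -8]
```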
You could potentially use some of the hierarchical reconciliation algorithms, but it is probably not the most straightforward way of solving this problem. Since this gap is unknown, you cannot really tell whether the last forecast being "slightly less than the next known value" is problematic or not. You could eventually smooth the values, but it becomes very arbitrary...
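If you do want a quick (admittedly arbitrary) adjustment, one simple option is to rescale the imputed values proportionally so they sum to the known total. This is not a built-in Darts method; scale_to_total is a hypothetical helper:

```python
# Rescale imputed daily values so they sum to a known weekly total.
# Proportional scaling preserves the shape of the imputed sequence,
# unlike adding a constant offset to every value.
def scale_to_total(values, known_total):
    current = sum(values)
    return [v * known_total / current for v in values]

imputed = [10, 12, 11, 9, 13, 12, 10]   # sums to 77
adjusted = scale_to_total(imputed, 70)  # force the week to total 70
print(round(sum(adjusted), 6))          # 70.0
```

Whether proportional scaling or an additive correction is more appropriate depends on what caused the mismatch, which is exactly the arbitrariness mentioned above.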
Using different representations of the same measurements (historic weather/weather forecasts) as covariates is absolutely not necessary. If you did not save historical weather forecasts, you can't really train a model using them as future covariates. As an approximation, you could use the historical weather as future covariates during training (as you mentioned), knowing that actual forecasts will never be as accurate: validating this model with the actual weather forecasts as covariates becomes critical. I don't think there is an ideal approach; it ultimately depends on your dataset and the variable you're trying to predict.
Thanks again for the informative response!
After doing some testing myself, I don't think it's easy to just even out a prediction, since you will never know the last step. I came up with a crude solution of taking the known last step minus the average value of the whole series and setting that as the "goal" for the evening out. But I still suspect I am not doing a good job at the task.
Regarding future_covariates: in your guide, you show how to create calendar attributes in the model instantiation:
model = LinearRegressionModel(
    lags=None,
    lags_future_covariates=(24, 1),
    add_encoders={
        "cyclic": {"future": ["minute", "hour", "dayofweek", "month"]},
        "tz": "CET",
    },
)
With this, does that mean that those covariates will be created when the fit method is called? Also, how do I do this for the add_holidays method (e.g. series.add_holidays("US"))?
I realize now that I should have made a new issue for some of these questions! I will make sure to do that next time if I have any further questions.
It's difficult to give recommendations without being familiar with the data/task, but it's always good to start with something and improve it step by step.
The encoders create the covariates on the spot, when either fit() or predict() is called.
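As an aside on what the "cyclic" encoder produces: roughly, each datetime attribute is mapped to a sin/cos pair so that the end and the start of the cycle (e.g. 23:00 and 00:00) land close together in feature space. A hedged sketch of the idea, illustrative only and not Darts' implementation:

```python
import math

# Cyclic encoding of a periodic attribute: map the value onto the unit
# circle so distances reflect the cycle, not the raw integer ordering.
def cyclic_encode(value, period):
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

print(cyclic_encode(0, 24))   # (0.0, 1.0) -> midnight
print(cyclic_encode(6, 24))   # ~(1.0, 0.0) -> 6 AM, a quarter turn around
```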
I saw that you opened a separate issue about adding the possibility to generate holiday covariates using the encoders; let's continue the conversation there.
Leaving this issue open as it will automatically be closed when the linked PR is merged.
Thank you for the updates, which provide a useful alternative to using fit_from_dataset() with a shifting parameter when creating the DataLoader. My question concerns testing time, especially historical forecasts: should we manually add the gap in the predictions, for example by doing something like concatenate([forecast.shift(shift) for forecast in hf]), or is this process automated?
@gitbooo, with the new output_chunk_shift model creation parameter from Darts version 0.28.0, everything is handled under the hood. See #2176.
Hello again 👋
I have tried to find how to add a gap between training and the forecasting period. Let's say I am training a model and I want it to predict days for NOT the next week, but the week after. So if the timestep is daily (one per day), I would need a gap of 7.
I have read several of your tutorials, but as before, I could have misunderstood. From what I can discern, they say nothing about adding a 'gap' in either TimeSeries.from_dataframe or model.fit(). The docs seem to suggest that I should look at fit_from_dataset(), but from what I can see, there is no explanation of how to use that method for adding a gap.
I am pretty sure I could add the gap with a regular Python for loop and then convert the result to a TimeSeries object, though I am not sure it would work as I wanted. TensorFlow, in their time series guide, also shows how to add a gap/offset: https://www.tensorflow.org/tutorials/structured_data/time_series.
Anyways, Merry Christmas and thanks for your awesome work on darts 🎄