unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
7.91k stars · 857 forks

Prediction based only on covariates #1289

Closed aerler closed 10 months ago

aerler commented 1 year ago

I am trying to predict a target time series based on past and future covariates. However, I only have limited data for the target time series, and in production it won't be available at all. On the other hand, I have many covariates which are easily available and in principle should allow fairly good predictions.

My issue is that there does not seem to be any way to train a model such that the target time series is not also used as a predictor. It appears that all models and all fit methods will always use past values of the target series as predictors, but this won't work for my problem, as these values are often not available. Furthermore, I find that if I train models with the target series as a predictor, the models become too reliant on extrapolating past values and don't really use the covariates (which is not what I want).

I am wondering if this could be addressed with a new type of Dataset, but it would seem to me that this would also have to be taken into consideration during model construction. Unfortunately I don't understand the inner workings of DARTS and Torch well enough to implement this myself...

Or am I missing something here? Is this already possible? It seems to me that this should be a fairly common type of prediction problem.

Also, thanks for publishing this fantastic package, and thanks in advance for any suggestions!

RodgervanderHeijden commented 1 year ago

Just checking if I've understood your goal correctly; do you want to predict a variable of interest using only covariates?

If so, I'm afraid you're trying to eat soup with a fork! The core of darts is handling and forecasting time series data based on some calculation of its previous values (where some models allow covariates to further capture the signal). If I interpreted your issue correctly, you explicitly do not want that to happen and instead wish to predict the target series based on the value of the covariates, irrespective of previous values. In that case, any model which does not require that temporal structure, e.g. regressions or tree-based models, would instead be suitable for your problem.

If not, I've misinterpreted your goal and would like to receive elaboration on what your problem exactly is, and how the axis of time is relevant to it.

aerler commented 1 year ago

Thanks for your response! I think you mostly understand what I am trying to do, but maybe not the "why". Yes, I want to predict a time series using only covariates. The time series is not really autoregressive, but it is strongly determined by the covariates and their temporal structure. This is very common in physical systems, and in my case, streamflow prediction based on weather forecasts, it is well supported by the scientific literature. Specifically, I want to use an LSTM for this task, as the memory/internal state of the system is critical for prediction. Judging by your response, this is not an application that the DARTS developers have considered, but it seems to me that the DARTS framework would still be very well suited to it, since it has very good handling of time series and the conceptualization is very good, except for this one problem of enforced auto-regression. Initially I didn't think it would hurt, but unfortunately it does seem to, when I compare my results with what has been published in the scientific literature (and also just by inspection). I hope this explanation makes sense. It would be a pity if I had to abandon DARTS and reinvent a lot of the machinery it provides.

tiantheunissen commented 1 year ago

Hi! Sorry to butt in. I have a similar issue. As far as I can tell, DARTS' models (at least the deep learning ones) are auto-regressive (AR). I looked into making modifications to allow non-AR time series forecasting, but AR is heavily assumed throughout DARTS' underlying modules and time series handling. I have focused on the TCN forecasting model.

A possible alternative might be to use the TFT forecasting model but make your 'targets' the future covariates and your 'future covariates' the targets. I don't think the future covariates are fed back into your model. With some tweaks you might then measure your performance on the future covariates (which are actually your targets now)? I have not tried to implement this because I am doing ad-hoc explainability and the TCN is a bit simpler to handle in that regard.

@aerler If you find anything remotely as useful as DARTS but for non-AR time series forecasting, I would be very interested :)

RodgervanderHeijden commented 1 year ago

Thanks for the elaboration, the explanation is perfectly clear! Importantly though, I'm not a contributor to darts so I do not know what they have and have not considered.

I do share your view that your desired architecture (currently) seems to not be supported by darts, sadly, but cannot expand on why that decision would have been made. I've also scanned through the roadmap but didn't see anything related to this specific use case either, so I'm afraid you're going to need to find alternatives.

hrzn commented 1 year ago

Hi, thanks for raising this point. You're right that there's an assumption throughout Darts, which is that forecasting models define future time steps as functions of historical time steps (and potentially other things, which we call covariates).

The good news is that RegressionModels are somewhat more general and do not rely on this assumption. It is possible to instantiate a RegressionModel wrapping any sklearn-compatible regressor (or to use one of LightGBMModel, RandomForestModel, LinearRegressionModel, CatBoostModel) in the following way:

my_model = RegressionModel(lags_future_covariates=[-1, 0])

This will predict the target using only the series provided as future_covariates, using only the previous lag (-1) and the current value (0) relative to the timestamp being predicted (you can of course change this and add other lags).
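To make the lag semantics concrete, here is a plain-NumPy sketch (not the Darts API; the variable names and the synthetic relationship are made up) of the regression such a model effectively fits: each target value is regressed on the covariate at the previous and current time step, and no past target values enter the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
covariate = rng.normal(size=n)

# Synthetic target driven purely by the covariate (no autoregression):
# y[t] = 2 * x[t-1] - 0.5 * x[t]
target = np.empty(n)
target[1:] = 2.0 * covariate[:-1] - 0.5 * covariate[1:]
target[0] = 0.0

# Feature matrix built from covariate lags [-1, 0] only;
# past values of the target itself are never used as features.
X = np.column_stack([covariate[:-1], covariate[1:]])
y = target[1:]

# Ordinary least squares recovers the covariate-only relationship
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # ≈ [2.0, -0.5]
```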

For deep learning models, I think we could make the target optional in some of them that are supporting future covariates (such as RNNModel). It's not really on our to-do list at the moment, but if there's a strong demand we could re-prioritize this.

AhmetZamanis commented 1 year ago

Hi, I believe I am currently having a similar issue with LinearRegressionModel.

I want to fit a linear model to learn and decompose the trend, seasonality and calendar effects in my multivariate time series. The future_covariates I'll use for this model consist of trend dummies, Fourier features, and calendar effect dummies. I do not want to create any lags for these. I also don't want to use target lags in this model, as I'll do that in the second step after decomposition. Unfortunately I can't train the model without specifying lags and lags_future_covariates when creating the model.

I want to fit a second model to the decomposed residuals of the first, and the future covariates for this second model consist of the target and covariate lags I previously created, as well as some rolling features. I don't want to create any additional lags of these either.

I believe setting lags_future_covariates = [0] solves the problem for the future covariates (I'm not sure, as I haven't tried it yet), but the documentation for LinearRegressionModel seems to suggest that lags is not optional and must be non-zero. This also seems to apply to any sklearn model instantiated through RegressionModel.

Also, if I am not mistaken, if I choose to pass only the original covariate series as future_covariates, and specify the covariate lags in RegressionModel with lags_future_covariates, this specification will apply to all covariates. This isn't very flexible, as in my case I want to use different lags for several different covariates, and no lags for some covariates, such as rolling features. This was my reason for creating the lags myself before creating the Darts TimeSeries, instead of specifying lags in lags_future_covariates.
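For what it's worth, the pre-computation I mean looks something like the following pandas sketch (the column names temp and rain are hypothetical): per-covariate lags and rolling features are built manually before wrapping the result in a Darts TimeSeries, so that no further lagging would be needed inside the model.

```python
import pandas as pd

df = pd.DataFrame(
    {"temp": [1.0, 2.0, 3.0, 4.0, 5.0],
     "rain": [0.0, 1.0, 0.0, 2.0, 1.0]},
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

features = pd.DataFrame(index=df.index)
features["temp_lag1"] = df["temp"].shift(1)            # lag only this covariate
features["rain_roll2"] = df["rain"].rolling(2).mean()  # rolling feature, no lag
features = features.dropna()                           # drop rows without full features

print(features)
```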

For now, I will attempt to do the modeling and predictions with sklearn, and put back the predictions into a Darts TimeSeries for reconciliation. Besides this small problem, I am also very happy with Darts' workflow and available methods (after trying sktime and fable in R), and I'd prefer to do all my time series analysis in Darts as much as possible.

simonweppe commented 1 year ago

Hi @hrzn, just to make sure I understand things correctly: for the regression models, if we leave lags=None when initializing the model, then the lagged copies of the target variable will NOT be used to fit the model; only covariates for which lags_past_covariates / lags_future_covariates are not None will be used. Correct?

So in the example code below, the model is fitted using the last 30 lags of variable covar1 to predict target?

Thanks for the great toolbox !

ps: it would be awesome to have this capability for the neural networks as well. I did have a look at the code, but it doesn't seem easy without a deep understanding of the code structure.

model = RandomForest(
    lags=None,                    # do not use past target (IMF) values
    lags_past_covariates=30,      # use the last 30 days of SSTA
    lags_future_covariates=None,
    output_chunk_length=30,       # predict the next 30 days
    multi_models=True,
)

model.fit(
    series=ts_train['target'],
    past_covariates=ts_train['covar1'],  # i.e. use SSTA as covariate to predict future IMFs
    future_covariates=None,
)

# make a 30-day prediction of 'target', using only lags of 'covar1'
model.predict(30, series=ts_val['target'], past_covariates=ts_val['covar1'])

hrzn commented 1 year ago

> Hi @hrzn, just to make sure I understand things correctly: for the regression models, if we leave lags=None when initializing the model, then the lagged copies of the target variable will NOT be used to fit the model; only covariates for which lags_past_covariates / lags_future_covariates are not None will be used. Correct?
>
> So in the example code, the model is fitted using the last 30 lags of variable covar1 to predict target?

Yes, correct.

simonweppe commented 1 year ago

thanks!