unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0
7.89k stars 854 forks

ARIMA with Past Covariates #2457

Open furkancanturk opened 1 month ago

furkancanturk commented 1 month ago

Why is ARIMA with past covariates not implemented in the library?

madtoinou commented 1 month ago

Hi @furkancanturk,

Darts relies directly on the ARIMA implementations from statsmodels, pmdarima and statsforecast. To my knowledge, none of these models supports past covariates as Darts defines them. Do you have a reference to an implementation that would support them?

Note that the constraints on future covariates are more severe than those on past covariates, meaning that if a model supports future covariates, you can pass past covariates as future covariates.
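One way to read this remark: a future covariate has to be known over the forecast horizon, while a past covariate is only known up to the present. A rough sketch with pandas, assuming a hypothetical covariate `rain` and a hypothetical `horizon` of 3 steps (neither is from the thread), shows how lagging a past covariate by at least the horizon gives it the coverage a future covariate needs:

```python
import numpy as np
import pandas as pd

# Hypothetical daily covariate observed only up to "today" (a past covariate).
idx = pd.date_range("2024-01-01", periods=10, freq="D")
rain = pd.Series(np.arange(10, dtype=float), index=idx, name="rain")

horizon = 3  # assumed number of steps to forecast

# Move the timestamps forward by `horizon` days: the value stamped at day t
# is now the rain observed at day t - horizon.
rain_lagged = rain.shift(horizon, freq="D").rename(f"rain_lag_{horizon}")

# Timestamps that a forecast of `horizon` steps would cover.
forecast_index = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")

# The lagged series has a value for every forecast timestamp,
# which the original (unlagged) series does not.
covers_forecast = bool(forecast_index.isin(rain_lagged.index).all())
original_covers = bool(forecast_index.isin(rain.index).any())
print(covers_forecast, original_covers)
```

The trade-off is that the model only sees covariate information that is at least `horizon` steps old.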

furkancanturk commented 1 month ago

The following example trains an ARIMAX model with future and lagged exogenous variables by generating the lag features explicitly. However, this kind of workflow diverges from the Darts pipelines I have implemented. I guess that implementing a separate pipeline for ARIMAX would prevent us from using some of the Darts modelling functionality (especially the arguments of the fit and predict methods) that is intended for models with past covariates.

import pandas as pd
import numpy as np
import statsmodels.api as sm

np.random.seed(0)

dates = pd.date_range(start='2024-01-01', periods=20, freq='D')
rain_levels = np.random.randint(5, 15, size=20)
sun_levels = np.random.randint(4, 10, size=20)
crop_volumes = 100 + rain_levels * 2 + sun_levels * 3 + np.random.normal(0, 5, size=20)
n_lags = 5

data = pd.DataFrame({
    'Date': dates,
    'Rain Level': rain_levels,
    'Sun Level': sun_levels,
    'Crop Volume': crop_volumes
})
data.set_index('Date', inplace=True)

# note: range(1, n_lags) creates lags 1 to n_lags - 1 (here, lags 1 to 4)
for lag in range(1, n_lags):
    data[f'Rain Level Lag {lag}'] = data['Rain Level'].shift(lag)
    data[f'Sun Level Lag {lag}'] = data['Sun Level'].shift(lag)

data = data.dropna()

train_data = data.iloc[:-5]
test_data = data.iloc[-5:]

exog_vars = ['Rain Level', 'Sun Level'] + [f'Rain Level Lag {lag}' for lag in range(1, n_lags)] + [f'Sun Level Lag {lag}' for lag in range(1, n_lags)]
exog_train = train_data[exog_vars]
exog_test = test_data[exog_vars]

model = sm.tsa.ARIMA(train_data['Crop Volume'], order=(5, 0, 0), exog=exog_train, freq='D')
model_fit = model.fit()
predictions = model_fit.predict(start=len(train_data), end=len(train_data)+len(test_data)-1, exog=exog_test)

madtoinou commented 1 month ago

I see. You should be able to achieve exactly the same thing in Darts by converting your dataframe into two distinct series:

from darts import TimeSeries
from darts.models import ARIMA

# the target column is a pandas Series, so build it with `from_series`
tgt_series = TimeSeries.from_series(train_data['Crop Volume'])
cov_series = TimeSeries.from_dataframe(train_data[exog_vars])

model = ARIMA(p=5, d=0, q=0)
# this is equivalent to using the `exog` argument of statsmodels
model.fit(tgt_series, future_covariates=cov_series)

To learn more about the Darts terminology, notably past/future covariates, I would recommend reading this section of the documentation.

furkancanturk commented 1 month ago

Thank you for your fast response.

Yes, I can achieve the same predictions with Darts. However, this implementation would require lots of custom pipelines. I quickly checked what would be necessary first, and it looks like support for lags_past_covariates and lags_future_covariates would be needed in fit and predict. I'm not sure whether other functionality in Darts would be usable with that custom implementation. For example, transformations would need to be applied before creating the lag features, whereas currently I apply them just before model fitting, after completing all other data preprocessing steps. Also, an ensemble modelling pipeline in Darts that includes this custom implementation would be overwhelming. I imagine (but am not sure) that this custom implementation would involve just a few lines of Darts code, so there would be little benefit from using the library.

I kindly ask whether an update for ARIMA with past covariates could be included in your development plan. I'd like to contribute to this, but I'm not sure I know the library implementation well enough to tell which parts would need to change and how.

madtoinou commented 4 days ago

Hi @furkancanturk,

I don't think that developing a module to explicitly extract the lags from the target/covariates in ARIMA fits in the Darts roadmap; in theory, the model should be capable of extracting information from the series without such manipulations. How large is the gain from explicitly extracting lags like this?

As you described, it should be possible to extract the desired lags from the target/covariates in the pre-processing "Darts Pipeline" (doc) before fitting the model. However, this pipeline would not really be compatible with models that support the lags_* arguments, as the lagged features would become redundant.

In the current implementation, "Darts ARIMA" does support the exog argument, which can be used to pass covariates to the model (this is what you are using in your code snippet). In Darts terminology, the model supports future covariates but not past covariates (see illustration in documentation), which is not really problematic since future covariates support lags into both the past and the future.