Open furkancanturk opened 1 month ago
Hi @furkancanturk,
Darts relies directly on the ARIMA implementations from statsmodels, pmdarima and statsforecast. To my knowledge, none of these models support past covariates as Darts defines them. Do you have a reference to an implementation that would support them?
Note that the constraints on future covariates are more severe than those on past covariates, meaning that if a model supports future covariates, you can pass past covariates as future covariates.
The following example trains an ARIMAX model with future and lagged exogenous variables by generating the lag features explicitly. However, this kind of workflow diverges from the Darts pipelines I have implemented. I suspect that a separate pipeline for ARIMAX would prevent us from using some Darts modelling functionality (especially the arguments of the fit and predict methods) that exists for models with past covariates.
import pandas as pd
import numpy as np
import statsmodels.api as sm

np.random.seed(0)

# Synthetic daily data: crop volume driven by rain and sun levels plus noise
dates = pd.date_range(start='2024-01-01', periods=20, freq='D')
rain_levels = np.random.randint(5, 15, size=20)
sun_levels = np.random.randint(4, 10, size=20)
crop_volumes = 100 + rain_levels * 2 + sun_levels * 3 + np.random.normal(0, 5, size=20)
n_lags = 5

data = pd.DataFrame({
    'Date': dates,
    'Rain Level': rain_levels,
    'Sun Level': sun_levels,
    'Crop Volume': crop_volumes
})
data.set_index('Date', inplace=True)

# Build lagged copies of each covariate (range(1, n_lags) yields lags 1 to n_lags - 1)
for lag in range(1, n_lags):
    data[f'Rain Level Lag {lag}'] = data['Rain Level'].shift(lag)
    data[f'Sun Level Lag {lag}'] = data['Sun Level'].shift(lag)
data = data.dropna()  # drop the leading rows whose lagged values are undefined

# Hold out the last 5 days for testing
train_data = data.iloc[:-5]
test_data = data.iloc[-5:]

exog_vars = ['Rain Level', 'Sun Level'] + [f'Rain Level Lag {lag}' for lag in range(1, n_lags)] + [f'Sun Level Lag {lag}' for lag in range(1, n_lags)]
exog_train = train_data[exog_vars]
exog_test = test_data[exog_vars]

# AR(5) model with the current and lagged covariates as exogenous regressors
model = sm.tsa.ARIMA(train_data['Crop Volume'], order=(5, 0, 0), exog=exog_train, freq='D')
model_fit = model.fit()
# Out-of-sample forecast over the test period, using the test covariates
predictions = model_fit.predict(start=len(train_data), end=len(train_data)+len(test_data)-1, exog=exog_test)
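The lag-generation loop above can be factored into a small helper so that the same lagged columns are built consistently for any set of covariates. The following is only a sketch: `make_lag_features` is a hypothetical name (it is not part of pandas, statsmodels or Darts), and the demo values are made up.

```python
import pandas as pd


def make_lag_features(df: pd.DataFrame, columns: list[str], lags: list[int]) -> pd.DataFrame:
    """Return a copy of `df` with one shifted column per (column, lag) pair.

    Rows containing NaN after shifting (at least the first max(lags) rows)
    are dropped.
    """
    out = df.copy()
    for col in columns:
        for lag in lags:
            # shift(lag) moves values down, so row t holds the value observed at t - lag
            out[f"{col} Lag {lag}"] = out[col].shift(lag)
    return out.dropna()


# Tiny usage example with made-up numbers
demo = pd.DataFrame(
    {"Rain Level": [5, 7, 9, 11, 13]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
lagged = make_lag_features(demo, columns=["Rain Level"], lags=[1, 2])
```

Because the helper drops the leading rows, the target series has to be aligned to the same index before fitting.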
I see. You should be able to achieve exactly the same thing with Darts by converting your dataframe into two distinct series:
from darts import TimeSeries
from darts.models import ARIMA

# from_series takes a pandas Series; from_dataframe takes a DataFrame
tgt_series = TimeSeries.from_series(train_data['Crop Volume'])
cov_series = TimeSeries.from_dataframe(train_data[exog_vars])
model = ARIMA(p=5, d=0, q=0)
# this is equivalent to using the `exog` argument of statsmodels
model.fit(tgt_series, future_covariates=cov_series)
To learn more about the Darts terminology, notably past/future covariates, I would recommend reading this section of the documentation.
Thank you for your fast response.
Yes, I can achieve the same predictions with Darts. However, this implementation would require a lot of custom pipelines. I've just checked quickly what would be necessary at first and see that lags_past_covariates and lags_future_covariates would be needed for fit and predict. I'm not sure whether other parts of Darts would be usable in that custom implementation. For example, transformations would need to be applied before creating the lag features in this implementation, whereas currently I apply transformations just before model fitting, after completing all data preprocessing steps. Likewise, an ensemble modelling pipeline in Darts that includes this custom implementation would be overwhelming. I imagine (but am not sure) that this custom implementation would involve just a few lines of Darts code, which would give little benefit from using the library.
I kindly ask whether it would be possible to include ARIMA with past covariates in your development plan. I'd like to contribute to this, but I'm not sure I know the library's implementation well enough to tell which parts would need to be changed and how.
Hi @furkancanturk,
I don't think that developing a module to explicitly extract the lags from the target/covariates in ARIMA fits in the Darts roadmap; in theory, the model should be capable of extracting information from the series without performing such manipulations. What is the magnitude of the gain by proceeding like this (explicitly extracting lags)?
As you described, it should be possible to extract the desired lags from the target/covariates in a pre-processing "Darts Pipeline" (doc) before fitting the model. However, this pipeline would not really be compatible with models that support the lags_* arguments, as the lagged features would become redundant.
In the current implementation, "Darts ARIMA" does support the exog argument, which can be used to pass covariates to the model (this is what you are using in your code snippet). In Darts terminology, the model supports future covariates but not past covariates (see illustration in documentation), which is not really problematic since future covariates support lags into both the past and the future.
Why is ARIMA with past covariates not implemented in the library?