unit8co / darts

A python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

Cannot fit RegressionModel (LinearRegression) with NaN values #2294

Closed Vitorbnc closed 3 months ago

Vitorbnc commented 8 months ago

Describe the bug

I am trying to fit a model on a dataset that does not contain data for every timestamp. The data frequency is business days. When I try to fit the model, it raises:

ValueError: Input X contains NaN. LinearRegression does not accept missing values encoded as NaN natively.

However, if I try to use dropna() to get rid of the missing days, when calling TimeSeries.from_dataframe() they are added again, still with NaN values.

Is there a way to fix this?

To Reproduce

import pandas as pd 
from darts import TimeSeries
from darts.utils.model_selection import train_test_split
from darts.models import RegressionModel
from sklearn.linear_model import LinearRegression

# Dataset source: https://www.kaggle.com/datasets/felsal/ibovespa-stocks?resource=download
ibov = pd.read_csv('../datasets/ibovespa/archive/b3_stocks_1994_2020.csv', parse_dates=['datetime'])
ibov_ts = TimeSeries.from_group_dataframe(
    ibov, time_col='datetime', freq='B', group_cols=['ticker'], value_cols='close'
)

ibov_ts_sorted = sorted(ibov_ts, key=lambda x: len(x))  # Get longest series
ibov_ts_train, ibov_ts_test = train_test_split(ibov_ts_sorted[-100:], test_size=0.3, axis=1)  # Split by datetime
m = RegressionModel(lags=10, model=LinearRegression(), use_static_covariates=False)
# m.fit(ibov_ts_train)  # This gives NaN error
ts_train_clean = TimeSeries.from_dataframe(
    ibov_ts_train[-1].pd_dataframe().dropna().reset_index(),
    time_col='datetime', freq='B', value_cols='close'
)
m.fit(ts_train_clean)  # This also gives NaN error

Expected behavior

I expected the NaN values to be dropped, or training to work regardless.

Additional context

I cannot fill missing values with interpolation or zeros, as that would not be realistic.

Vitorbnc commented 8 months ago

Also an unrelated question, is there a way to preserve or get the group_cols value when using TimeSeries.from_group_dataframe()? Say for plotting later on?

madtoinou commented 8 months ago

Hi @Vitorbnc,

The TimeSeries data structure comes with some guarantees, one of them being: "TimeSeries are guaranteed to have a proper time index (integer or datetime based): complete and time-sorted." (doc).

Hence, if there are missing timestamps in your index, Darts will add them with NaN values. You're then responsible for replacing these values, for example with the MissingValuesFiller (doc) or any other method that you deem acceptable.
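To illustrate why dropna() followed by freq='B' does not help, here is a minimal pandas-only sketch of the same effect. It does not call Darts; it assumes that enforcing a complete business-day index behaves like pandas' asfreq, and the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one business week with Wednesday 2024-01-03 missing.
dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"])
df = pd.DataFrame({"close": [10.0, 10.5, 11.0, 10.8]}, index=dates)

# The frame has no NaN at this point (as after dropna()).
assert not df["close"].isna().any()

# Enforcing a complete business-day index re-creates the missing day as NaN.
complete = df.asfreq("B")
print(complete)
```

The row for the missing day comes back with a NaN value, so dropping rows before building a complete index cannot remove the gap.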

As for the TimeSeries.from_group_dataframe() method, you can use the drop_group_cols argument to control whether the group_cols should be kept as static covariates or not (doc). Note that if these values are not numerical, you might need to convert them, as most of the models don't support other formats of static covariates.

Vitorbnc commented 8 months ago

@madtoinou thanks for the reply. So what if I want to keep the same group_cols as a static covariate? By default they seem to not be kept and the description of drop_group_cols says if specified there they will be dropped.

Also, what would be the most appropriate method for this type of financial data? Fill with something or try to slice the dataset into small continuous chunks for Darts?

moleary-gsa commented 7 months ago

@madtoinou It looks like the answer for this is to use another library then?

It's not always possible to fill missing data, and that is exactly why we end up using the RegressionModel in those cases.

madtoinou commented 7 months ago

@Vitorbnc I am going to investigate that. If the group_cols are dropped when the argument specifies otherwise, it's indeed a bug.

@moleary-gsa If you don't want to interpolate the missing dates (which should always be possible, with varying success), you can split your TimeSeries into several NaN-free TimeSeries using the gaps() method (doc; see also the longest_contiguous_slice() method) and then use multi-series training. If the missing dates are too frequent and the resulting slices are too short to extract the desired lags, you could instead replace the index with a RangeIndex and drop the empty rows. Dates that are actually far apart would then end up adjacent, which distorts the lags and the temporal resolution seen by the model. Finally, you can always tabularize the data yourself and use the sklearn implementations directly.
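The last option above (split on the gaps, tabularize by hand, call sklearn directly) could look roughly like the sketch below. It uses only pandas, NumPy and scikit-learn rather than the Darts gaps() method, and the helper names contiguous_chunks and lagged_table, as well as the toy ramp data, are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def contiguous_chunks(values: pd.Series) -> list:
    """Split a series into maximal runs that contain no NaN."""
    is_valid = values.notna()
    # Number of NaNs seen so far: constant within a NaN-free run.
    gap_count = values.isna().cumsum()
    return [chunk for _, chunk in values[is_valid].groupby(gap_count[is_valid])]


def lagged_table(chunk: pd.Series, n_lags: int):
    """Build (X, y): each row of X holds the n_lags values preceding y."""
    arr = chunk.to_numpy()
    if len(arr) <= n_lags:
        # Chunk too short to produce a single training sample.
        return np.empty((0, n_lags)), np.empty(0)
    X = np.lib.stride_tricks.sliding_window_view(arr, n_lags)[:-1]
    y = arr[n_lags:]
    return X, y


# Hypothetical 30-minute series (a simple ramp) with two gaps.
index = pd.date_range("2024-01-01", periods=15, freq="30min")
raw = np.arange(15, dtype=float)
raw[[5, 6, 13]] = np.nan
series = pd.Series(raw, index=index)

n_lags = 2
chunks = contiguous_chunks(series)
tables = [lagged_table(chunk, n_lags) for chunk in chunks]
X = np.vstack([t[0] for t in tables])
y = np.concatenate([t[1] for t in tables])

# No lag window crosses a gap, so X contains no NaN.
model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[11.0, 12.0]]))
print(pred)
```

Because lag windows are built per chunk, no sample mixes values from both sides of a gap, which is the property the RangeIndex workaround loses.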

moleary-gsa commented 7 months ago

@madtoinou Thank you for your response.

The use case for me doesn't allow interpolation unfortunately. The data are wind forecasts where the data generation process may as well be Brownian motion! We also have gaps of a day or more in our history (on a 30min frequency) which would result in pretty useless data if we were to interpolate.

Of course if we see such gaps in data for online prediction we would fail to produce a forecast and raise an error so that is less of an issue here. The issue is that we would still like to be able to generate historical forecasts to run backtests on.

I will try the approach of splitting the data at the gaps, modelling multiple individual series, and concatenating them together at the end. (I will search for examples of this, but if one comes to mind please share.)

madtoinou commented 7 months ago

Another solution, not yet available (see #2362), would be to put dummy values at the missing timestamps and then use sample weighting to ignore those values during training. This is probably the most elegant solution for training, but it won't help with forecasting or backtesting.

Generating historical forecasts on these fragmented series will be tough: you will have to do a lot of manual work to recombine them and evaluate the results.

I cannot think of such an example in the documentation or other available resources, sorry.

madtoinou commented 3 months ago

Closing this, as the sample weights feature has been released for regression models and makes it possible to mask missing values in the training dataset.
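For readers unfamiliar with the idea, here is a minimal sketch of masking by sample weights at the scikit-learn level. It does not use the Darts API; it assumes a dummy fill value of 0.0 and hypothetical toy data, and it gives zero weight to every training sample whose lags or target touch a missing timestamp.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

n_lags = 2

# Hypothetical series (a simple ramp) with missing values at three positions.
values = np.arange(15, dtype=float)
values[[5, 6, 13]] = np.nan

missing = np.isnan(values)
filled = np.where(missing, 0.0, values)  # dummy value, an arbitrary choice

# Each window holds n_lags features followed by the target.
windows = np.lib.stride_tricks.sliding_window_view(filled, n_lags + 1)
missing_windows = np.lib.stride_tricks.sliding_window_view(missing, n_lags + 1)
X, y = windows[:, :-1], windows[:, -1]

# Weight 0 for any sample that contains a dummy value, 1 otherwise.
weights = (~missing_windows.any(axis=1)).astype(float)

model = LinearRegression().fit(X, y, sample_weight=weights)
pred = model.predict(np.array([[11.0, 12.0]]))
print(weights, pred)
```

The zero-weight rows stay in the table, so the time index remains complete, but they do not influence the fitted coefficients. As noted earlier in the thread, this only addresses training, not forecasting or backtesting across the gaps.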

Tracking the issues with group_cols in a separate issue.