Closed: Vitorbnc closed this issue 3 months ago
Also, an unrelated question: is there a way to preserve or get the group_cols value when using TimeSeries.from_group_dataframe()? Say for plotting later on?
Hi @Vitorbnc,
The TimeSeries data structure comes with some guarantees, one of them being: "TimeSeries are guaranteed to have a proper time index (integer or datetime based): complete and time-sorted." (doc).
Hence, if there are missing timestamps in your index, Darts will add them with NaN values. You are then responsible for replacing these values, for example with the MissingValuesFiller (doc) or any other method that you deem acceptable.
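For illustration, a minimal sketch of that filling step (the variable name series and the constant fill value are assumptions, not taken from this thread):

```python
# Minimal sketch: fill the NaN rows that Darts inserts for missing timestamps.
# `series` is assumed to be a darts.TimeSeries containing NaN values.
from darts.dataprocessing.transformers import MissingValuesFiller

filler = MissingValuesFiller()              # default: interpolate the missing values
series_interp = filler.transform(series)

# or fill with a constant instead of interpolating
const_filler = MissingValuesFiller(fill=0.0)
series_const = const_filler.transform(series)
```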
As for the TimeSeries.from_group_dataframe() method, you can use the drop_group_cols argument to control whether the group_cols should be kept as static covariates or not (doc). Note that if these values are not numerical, you might need to convert them, as most models don't support other formats of static covariates.
@madtoinou thanks for the reply. So what if I want to keep the group_cols as static covariates? By default they seem not to be kept, and the description of drop_group_cols says that if they are specified there, they will be dropped.
Also, what would be the most appropriate method for this type of financial data? Fill with something, or try to slice the dataset into small continuous chunks for Darts?
@madtoinou It looks like the answer for this is to use another library then?
It's not always possible to fill missing data, and that is exactly why we end up using the RegressionModel in those cases.
@Vitorbnc I am going to investigate that; if the group_cols are dropped when the argument specifies otherwise, it's indeed a bug.
@moleary-gsa If you don't want to interpolate the missing dates (which should always be possible, with varying success), you can also split your TimeSeries into several TimeSeries (without any NaN values) using the gaps() method (doc; you could also check the longest_contiguous_slice() method) and use multi-series training.
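To make this concrete, here is an untested sketch of splitting on the gaps and fitting a single regression model on all contiguous pieces; the model choice, lags and minimum-length threshold are placeholders:

```python
# Sketch: split `series` (a TimeSeries with NaN gaps) into contiguous NaN-free
# sub-series, then train a single model on the list (multi-series training).
from darts.models import LinearRegressionModel

sub_series = []
start = series.start_time()
for _, gap in series.gaps().iterrows():       # one row per NaN interval
    end = gap["gap_start"] - series.freq      # last valid timestamp before the gap
    if end >= start:
        sub_series.append(series.slice(start, end))
    start = gap["gap_end"] + series.freq      # first timestamp after the gap
if start <= series.end_time():
    sub_series.append(series.slice(start, series.end_time()))

# keep only slices long enough to build the lags you need
sub_series = [s for s in sub_series if len(s) >= 30]

model = LinearRegressionModel(lags=12)
model.fit(sub_series)                         # list of series -> multi-series training
```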
If the missing dates are too frequent and the resulting TimeSeries slices are too short (to extract the desired lags), then you could instead replace the index with a RangeIndex and drop the empty rows, but it would mean that dates that are supposed to be far apart end up next to each other (messing with the lags and the temporal resolution of the model). Finally, you can always tabularize the data yourself and use the sklearn implementations directly.
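A hedged sketch of that RangeIndex fallback, assuming the original DataFrame df is indexed by date and has a hypothetical price column; keep in mind the caveat above that adjacent integer steps may actually be days apart:

```python
# Sketch: drop the empty rows and rebuild the series on an integer RangeIndex.
from darts import TimeSeries

clean_df = df.dropna().reset_index(drop=True)   # keep only rows where data exists
series_int = TimeSeries.from_dataframe(clean_df, value_cols=["price"])
# series_int now has a RangeIndex; the temporal distance between rows is no longer uniform
```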
@madtoinou Thank you for your response.
The use case for me doesn't allow interpolation unfortunately. The data are wind forecasts where the data generation process may as well be Brownian motion! We also have gaps of a day or more in our history (on a 30min frequency) which would result in pretty useless data if we were to interpolate.
Of course if we see such gaps in data for online prediction we would fail to produce a forecast and raise an error so that is less of an issue here. The issue is that we would still like to be able to generate historical forecasts to run backtests on.
I will try the approach of splitting the data using the gaps, modelling multiple individual series, and concatenating them together at the end (I will search for examples of this, but if one comes to mind please share).
Another solution, not yet available (see #2362), would be to have dummy values for the missing timestamps and then leverage sample weighting to ignore those values during training. Probably the most elegant solution for training; it won't help for forecasting/backtesting, however.
Generating historical forecasts on these fragmented series will be tough; you will have to do a lot of things manually to recombine them and evaluate the results.
I cannot think of such an example in the documentation/available resources, sorry.
Closing this as the sample_weights feature was released for regression models and allows masking missing values in the training dataset.
Tracking the issue with group_cols in a separate issue.
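For reference, an untested sketch of the masking approach described above, assuming a single-component series with NaN at the missing dates and a Darts version where fit() accepts sample_weight:

```python
# Sketch: replace NaNs with dummy values, and give those timestamps zero weight
# so a regression model ignores them during training.
import numpy as np
from darts import TimeSeries
from darts.models import LinearRegressionModel
from darts.utils.missing_values import fill_missing_values

weights = TimeSeries.from_times_and_values(
    series.time_index,
    np.where(np.isnan(series.values()), 0.0, 1.0),  # 0 where the target is missing
)
target = fill_missing_values(series, fill=0.0)      # dummy values instead of NaN

model = LinearRegressionModel(lags=12)
model.fit(target, sample_weight=weights)
```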
Describe the bug: I am trying to fit a model for a dataset that does not contain data for every single timestamp. The data frequency is business days. When I try to fit the model, it says:
However, if I try to use dropna() to get rid of the missing days, they are added again, still with NaN values, when calling TimeSeries.from_dataframe(). Is there a way to fix this?
To Reproduce
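The original snippet is not preserved here; below is a hypothetical minimal setup of the kind described above (dates, the price column name, and the freq/fill_missing_dates arguments are assumptions about how the series was built):

```python
# Hypothetical reproduction: business-day data with one missing day.
import numpy as np
import pandas as pd
from darts import TimeSeries

idx = pd.bdate_range("2024-01-01", periods=10).delete(4)   # drop one business day
df = pd.DataFrame({"price": np.random.rand(len(idx))}, index=idx)

df = df.dropna()  # has no effect: the missing day simply has no row

# Rebuilding the series on a business-day frequency re-inserts the missing day as NaN,
# which later makes model.fit() fail.
series = TimeSeries.from_dataframe(df, freq="B", fill_missing_dates=True)
print(series.gaps())
```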
Expected behavior: Expected NaN values to be dropped, or training to work.
System (please complete the following information):
Additional context: I cannot fill missing values with interpolation or zeros, as that would not be realistic.