unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[BUG] min_train_series_length does not take lags_past_covariates into consideration #1823

Open anne-devries opened 1 year ago

anne-devries commented 1 year ago

Describe the bug
The min_train_series_length for lgbm, catboost, xgboost and regression_model currently only considers the target lags and the output chunk length. However, this definition should also include the past covariates lags, e.g. max(-self.lags["target"][0], -self.lags["past_covariates"][0]) + self.output_chunk_length instead of -self.lags["target"][0] + self.output_chunk_length.
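A minimal sketch of the proposed property, assuming a RegressionModel-like object whose self.lags dict maps "target" and "past_covariates" to sorted lists of negative lags (the key names follow the snippet above and are an assumption about the internal layout, not darts' confirmed API):

```python
class _LaggedModelSketch:
    def __init__(self, lags, lags_past_covariates, output_chunk_length):
        # Lags are negative offsets, sorted ascending, e.g. [-5, -4, ..., -1].
        self.lags = {"target": sorted(lags)}
        if lags_past_covariates is not None:
            self.lags["past_covariates"] = sorted(lags_past_covariates)
        self.output_chunk_length = output_chunk_length

    @property
    def min_train_series_length(self) -> int:
        # Deepest (most negative) lag across target and past covariates,
        # falling back to the target lags when no past covariates are used.
        deepest = min(
            self.lags["target"][0],
            self.lags.get("past_covariates", self.lags["target"])[0],
        )
        # Cover the deepest lag plus the forecast window the model trains on.
        return -deepest + self.output_chunk_length

# lags=[-2, -1], lags_past_covariates=[-5, ..., -1], output_chunk_length=2
model = _LaggedModelSketch([-2, -1], list(range(-5, 0)), 2)
print(model.min_train_series_length)  # 7, instead of 4 from target lags alone
```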

Additional context
Link to gitter conversation: https://matrix.to/#/!uumevxjBaNJhovFYgj:gitter.im/$w4d7QoL4FaF3wXfxx0HeG9_iK5Rey5AfpY0kgazi9Ac?via=gitter.im&via=matrix.org&via=matrix.thegolem.cz

anne-devries commented 1 year ago

@dennisbader @madtoinou as discussed, I will work on this!

dennisbader commented 1 year ago

Hi @anne-devries, and thanks for having a go at this.

Rather, we should have dedicated minimum length requirements for the target, past covariates, and future covariates.

Otherwise, it can happen that we require more target time steps than are actually needed (for example with lags=[-1], lags_past_covariates=[-1, -2]).

anne-devries commented 1 year ago

Hi @dennisbader, I don't really get why. When you instantiate a model with e.g. output_chunk_length = 2, lags = 2 and lags_past_covariates = 5, the min_train_series_length would become 5, while to actually be able to train the model we need 8. Also, I couldn't figure out where this min_train_series_length property is used in the darts library (as far as I could find, it isn't used as a check); could you clarify that for me? Thanks!
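(For reference, a quick arithmetic check of the numbers above; the trailing + 1 is an assumption about how the current property counts steps, chosen only because it reproduces the 5 and 8 quoted here.)

```python
output_chunk_length = 2
max_target_lag = 2        # from lags = 2
max_past_cov_lag = 5      # from lags_past_covariates = 5

# Current behaviour considers target lags only (assumed extra step of 1).
current = max_target_lag + output_chunk_length + 1
# Once past covariate lags are considered, more history is needed.
needed = max(max_target_lag, max_past_cov_lag) + output_chunk_length + 1

print(current, needed)  # 5 8
```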

dennisbader commented 1 year ago

Hi @anne-devries, sure, let me explain: Darts models handle target and past/future covariates slicing under the hood.

Let's look at two examples:

Ex1: lags = [-1], lags_past_covariates = [-1], output_chunk_length = 1

Ex2: lags = [-1], lags_past_covariates = [-2], output_chunk_length = 1

We want the minimum required time span per target/covariates series rather than a global minimum, because sometimes covariates are only available up to a specific point, and we want to allow for the maximum trainable time window.
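To make the two examples concrete, here is an illustrative sketch of per-series minimums for a single training sample (function and variable names are hypothetical, not darts' API; lags are negative offsets). In Ex2, a single global minimum based on the deepest lag would demand -min([-1, -2]) + 1 = 3 target steps, even though the target itself only needs 2:

```python
def per_series_minimums(lags, lags_past_covariates, output_chunk_length):
    # Target: must reach back to its own deepest lag and still leave
    # room for the training labels (the output chunk).
    min_target = -min(lags) + output_chunk_length
    # Past covariates: only need to span their own lag window.
    min_past = max(lags_past_covariates) - min(lags_past_covariates) + 1
    return min_target, min_past

print(per_series_minimums([-1], [-1], 1))  # Ex1 -> (2, 1)
print(per_series_minimums([-1], [-2], 1))  # Ex2 -> (2, 1)
```

Either way the target minimum stays at 2; only the covariate window shifts, which is why separate requirements allow a larger trainable window when covariates end early.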

Regarding where it's used: it's used, for example, in the fit methods as a sanity check that the series is long enough (we can also do this check for covariates), and in ForecastingModel.residuals(), ... . If you have an IDE, you can look for all occurrences of the attribute in the code.
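A hypothetical sketch of the kind of sanity check meant here (the helper name and message are illustrative, not darts' actual code):

```python
def _check_min_length(series, min_len: int, name: str = "series") -> None:
    # Raise early if the series is too short to build even a single
    # training sample for the configured lags / output chunk.
    if len(series) < min_len:
        raise ValueError(
            f"`{name}` needs at least {min_len} time steps for training, "
            f"but only has {len(series)}."
        )
```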

anne-devries commented 1 year ago

Hi Dennis, I think I now understand what you mean. I'll have a look and try to implement it for those three (target, past covariates, and future covariates) separately.