unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

[BUG] min_train_series_length does not take lags_past_covariates into consideration #1823

Open anne-devries opened 1 year ago

anne-devries commented 1 year ago

Describe the bug
The min_train_series_length for lgbm, catboost, xgboost and regression_model currently only considers the target lags and the output chunk length. However, this definition should also include the past covariates lags, e.g. max(-self.lags["target"][0], -self.lags["past_covariates"][0]) + self.output_chunk_length instead of -self.lags["target"][0] + self.output_chunk_length.
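A minimal sketch of the proposed property, assuming a RegressionModel-like object whose self.lags dict maps "target" and "past_covariates" to sorted lists of negative lags (the key names follow the snippet above and are an assumption about the internal layout, not darts' confirmed API):

```python
class _LaggedModelSketch:
    def __init__(self, lags, lags_past_covariates, output_chunk_length):
        # Lags are negative offsets, sorted ascending, e.g. [-5, -4, ..., -1].
        self.lags = {"target": sorted(lags)}
        if lags_past_covariates is not None:
            self.lags["past_covariates"] = sorted(lags_past_covariates)
        self.output_chunk_length = output_chunk_length

    @property
    def min_train_series_length(self) -> int:
        # Deepest (most negative) lag across target and past covariates,
        # falling back to the target lags when no past covariates are used.
        deepest = min(
            self.lags["target"][0],
            self.lags.get("past_covariates", self.lags["target"])[0],
        )
        # Cover the deepest lag plus the forecast window the model trains on.
        return -deepest + self.output_chunk_length

# lags=[-2, -1], lags_past_covariates=[-5, ..., -1], output_chunk_length=2
model = _LaggedModelSketch([-2, -1], list(range(-5, 0)), 2)
print(model.min_train_series_length)  # 7, instead of 4 from target lags alone
```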

Additional context
Link to gitter conversation: https://matrix.to/#/!uumevxjBaNJhovFYgj:gitter.im/$w4d7QoL4FaF3wXfxx0HeG9_iK5Rey5AfpY0kgazi9Ac?via=gitter.im&via=matrix.org&via=matrix.thegolem.cz

anne-devries commented 1 year ago

@dennisbader @madtoinou as discussed, I will work on this!

dennisbader commented 1 year ago

Hi @anne-devries, and thanks for having a go at this.

Rather, we should have dedicated minimum length requirements for the target, past covariates, and future covariates.

Otherwise, it can happen that we require more target time steps than are actually needed (for example with lags=[-1], lags_past_covariates=[-1, -2]).

anne-devries commented 1 year ago

Hi @dennisbader, I don't really get why. When you instantiate a model with e.g. output_chunk_length = 2, lags = 2 and lags_past_covariates = 5, the min_train_series_length would become 5, while to actually be able to train the model we need 8. Also, I couldn't figure out where this min_train_series_length property is used in the darts library (as far as I could find, it isn't used as a check); could you clarify that for me? Thanks!
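(For reference, a quick arithmetic check of the numbers above; the trailing + 1 is an assumption about how the current property counts steps, chosen only because it reproduces the 5 and 8 quoted here.)

```python
output_chunk_length = 2
max_target_lag = 2        # from lags = 2
max_past_cov_lag = 5      # from lags_past_covariates = 5

# Current behaviour considers target lags only (assumed extra step of 1).
current = max_target_lag + output_chunk_length + 1
# Once past covariate lags are considered, more history is needed.
needed = max(max_target_lag, max_past_cov_lag) + output_chunk_length + 1

print(current, needed)  # 5 8
```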

dennisbader commented 1 year ago

Hi @anne-devries, sure, let me explain: Darts models handle target and past/future covariates slicing under the hood.

Let's look at two examples:

Ex1: lags = [-1], lags_past_covariates = [-1], output_chunk_length = 1

Ex2: lags = [-1], lags_past_covariates = [-2], output_chunk_length = 1

We want the minimum required time span per target/covariates series rather than a global minimum, because sometimes covariates are only available up to a specific point, and we want to allow for the maximum trainable time window.
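To make the two examples concrete, here is an illustrative sketch of per-series minimums for a single training sample (function and variable names are hypothetical, not darts' API; lags are negative offsets). In Ex2, a single global minimum based on the deepest lag would demand -min([-1, -2]) + 1 = 3 target steps, even though the target itself only needs 2:

```python
def per_series_minimums(lags, lags_past_covariates, output_chunk_length):
    # Target: must reach back to its own deepest lag and still leave
    # room for the training labels (the output chunk).
    min_target = -min(lags) + output_chunk_length
    # Past covariates: only need to span their own lag window.
    min_past = max(lags_past_covariates) - min(lags_past_covariates) + 1
    return min_target, min_past

print(per_series_minimums([-1], [-1], 1))  # Ex1 -> (2, 1)
print(per_series_minimums([-1], [-2], 1))  # Ex2 -> (2, 1)
```

Either way the target minimum stays at 2; only the covariate window shifts, which is why separate requirements allow a larger trainable window when covariates end early.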

Regarding where it's used: it's used, for example, in the fit methods as a sanity check that the series is long enough (we can also do this check for covariates), and in ForecastingModel.residuals(), ... . If you have an IDE, you can look for all occurrences of the attribute in the code.
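A hypothetical sketch of the kind of sanity check meant here (the helper name and message are illustrative, not darts' actual code):

```python
def _check_min_length(series, min_len: int, name: str = "series") -> None:
    # Raise early if the series is too short to build even a single
    # training sample for the configured lags / output chunk.
    if len(series) < min_len:
        raise ValueError(
            f"`{name}` needs at least {min_len} time steps for training, "
            f"but only has {len(series)}."
        )
```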

anne-devries commented 1 year ago

Hi Dennis, I think I now understand what you mean. I'll have a look and try to implement it for those three (target, past covariates, and future covariates) separately.