unit8co / darts

A Python library for user-friendly forecasting and anomaly detection on time series.
https://unit8co.github.io/darts/
Apache License 2.0

Reduce saved models file size #2429

Closed jlopezpena closed 2 months ago

jlopezpena commented 5 months ago

Is your feature request related to a current problem? Please describe.

Training a global forecasting model that uses LightGBM on a relatively large dataset and then saving the resulting trained model produces a massive file (over 1 GB). The model itself is not complex enough to warrant this size, so I am guessing Darts is storing the training dataset alongside the trained model. This is inconvenient for two reasons:

Describe proposed solution

There should be an option passed to the model.save method, or even better, a method (something like model.prune()) that would get rid of any and all data artifacts that are not required for inference. As this might break existing functionality that relies on the dataset being present, it would be acceptable to have this "pruned model" be a separate class with reduced functionality: basically, just what is needed for prediction, with no support for further training, backtesting, or anything like that. If a model is needed for those purposes, the full version can still be stored, but a thin alternative for deployment would be very useful.

madtoinou commented 5 months ago

Hi @jlopezpena,

The training series is indeed stored in the self.training_series attribute if the model is fitted on a single series. It simplifies the prediction step: Darts can assume that the user wants to forecast n values after the end of this training series, so the user doesn't have to pass it again.

You can easily overwrite/remove it before saving your model with model.training_series = None. The only downside is that you will then always have to provide an input series during inference. The covariates are also stored in self.past_covariate_series and self.future_covariate_series; you can remove them with the same approach.
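For illustration, a minimal sketch of this workaround on a toy series (the lags, output_chunk_length, and file name are placeholders, not values from this thread):

```python
import numpy as np
from darts import TimeSeries
from darts.models import LightGBMModel

# toy series, just to have something to fit on
series = TimeSeries.from_values(np.random.randn(200))

model = LightGBMModel(lags=24, output_chunk_length=12)
model.fit(series)

# drop the stored training data before saving; predict() will then
# require the input series (and covariates, if any) to be passed explicitly
model.training_series = None
model.past_covariate_series = None
model.future_covariate_series = None
model.save("lgbm_pruned.pkl")

# at inference time, provide the series again
forecast = model.predict(n=12, series=series)
```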

We could indeed make it an argument/dedicated method to make it a bit simpler.

jlopezpena commented 5 months ago

Thanks for your answer @madtoinou! Providing a series for inference is fine, and it is actually the desired mode of operation for a global forecasting model that has been trained on multiple series. I will test your suggestion and report back on the outcome!

madtoinou commented 5 months ago

Just FYI, if the model has been trained on multiple series, the training series/covariates are not saved :)

jlopezpena commented 5 months ago

Yeah, I just realised that. It looks like it is not the training data, but multiple copies of the trained model being stored, one for each prediction horizon: I have about 30 copies of LGBMRegressor, each taking about 35 MB, stored in the model.model.estimators_ attribute. Probably not much can be done about that, unfortunately 😞
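For anyone hitting the same thing, a minimal sketch for inspecting this on toy data (the sizes printed will of course differ from the ~35 MB reported here):

```python
import pickle

import numpy as np
from darts import TimeSeries
from darts.models import LightGBMModel

# toy setup: with output_chunk_length=30 (and the default multi_models=True),
# one LGBMRegressor is fitted per forecasted step
series = TimeSeries.from_values(np.random.randn(500))
model = LightGBMModel(lags=24, output_chunk_length=30)
model.fit(series)

# model.model wraps one fitted regressor per horizon position;
# each one contributes its pickled size to the saved file
for i, est in enumerate(model.model.estimators_):
    print(f"estimator {i}: {len(pickle.dumps(est))} bytes")
```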

madtoinou commented 4 months ago

You can actually have an impact on this, but the performance of the model is likely to decrease a little bit: there are 30 models, one for each step/position in the output_chunk_length. If you set multi_models=False when you create the model, only one of them will be created and the lags will be shifted into the past for each position (see illustration).

There is a trade-off between model size and performance, sadly. Since you have such a long output_chunk_length, you might be able to reduce the size of the model a bit by reducing the number of lags (input features), but again, this will probably hurt the model's forecasting capabilities.
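A minimal sketch combining both knobs (the exact values are placeholders):

```python
from darts.models import LightGBMModel

# one shared LGBMRegressor reused for every step of the horizon,
# instead of one fitted regressor per position in output_chunk_length
small_model = LightGBMModel(
    lags=12,                # fewer input features -> smaller trees
    output_chunk_length=30,
    multi_models=False,     # single underlying model, lags shifted per step
)
```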

madtoinou commented 2 months ago

Closing this issue, just realized it's a duplicate of #1836.