nredell / forecastML

An R package with Python support for multi-step-ahead forecasting with machine learning and deep learning algorithms

Big data #27

Closed: nredell closed this issue 4 years ago

nredell commented 4 years ago

These lagged datasets can get large: easily several GB to 100+ GB for a forecasting task like the one in the M5 competition. The problem is compounded when, inside of train_model(), the input dataset needs to be converted into a special format, for example, an xgboost matrix or a sparse matrix, because R then has to hold 2 copies of the input data in memory. This is not cool.
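To make the memory hit concrete, here is a minimal, illustrative snippet of the conversion step in question; `lagged_df_h1` is a tiny stand-in for one horizon-specific lagged dataset, not real project code:

```r
library(xgboost)

# 'lagged_df_h1' stands in for one horizon-specific dataset from
# create_lagged_df(); a small toy data.frame just so the snippet runs.
lagged_df_h1 <- data.frame(y = rnorm(100), y_lag_1 = rnorm(100), y_lag_2 = rnorm(100))

# The conversion step: while xgb.DMatrix() builds xgboost's internal copy,
# R is holding both the original data.frame and the converted object in
# memory at the same time, roughly doubling the footprint.
x <- as.matrix(lagged_df_h1[, -1])
dtrain <- xgboost::xgb.DMatrix(x, label = lagged_df_h1$y)
model <- xgboost::xgb.train(params = list(objective = "reg:squarederror"),
                            data = dtrain, nrounds = 10)
```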

Passing the entire direct, horizon-specific datasets into train_model() and further down the forecastML pipeline just doesn't make sense in big data cases. Instead, we'll make it so that you can pass a skeleton version of lagged_df into a number of functions and read the lagged data frames from disk, a database, or Spark inside of train_model().
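Roughly, the user-defined modeling function could do the reading itself. This is a sketch only: the file naming scheme and the "horizons" attribute used here to identify the forecast horizon are assumptions for illustration, not the final API.

```r
# Sketch: a modeling function for train_model() that ignores the (stripped)
# 'data' it is handed and instead reads the horizon-specific lagged dataset
# from disk. A database query or a Spark read would slot in the same way.
model_function <- function(data) {
  horizon <- attr(data, "horizons")  # assumed to survive in the skeleton's meta-data
  data_full <- readRDS(paste0("lagged_df_horizon_", horizon, ".rds"))
  lm(y ~ ., data = data_full)  # 'y' is a placeholder outcome name
}
```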

There will be a new function called create_skeleton() (Halloween-esque, eh?) that strips the training data but keeps the meta-data from create_lagged_df(). A worked example with Spark will accompany this functionality. This, along with any small bug fixes, will be the final change before v0.9.0 ships to CRAN.
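For concreteness, a rough end-to-end sketch of how the pieces might fit together. create_lagged_df(), create_windows(), and train_model() are existing forecastML functions; the create_skeleton() call, the RDS files standing in for a database or Spark table, and the toy data and argument values are assumptions based on this comment, not the shipped API.

```r
library(forecastML)

data <- data.frame(y = rnorm(300), x = rnorm(300))  # stand-in for the real, large dataset
horizons <- c(1, 6, 12)

# Build the direct, horizon-specific training datasets once.
data_train <- create_lagged_df(data, type = "train", method = "direct",
                               outcome_col = 1, lookback = 1:12,
                               horizons = horizons)

# Hypothetical persistence step: one file per forecast horizon (a database
# or Spark table would play this role in the worked example).
for (i in seq_along(data_train)) {
  saveRDS(data_train[[i]], paste0("lagged_df_horizon_", horizons[i], ".rds"))
}

# Keep only the meta-data; the training data no longer travels through the pipeline.
data_skeleton <- create_skeleton(data_train)

windows <- create_windows(data_skeleton, window_length = 0)

# The lightweight skeleton flows through train_model(); the disk reads happen
# inside the user-defined model_function sketched above.
model_results <- train_model(data_skeleton, windows,
                             model_name = "LM", model_function = model_function)
```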

nredell commented 4 years ago

Done and done