These lagged datasets can get large: easily several GB to 100+ GB for a forecasting task like that in the M5 competition. The problem is compounded when, inside train_model(), the input dataset needs to be converted into a special format, for example an xgboost matrix or a sparse matrix, because R then has to hold two copies of the input data in memory. This is not cool.
Passing the entire set of direct-horizon-specific datasets into train_model() and further down the forecastML pipeline just doesn't make sense in big data cases. Instead, we'll make it so that you can pass a skeleton version of lagged_df into a number of functions and read the lagged data frames from disk, a database, or spark inside of train_model().
There will be a new function called create_skeleton() (Halloween-esque, eh?) to strip the training data but keep the metadata from create_lagged_df(). A worked example with spark will accompany this functionality. This, along with any small bug fixes, will be the final change before v0.9.0 ships to CRAN.
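To make the idea concrete, here's a rough sketch of how that workflow could look once create_skeleton() lands. This is speculative, pre-release code, not the final API: it assumes create_skeleton() drops the rows of each horizon-specific data frame while keeping its attributes, that each data frame carries a "horizon" attribute from create_lagged_df(), and that train_model() will accept the skeleton. The .rds file names and the xgboost model are purely illustrative; the shipped example will use spark.

```r
# Sketch of the planned skeleton workflow -- assumptions noted above apply.
library(forecastML)
library(xgboost)

data("data_seatbelts", package = "forecastML")

# Build the direct-horizon lagged datasets as usual.
data_train <- create_lagged_df(data_seatbelts, type = "train", method = "direct",
                               outcome_col = 1, lookback = 1:12, horizons = c(1, 6, 12))

# Write each horizon's data to disk once; only the skeleton stays in memory.
for (i in seq_along(data_train)) {
  horizon_i <- attr(data_train[[i]], "horizon")  # assumed attribute name
  saveRDS(data_train[[i]], paste0("lagged_df_horizon_", horizon_i, ".rds"))  # hypothetical file naming
}

data_skeleton <- create_skeleton(data_train)  # metadata only, no training rows

windows <- create_windows(data_skeleton, window_length = 0)

# The user-defined model function receives the (empty) skeleton data and reads
# the real horizon-specific dataset from disk before training.
model_fun <- function(data) {
  horizon <- attr(data, "horizon")  # assumed attribute; adjust to the final API
  data_full <- readRDS(paste0("lagged_df_horizon_", horizon, ".rds"))
  x <- as.matrix(data_full[, -1, drop = FALSE])
  y <- data_full[[1]]
  xgboost(data = x, label = y, nrounds = 50,
          objective = "reg:squarederror", verbose = 0)
}

model_results <- train_model(data_skeleton, windows,
                             model_name = "xgb", model_function = model_fun)
```

The point of the design is that only the skeleton's metadata travels through the forecastML pipeline; the heavy lagged data is read inside the user-defined model function, one horizon at a time, from wherever it lives.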