mlr-org / mlr3temporal

Forecasting for mlr3
https://mlr3temporal.mlr-org.com
GNU Lesser General Public License v3.0

prediction does not work when training on whole dataset #39

Closed corneliagru closed 4 years ago

corneliagru commented 4 years ago

When the learner is trained on all rows, prediction is no longer possible. The call fails with `Error in learner$predict(task) : No timesteps left for prediction`.

```r
library(mlr3)
library(mlr3temporal)

# autoarima
learner = LearnerRegrForecastAutoArima$new()
task = mlr_tasks$get("airpassengers")
learner$train(task)
learner$predict(task)

# var
task = tsk("petrol")
learner = LearnerRegrForecastVAR$new()
learner$train(task)
p = learner$predict(task)
```

corneliagru commented 4 years ago

I think we should also expose arguments for the prediction function, to give the user more flexibility, e.g. `h`, `level` and so on for `forecast()`.

johannes9522 commented 4 years ago

During the $predict call forecasts are made starting from the last timestamp in the training set. If the model was trained on the whole dataset, it is not clear what exactly should happen when calling $predict. Do we want to return the fitted values of the model?

johannes9522 commented 4 years ago

Right now the forecast horizon for the prediction is set by the row_ids argument in the predict method. Confidence Intervals can be computed with a helper function from the standard errors. Are there more useful arguments for the prediction?
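As an illustration of the point about standard errors, here is a minimal sketch of such a helper in base R. The name `interval_from_se` is hypothetical (it is not the mlr3temporal helper), and it assumes normally distributed forecast errors; the model is a plain `stats::arima` fit rather than an mlr3temporal learner.

```r
# Hypothetical helper (not part of mlr3temporal): normal-approximation
# interval computed from a point forecast and its standard error.
interval_from_se = function(mean, se, level = 0.95) {
  z = qnorm(1 - (1 - level) / 2)  # two-sided normal quantile
  data.frame(mean = mean, lower = mean - z * se, upper = mean + z * se)
}

# Fit a simple AR(1) model on a built-in series and forecast 3 steps ahead.
fit = arima(mdeaths, order = c(1, 0, 0))
fc = predict(fit, n.ahead = 3)  # list with $pred (forecasts) and $se

ci = interval_from_se(as.numeric(fc$pred), as.numeric(fc$se), level = 0.95)
print(ci)
```

A `level` argument like this one is the kind of option that could be passed through at prediction time.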

corneliagru commented 4 years ago

> Right now the forecast horizon for the prediction is set by the row_ids argument in the predict method. Confidence Intervals can be computed with a helper function from the standard errors. Are there more useful arguments for the prediction?

all the arguments for the forecast function can be found here https://cran.r-project.org/web/packages/forecast/forecast.pdf on page 47

I think it should be possible to adjust all the arguments

corneliagru commented 4 years ago

> During the $predict call forecasts are made starting from the last timestamp in the training set. If the model was trained on the whole dataset, it is not clear what exactly should happen when calling $predict. Do we want to return the fitted values of the model?

So the general problem is that, when using the underlying packages directly, it is possible to do something like this:

```r
library(forecast)
library(tsbox)

mdeaths
mod = auto.arima(mdeaths)
forecast(mod)
```

So you train your model on all observations, and it is still possible to predict values afterwards.

pfistfl commented 4 years ago
  1. predict

    > Do we want to return the fitted values of the model?

I think this would be very cool! This would allow us to see how good our model fits the training data.

Open Questions:

  1. forecast

So the difference is that `forecast` simply forecasts starting from the last time-point in the training data, while we obtain this information from the test data.

Through $predict we basically only support predicting data we already have, as $predict always expects there to be data.

An experimental idea we might want to try is the following:

```r
.$forecast = function(horizon = 5L) {
  # 1. Get the last training time-point in the data
  # 2. Create "artificial" data that has observations for `last_train_time` + horizon rows
  # 3. Call self$predict_internal on this.
}
```

Open question: What would that look like with exogenous variables etc.?

This would allow me to call lrn$forecast(5) and get predictions for e.g. 5 days in the future.
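The three steps of the proposed method can be sketched outside of mlr3temporal in base R. This is only an illustration under assumptions: `forecast_ahead` is a hypothetical stand-in for `lrn$forecast()`, it refits a fixed AR(1) model with `stats::arima` instead of reusing a trained learner, and it only handles a regular `ts` series without exogenous variables.

```r
# Hypothetical stand-in for the proposed lrn$forecast(horizon).
forecast_ahead = function(series, horizon = 5L) {
  fit = arima(series, order = c(1, 0, 0))

  # 1. Get the last training time-point in the data.
  last_train_time = tsp(series)[2]
  step = 1 / frequency(series)

  # 2. Create the "artificial" future time index with `horizon` rows.
  future_time = last_train_time + step * seq_len(horizon)

  # 3. Predict for those future rows.
  pred = predict(fit, n.ahead = horizon)
  data.frame(
    time = future_time,
    response = as.numeric(pred$pred),
    se = as.numeric(pred$se)
  )
}

fc = forecast_ahead(mdeaths, horizon = 5L)
print(fc)
```

With exogenous variables this sketch would not be enough: `predict()` for `stats::arima` models needs future regressor values via `newxreg`, so the artificial data would have to supply them.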

johannes9522 commented 4 years ago

#47 should fix the issues when predicting on row_ids that have been used for training. This means predictions will now return fitted values for those row_ids.
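To make "fitted values for training rows" concrete, here is a small base R sketch. It does not show the mlr3temporal implementation; it assumes a `stats::arima` model, for which the in-sample fitted values can be recovered as the observed series minus the one-step-ahead residuals.

```r
# Train on the whole series.
fit = arima(mdeaths, order = c(1, 0, 0))

# In-sample fitted values: observed minus one-step-ahead residuals.
fitted_values = mdeaths - residuals(fit)

# Training error, i.e. how well the model fits the data it has seen.
rmse_train = sqrt(mean(residuals(fit)^2))

head(cbind(observed = mdeaths, fitted = fitted_values))
```

Scoring these values measures the fit on the training data, not forecasting performance.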

For actual forecasting (beyond the last timestamp in the training data): Does it make sense to store the forecasts in a Prediction object with "fake" data? The truth for those forecasts is not available, so prediction$score will be misleading.

pfistfl commented 4 years ago

Well, we cannot score it, but I guess the purpose of any forecasting model is to eventually forecast unseen data. As a result, I guess it is OK to create fake data, in the sense that we only extend the date column.
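A minimal sketch of what extending only the date column could look like, using a plain data frame rather than an mlr3 task. The column names and values are made up for illustration, and the constant forecast is just a placeholder for a model's output.

```r
# Illustrative training data (values are invented).
train = data.frame(
  date = seq(as.Date("2020-01-01"), by = "day", length.out = 10),
  target = c(5, 6, 7, 6, 8, 9, 8, 10, 11, 10)
)

# "Fake" future rows: only the date column is extended, the truth is NA.
horizon = 5L
future = data.frame(
  date = max(train$date) + seq_len(horizon),
  target = NA_real_
)
extended = rbind(train, future)

# Placeholder forecasts standing in for a model's output.
response = rep(mean(train$target), horizon)

# Any error measure against the missing truth is NA, so it cannot be scored.
mse = mean((future$target - response)^2)
print(mse)
```

Keeping the truth as `NA` rather than a dummy number makes it obvious that a score on these rows is undefined instead of silently misleading.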

corneliagru commented 4 years ago

Fixed in PR #47 by Johannes