No, this only accounts for a small part of the uncertainty. The inherent residual uncertainty would be ignored.
If one considers the predictions from the different networks as a bootstrap estimate of the conditional expectation (arguable, since currently all the networks are fit to exactly the same data, just with differently initialized parameters), we could add, in a somewhat ad hoc manner, an estimate of the error term by bootstrapping the residuals from the fits.
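For instance, something rough along these lines (illustrative only, using the built-in lynx series as a stand-in; it just adds residuals resampled with replacement to the averaged point forecast, without worrying about how errors would compound over the horizon):
library(forecast)

set.seed(123)
fit <- nnetar(lynx)
res <- na.omit(residuals(fit))   # in-sample residuals from the averaged fit
h <- 10
fc <- forecast(fit, h = h)$mean  # averaged point forecast

## add resampled residuals to the point forecast as a crude error term
sims <- replicate(1000, fc + sample(res, h, replace = TRUE))
lower <- apply(sims, 1, quantile, probs = 0.025)
upper <- apply(sims, 1, quantile, probs = 0.975)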
Otherwise, for a more formally developed approach, how about implementing one of the bootstrap approaches presented here, since nnetar is essentially a nonlinear autoregressive model?
Their approach essentially follows the algorithm on pg. 7 of the paper. This should account for variability in both the errors and the model estimation. I should note that they present the algorithm for linear autoregressive models, but they do state (on pg. 17) that it applies in the same way to nonlinear models (with some caveats/concerns for forecasts beyond one-step-ahead).
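To make the refitting idea concrete, here is a crude sketch (this is not the paper's exact procedure, only an illustration of regenerating a series from resampled residuals, refitting, and forecasting; the dataset, B, p, and the interval level are arbitrary choices):
library(forecast)

set.seed(321)
y <- lynx
h <- 10
B <- 50   # bootstrap replicates (kept small; refitting networks is expensive)

fit <- nnetar(y, p = 3)
res <- na.omit(residuals(fit))
sims <- matrix(NA_real_, nrow = B, ncol = h)

for (b in seq_len(B)) {
  ## build a bootstrap series from the fitted values plus resampled residuals,
  ## refit the network on it, and forecast with a fresh error draw; the refit
  ## is what captures the model-estimation uncertainty
  y_star <- na.omit(fitted(fit) + sample(res, length(y), replace = TRUE))
  fit_star <- nnetar(ts(y_star), p = 3)
  sims[b, ] <- forecast(fit_star, h = h)$mean + sample(res, h, replace = TRUE)
}

pi_80 <- apply(sims, 2, quantile, probs = c(0.1, 0.9))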
In fact, this could be implemented as a more general function, although a number of the other forecast methods already have their own specific bootstrap code. Do you think this sounds like a useful thing to do?
The problem with Pan & Politis' algorithm is that it assumes stationarity, which will not necessarily be true for neural net models. They also spend a lot of effort on re-fitting models in order to take account of model uncertainty. I think we can get around a lot of that work with neural nets because the model uncertainty is already understood through the various networks with random starting values.
So something like this should do ok:
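A sketch along these lines (illustrative, using the lynx data; note the simplification that the resampled errors are added to each network's mean forecast rather than being fed back through the lagged inputs, which a proper step-by-step simulation would do):
library(forecast)

set.seed(1)
y <- lynx
h <- 10
npaths <- 1000
nnets <- 20   # pool of networks differing only in their random starting weights

fits <- lapply(seq_len(nnets), function(i) nnetar(y, p = 3, repeats = 1))
res <- na.omit(unlist(lapply(fits, residuals)))             # pooled residuals
means <- sapply(fits, function(f) forecast(f, h = h)$mean)  # h x nnets matrix

## each sample path: a randomly chosen network's forecast plus resampled errors
paths <- replicate(npaths,
                   means[, sample(nnets, 1)] + sample(res, h, replace = TRUE))
pi_95 <- apply(paths, 1, quantile, probs = c(0.025, 0.975))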
That is not as sophisticated as P&P, but should do a pretty good job of getting sensible prediction intervals, taking account of both residual variance and model uncertainty.
Yes, I think that would be very useful.
I agree that the Pan & Politis algorithm would be much more computationally expensive. The one part I'm not completely sure of is how much of the model uncertainty would be captured simply by using the neural networks fit with random starting values. How many iterations are run and how much regularization is used can affect their degree of convergence and stability. On the one hand, if they're iterated for long enough, they should all converge closer together (depending on how multimodal the space may be). On the other hand, iterating for too long might lead them to overfit.
If they converge to fairly similar values and all overfit together, using them to sample future paths would not reveal that the model is wrong. In that situation, re-fitting a new network with bootstrapped noise would better capture model overfitting.
If they're drastically different from each other, creating sample paths from the individual networks might not give a representative estimate of the uncertainty in the point forecast, since that value is an average of all of them.
I attached a couple of figures comparing how much they vary as the number of iterations and the weight decay (i.e. regularization) change for a sample dataset. Note that the default parameters of nnet are maxit=100 and decay=0. The blue lines are 100 different single-network predictions (not sampled future paths, just the forecast from each network). The exact parameters I chose to display are somewhat arbitrary, picked to show the different behaviors; the effects would show up at different points depending on the size and complexity of the problem. I added the code I used for these simple plots below.
Nevertheless, I think of these points more as "food for thought", since I agree that your simplified algorithm should be the starting point anyway. Once that's in place, we can test the performance more directly and reassess whether more is needed.
nn_variations_few-iter.pdf nn_variations_more-iter.pdf nn_variations_many-iter.pdf
library(forecast)
library(fpp)   # provides the usconsumption data set
set.seed(1234)
## split the 164 quarterly observations into training and test sets
ntrain <- 120
ntest <- 164 - ntrain
ts_train <- ts(usconsumption[1:ntrain, 1])
ts_test <- ts(usconsumption[(ntrain + 1):164, 1])
xtrain <- usconsumption[1:ntrain, 2]
xtest <- usconsumption[(ntrain + 1):164, 2]
##
iter <- 100    # number of repeated fits overlaid per panel
p <- 3         # number of lagged inputs
maxit <- 300   # nnet iterations per network (default is 100)
##
par(mfcol = c(2, 3))   # one column of plots per decay value
for (decay in c(0, 0.3, 1)) {
  ## top panel: no external regressor
  plot(ts_train, xlim = c(0, 165),
       main = paste0("without xreg, decay = ", decay, ", maxit = ", maxit))
  for (i in 1:iter) {
    nn_fit <- nnetar(ts_train, p = p, decay = decay, maxit = maxit)
    nn_fcast <- forecast(nn_fit, h = ntest)
    lines(nn_fcast$mean, col = "blue")
    lines((length(ts_train) + 1):164, ts_test)
  }
  ## bottom panel: with external regressor, a single network per fit
  plot(ts_train, xlim = c(0, 165),
       main = paste0("with xreg, decay = ", decay, ", maxit = ", maxit))
  for (i in 1:iter) {
    nnxreg_fit <- nnetar(ts_train, p = p, decay = decay, maxit = maxit,
                         xreg = xtrain, repeats = 1)
    nnxreg_fcast <- forecast(nnxreg_fit, xreg = xtest, h = ntest)
    lines(nnxreg_fcast$mean, col = "blue")
    lines((length(ts_train) + 1):164, ts_test)
  }
}
That doesn't look good! We will probably need to include model re-fitting.
Oops, I just noticed that for the top row of plots I left out the repeats argument (and thus used the default of 20). The lines aren't as tight with the correct value, but the general point still stands.
Since the nnetar object is an average of (by default) 20 individual neural nets, could we build prediction intervals for the final nnetar forecast using the predictions from each individual net? This would seem especially appropriate when repeats is large. We're basically creating everything we need for bootstrap prediction intervals right now but throwing it away when we wrap it up into a point forecast.
Edit: several approaches are described here: http://alumnus.caltech.edu/~amir/pred-intv-2.pdf
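A rough sketch of that idea (approximating the per-network forecasts by refitting with repeats=1 rather than pulling the individual nets out of one nnetar object; this only reflects the spread due to random starting weights, not the residual variance):
library(forecast)
library(fpp)   # usconsumption

set.seed(42)
y <- ts(usconsumption[1:120, 1])
h <- 12
B <- 100   # number of single-network fits

## point forecast from B independently initialized single networks
fc_mat <- replicate(B, forecast(nnetar(y, p = 3, repeats = 1), h = h)$mean)

## empirical 80% interval across the networks
lower <- apply(fc_mat, 1, quantile, probs = 0.1)
upper <- apply(fc_mat, 1, quantile, probs = 0.9)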