robjhyndman / forecast

Forecasting Functions for Time Series and Linear Models
http://pkg.robjhyndman.com/forecast

Scaling inputs in 'nnetar' #235

Closed gabrielcaceres closed 8 years ago

gabrielcaceres commented 8 years ago

Currently, nnetar scales the input series by dividing by its largest absolute value:

  # Scale data
  scale <- max(abs(xx),na.rm=TRUE)
  xx <- xx/scale

and xreg is not currently being scaled (as mentioned in #205) but I'm planning on adding that.
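For reference, a minimal sketch of what the same max-abs scaling could look like for xreg, applied column by column. The toy matrix and the names `xreg_scale` and `xreg_scaled` are made up for illustration; this is not code from the package.

```r
# Toy regressor matrix; the values are invented for illustration only.
xreg <- cbind(temp = c(10, -5, 20), promo = c(0, 1, 1))

# Largest absolute value of each column (hypothetical per-column analogue
# of the scalar `scale` used for the series).
xreg_scale <- apply(abs(xreg), 2, max, na.rm = TRUE)

# Divide each column by its own scale factor.
# Note: an all-zero column would give a zero divisor and needs special handling.
xreg_scaled <- sweep(xreg, 2, xreg_scale, "/")
```

The per-column factors would have to be stored with the fitted model so that any new xreg passed at forecast time is scaled the same way.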

I was wondering if you had a preference for how the scaling is done.

Looking into this, I've come across some references that suggest, for numerical conditioning, either standardizing the inputs or scaling them to [-1,1], and argue against [0,1].

From here and here:

There is a common misconception that the inputs to a multilayer perceptron must be in the interval [0,1]. There is in fact no such requirement, although there often are benefits to standardizing the inputs as discussed below. But it is better to have the input values centered around zero, so scaling the inputs to the interval [0,1] is usually a bad choice.

and from here

In general, any shift of the average input away from zero will bias the updates in a particular direction and thus slow down learning. Therefore, it is good to shift the inputs so that the average over the training set is close to zero [...] Convergence is faster not only if the inputs are shifted as described above but also if they are scaled so that all have about the same covariances.

Although these are somewhat old references, I found them when linked in more recent stackoverflow and stackexchange questions.
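To make the candidates concrete, here is a small sketch of the four scalings under discussion applied to a toy series. The data and variable names are made up for illustration; only base R is used.

```r
# Toy series; values are invented for illustration only.
x <- c(12, 15, 11, 18, 25, 21, NA, 30)

# Current approach: divide by the largest absolute value.
max_abs <- x / max(abs(x), na.rm = TRUE)

# Standardize: subtract the mean, divide by the standard deviation.
standardized <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Min-max scaling to [0, 1].
rng <- range(x, na.rm = TRUE)
unit <- (x - rng[1]) / (rng[2] - rng[1])

# Min-max scaling to [-1, 1].
symmetric <- 2 * unit - 1

# Compare the mean of each scaled version.
means <- sapply(
  list(max_abs = max_abs, standardized = standardized,
       unit = unit, symmetric = symmetric),
  mean, na.rm = TRUE
)
round(means, 3)
```

Only the standardized version has mean zero by construction. For an all-positive series like this one, the max-abs scaling lands in (0, 1], so it behaves much like the [0,1] option that the references above argue against.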

In contrast, Venables and Ripley (2002), the reference provided in the nnet package, seems to argue towards a [0,1] scaling on page 245 when describing the use of weight decay for regularization:

Weight decay, specific to neural networks, uses as penalty the sum of squares of the weights w_ij. (This only makes sense if the inputs are rescaled to range about [0, 1] to be comparable with the outputs of internal units.)

and they also scale inputs to [0,1] in one of their examples.

I'm leaning towards standardizing the inputs, and also modifying the scaling of the original series for consistency (perhaps with an optional argument in the nnetar call for whether to center/scale?). In my (limited) experience, this performs well including when weight decay is used.
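As a rough sketch of what I have in mind (not a final implementation), the optional centering and scaling could be built on base R's `scale()`. The helper names `scale_series` and `unscale_series` are hypothetical and do not exist in the package.

```r
# Hypothetical helper, not part of the forecast package.
# Optionally centers and/or scales a series and keeps the parameters
# needed to map values back to the original units.
scale_series <- function(x, center = TRUE, scale = TRUE) {
  z <- base::scale(as.numeric(x), center = center, scale = scale)
  ctr <- attr(z, "scaled:center")  # absent (NULL) when center = FALSE
  scl <- attr(z, "scaled:scale")   # absent (NULL) when scale = FALSE
  list(
    x = as.numeric(z),
    center = if (is.null(ctr)) 0 else ctr,
    scale = if (is.null(scl)) 1 else scl
  )
}

# Hypothetical inverse: map scaled values back to the original units.
unscale_series <- function(z, params) {
  z * params$scale + params$center
}

# Usage on a toy series (values invented for illustration only).
x <- c(12, 15, 11, 18, 25, 21, NA, 30)
s <- scale_series(x)               # standardize
back <- unscale_series(s$x, s)     # recovers x, with the NA preserved
```

One caveat: with `center = FALSE` and `scale = TRUE`, `scale()` divides by the root mean square rather than the standard deviation. The same approach would work column-wise for xreg, as long as the stored center and scale are reused for any new data at forecast time.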

Any thoughts on how it should be implemented? Should it keep scaling by the maximum absolute value as the current code does, or switch to standardizing, [0,1] scaling, or [-1,1] scaling?

robjhyndman commented 8 years ago

I've no idea. I have very limited experience with neural nets.