Improve kpss default lag for ndiffs

mitchelloharawild commented 5 years ago

Which lag is most appropriate? Old or new? Which gives best ARIMA performance on M3/M4?
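For context on what "lag" means here, a minimal sketch of how candidate KPSS truncation lags scale with series length. The three rules below are ones I believe appear in common KPSS implementations (the `lags = "short"` and `lags = "long"` options of `urca::ur.kpss`, plus a `3*sqrt(n)/13` rule). Mapping them to "old" and "new" in this issue is an assumption, and the helper name `kpss_lags` is hypothetical.

```r
# Hypothetical helper: candidate KPSS truncation lags for a series of length n.
kpss_lags <- function(n) {
  c(
    sqrt_rule = trunc(3 * sqrt(n) / 13),     # 3*sqrt(n)/13 rule
    short     = trunc(4 * (n / 100)^0.25),   # urca lags = "short"
    long      = trunc(12 * (n / 100)^0.25)   # urca lags = "long"
  )
}

lags_ap <- kpss_lags(length(AirPassengers))  # monthly series, n = 144
print(lags_ap)

# Optional: see how the KPSS statistic moves with the lag (needs urca).
if (requireNamespace("urca", quietly = TRUE)) {
  stats <- sapply(lags_ap, function(l) {
    urca::ur.kpss(log(AirPassengers), type = "mu", use.lag = l)@teststat
  })
  print(stats)
}
```

Because the statistic depends on the lag, the same series can fall on different sides of the critical value, which is what changes the number of differences selected.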

davidreilly007 commented 5 years ago

Hi Mitchell,

Further to our Twitter discussion, I thought it would be important, before doing M3 or M4 runs, to answer the question: must Ln Airline Passengers be (0,1,1) (0,1,1)? Because if so, you are constrained to ts short.

Running tsCV shows a slight edge for the (2,0,0) model. However, here's a relevant comment from a past discussion with Rob:

DR: Here is an interesting change as a result of the kpss lag changes on a classical data set. If you model the log AirPassengers (Series G) data using auto.arima, the model (non-seasonal part) changes from the well-known 0,1,1 to 2,0,1 with drift. If you then do a 24 month holdout and compare the two, the latter model is a better forecast!

RH: Interesting. I'm not sure it's good though. Only a seasonal difference will mean the trend is modelled as a drift term, which is not very adaptable. Two differences allow for local linear trends.

https://robjhyndman.com/hyndsight/show-me-the-evidence/#comment-3790363419
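To make RH's point concrete, here is a sketch of the two candidate models in backshift notation, using the convention that MA terms enter with a plus sign. The symbols $c$, $\phi_i$, $\theta$ and $\Theta$ are my notation, not taken from the thread.

```latex
% d = 0, D = 1, with drift: ARIMA(2,0,0)(0,1,1)_{12}
(1 - \phi_1 B - \phi_2 B^2)(1 - B^{12})\, y_t = c + (1 + \Theta B^{12})\,\varepsilon_t

% d = 1, D = 1, no constant: ARIMA(0,1,1)(0,1,1)_{12}
(1 - B)(1 - B^{12})\, y_t = (1 + \theta B)(1 + \Theta B^{12})\,\varepsilon_t
```

In the first model, assuming the AR part is stationary, the expected year-on-year change is the constant $c / (1 - \phi_1 - \phi_2)$, so the long-run trend slope is fixed. In the second, the extra difference lets the slope adapt to recent data.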

library(forecast)
ap <- ts(AirPassengers, start = c(1949, 1), frequency = 12)
log.AP <- log(ap)

# d = 0: ARIMA(2,0,0)(0,1,1)[12] with drift
fit0 <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 0),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = TRUE), h = h)
}
e0 <- tsCV(log.AP, fit0, h = 1)
rmse0 <- sqrt(mean(e0^2, na.rm = TRUE))
rmse0
[1] 0.03984144

# d = 1: ARIMA(0,1,1)(0,1,1)[12] without constant
fit1 <- function(x, h) {
  forecast(Arima(x, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = FALSE), h = h)
}
e1 <- tsCV(log.AP, fit1, h = 1)
rmse1 <- sqrt(mean(e1^2, na.rm = TRUE))
rmse1
[1] 0.03993203

Edit: There is a slight difference in the models mentioned in the discussion (2,0,1) vs (2,0,0), I think due to the change in stepwise.

Edit 2: With h=24, the d=0 model is also better.

# d = 0: ARIMA(2,0,0)(0,1,1)[12] with drift
fit0 <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 0),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = TRUE), h = h)
}
e0 <- tsCV(log.AP, fit0, h = 24)
rmse0 <- sqrt(mean(e0^2, na.rm = TRUE))
rmse0
[1] 0.08031493

# d = 1: ARIMA(0,1,1)(0,1,1)[12] without constant
fit1 <- function(x, h) {
  forecast(Arima(x, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = FALSE), h = h)
}
e1 <- tsCV(log.AP, fit1, h = 24)
rmse1 <- sqrt(mean(e1^2, na.rm = TRUE))
rmse1
[1] 0.0831755
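The h=24 figures pool squared errors across all 24 horizons. Since tsCV returns one column of errors per horizon when h > 1, a per-horizon breakdown may show where the two models diverge. A minimal sketch follows; `rmse_by_horizon` is a hypothetical helper, and the matrix is a toy stand-in for `e0` or `e1`, not real output.

```r
# Hypothetical helper: RMSE for each forecast horizon (one column per horizon).
rmse_by_horizon <- function(e) sqrt(colMeans(e^2, na.rm = TRUE))

# Toy error matrix standing in for tsCV output: 3 origins x 2 horizons.
# matrix() fills column by column, so the first three values are the h=1 errors.
e_toy <- matrix(c(3, -3, NA,
                  4, NA, -4),
                ncol = 2, dimnames = list(NULL, c("h=1", "h=2")))
rmse_toy <- rmse_by_horizon(e_toy)
print(rmse_toy)

# With the objects from the thread this would be, e.g.:
# rmse_by_horizon(e0) - rmse_by_horizon(e1)
```

A negative difference at a given horizon would favour the d=0 model at that horizon.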

davidreilly007 commented 5 years ago

On the other hand there is elegance in the simplicity of (0,1,1) (0,1,1), even if it is not the “best” model.

davidreilly007 commented 5 years ago

So here’s an interesting result. I ran M3 with default forecast 8.9 settings for auto.arima and got MASE 1.454.

Then I ran with d=1 and got 1.402!!

That’s the best auto.arima M3 number I’ve seen.

I'm not for a moment suggesting that’s a solution, but clearly the problem of over-differencing is not as bad as under-differencing!
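For anyone wanting to reproduce this kind of comparison, a minimal sketch of the MASE calculation for a single series. The `mase` helper and the toy numbers are illustrative assumptions. The benchmark lag `m` (1 for naive, the seasonal period for seasonal naive) is a choice, and the thread does not say which was used for the M3 figures.

```r
# Hypothetical helper: mean absolute scaled error for one series.
# train: in-sample data; actual, fc: out-of-sample values and forecasts;
# m: lag of the in-sample naive benchmark (assumption: 1 = plain naive).
mase <- function(train, actual, fc, m = 1) {
  scale <- mean(abs(diff(train, lag = m)))  # in-sample MAE of the benchmark
  mean(abs(actual - fc)) / scale
}

# Toy data: in-sample naive MAE is 1; forecast errors are 1 and 2.
toy_mase <- mase(train = 1:10, actual = c(11, 12), fc = c(10, 10))
print(toy_mase)

# With forecast installed, forcing one difference would look like:
# fc <- forecast::forecast(forecast::auto.arima(train, d = 1), h = 2)$mean
```

Averaging this quantity over all series in a collection gives a single summary number of the kind quoted here.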

tidyverts / feasts

Improve kpss default lag for ndiffs #42