Open mitchelloharawild opened 5 years ago
Hi Mitchell,
Further to our Twitter discussion, I thought it would be important, before doing M3 or M4 runs, to answer one question: must log AirPassengers be modelled as (0,1,1)(0,1,1)? If so, you are constrained to ts short.
Running tsCV shows a slight edge for the (2,0,0) model. However here's a relevant comment from a past discussion with Rob:
DR: Here is an interesting change as a result of the kpss lag changes on a classical data set. If you model the log AirPassengers (Series G) data using auto.arima, the model (non-seasonal part) changes from the well known 0,1,1 to 2,0,1 with drift. If you then do a 24 month holdout and compare the two, the latter model is a better forecast!
RH: Interesting. I'm not sure it's good though. Only a seasonal difference will mean the trend is modelled as a drift term, which is not very adaptable. Two differences allows for local linear trends.
https://robjhyndman.com/hyndsight/show-me-the-evidence/#comment-3790363419
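The 24-month holdout comparison DR describes can be sketched roughly as below. This is not the code from that discussion: the seasonal part (0,1,1), the split at the end of 1958, the `method = "ML"` setting (used only to sidestep CSS initialisation failures) and the helper `holdout_rmse` are all my assumptions.

```r
library(forecast)

log.AP <- log(AirPassengers)

# Hold out the final 24 months (1959-1960) for evaluation.
train <- window(log.AP, end = c(1958, 12))
test  <- window(log.AP, start = c(1959, 1))

# d = 0 model from the discussion: (2,0,1)(0,1,1)[12] with drift.
# The seasonal part (0,1,1) is assumed, not stated in the quote.
fit_drift <- Arima(train, order = c(2, 0, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12),
                   include.drift = TRUE, method = "ML")

# Classical airline model: (0,1,1)(0,1,1)[12], no constant.
fit_airline <- Arima(train, order = c(0, 1, 1),
                     seasonal = list(order = c(0, 1, 1), period = 12),
                     method = "ML")

# Hypothetical helper: RMSE of the point forecasts over the holdout (log scale).
holdout_rmse <- function(fit) {
  fc <- forecast(fit, h = length(test))
  sqrt(mean((test - fc$mean)^2))
}

rmse <- c(drift = holdout_rmse(fit_drift), airline = holdout_rmse(fit_airline))
rmse
```

A single holdout depends heavily on where the split falls, which is why the tsCV comparison is the more informative of the two.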
    library(forecast)
    ap <- ts(AirPassengers, start = c(1949, 1), frequency = 12)
    log.AP <- log(ap)

    # d = 0: (2,0,0)(0,1,1)[12] with constant
    fit0 <- function(x, h) {
      forecast(Arima(x, order = c(2, 0, 0),
                     seasonal = list(order = c(0, 1, 1), period = 12),
                     include.constant = TRUE), h = h)
    }
    e0 <- tsCV(log.AP, fit0, h = 1)
    rmse0 <- sqrt(mean(e0^2, na.rm = TRUE))
    rmse0
    #> [1] 0.03984144

    # d = 1: (0,1,1)(0,1,1)[12] without constant
    fit1 <- function(x, h) {
      forecast(Arima(x, order = c(0, 1, 1),
                     seasonal = list(order = c(0, 1, 1), period = 12),
                     include.constant = FALSE), h = h)
    }
    e1 <- tsCV(log.AP, fit1, h = 1)
    rmse1 <- sqrt(mean(e1^2, na.rm = TRUE))
    rmse1
    #> [1] 0.03993203
Edit: The model mentioned in the discussion, (2,0,1), differs slightly from the (2,0,0) used here, I think because of the change in stepwise.
Edit 2: With h=24, d=0 is also better.
    # d = 0: (2,0,0)(0,1,1)[12] with constant
    fit0 <- function(x, h) {
      forecast(Arima(x, order = c(2, 0, 0),
                     seasonal = list(order = c(0, 1, 1), period = 12),
                     include.constant = TRUE), h = h)
    }
    e0 <- tsCV(log.AP, fit0, h = 24)
    rmse0 <- sqrt(mean(e0^2, na.rm = TRUE))
    rmse0
    #> [1] 0.08031493

    # d = 1: (0,1,1)(0,1,1)[12] without constant
    fit1 <- function(x, h) {
      forecast(Arima(x, order = c(0, 1, 1),
                     seasonal = list(order = c(0, 1, 1), period = 12),
                     include.constant = FALSE), h = h)
    }
    e1 <- tsCV(log.AP, fit1, h = 24)
    rmse1 <- sqrt(mean(e1^2, na.rm = TRUE))
    rmse1
    #> [1] 0.0831755
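With h=24, tsCV returns a matrix of errors (one column per horizon), so the single RMSE above pools all 24 horizons. A rough sketch of splitting it out by horizon is below. The helper `rmse_by_h` and the names `fit_d0`/`fit_d1` are mine, and I have not checked at which horizons the two models actually separate.

```r
library(forecast)

log.AP <- log(AirPassengers)

# Same two models as in the comparison, wrapped for tsCV().
fit_d0 <- function(x, h) {
  forecast(Arima(x, order = c(2, 0, 0),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = TRUE), h = h)
}
fit_d1 <- function(x, h) {
  forecast(Arima(x, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12),
                 include.constant = FALSE), h = h)
}

# Hypothetical helper: RMSE per column of the error matrix (one per horizon),
# ignoring the NA rows tsCV() leaves where no error could be computed.
rmse_by_h <- function(e) sqrt(colMeans(e^2, na.rm = TRUE))

rmse_tab <- cbind(d0 = rmse_by_h(tsCV(log.AP, fit_d0, h = 24)),
                  d1 = rmse_by_h(tsCV(log.AP, fit_d1, h = 24)))
round(rmse_tab, 4)
```

If the d=0 advantage sits mostly at the long horizons, that would bear directly on Rob's point about how adaptable the trend is.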
On the other hand there is elegance in the simplicity of (0,1,1) (0,1,1), even if it is not the “best” model.
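For reference, the two models written out, as I understand the parameterisation used by forecast::Arima (MA terms with plus signs, drift entering as a linear time trend with slope \beta). Here y_t is log AirPassengers and B is the backshift operator; the symbols are my notation, not taken from the discussion.

```latex
% Airline model, ARIMA(0,1,1)(0,1,1)_{12}, no constant: two parameters
(1 - B)(1 - B^{12})\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\, \varepsilon_t

% ARIMA(2,0,0)(0,1,1)_{12} with drift: four parameters
(1 - \phi_1 B - \phi_2 B^2)\,\bigl[(1 - B^{12})\, y_t - 12\beta\bigr]
  = (1 + \Theta_1 B^{12})\, \varepsilon_t
```

In the second model the year-on-year change reverts to the fixed value 12\beta, which is the "drift term" Rob refers to. The first model has no such fixed slope, only two parameters, and both level and slope can wander.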
So here’s an interesting result. I ran M3 with default forecast 8.9 settings for auto.arima and got MASE 1.454.
Then I ran with d=1 and got 1.402!!
That’s the best auto.arima M3 number I’ve seen.
I’m not for a moment suggesting that’s a solution, but clearly, on M3 at least, overdifferencing is not as costly as underdifferencing!
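For anyone wanting to repeat this kind of run, a minimal sketch is below. It assumes the M3 data come from the Mcomp package and that MASE is averaged per series via accuracy(). The helper `mean_mase` is hypothetical and this is not the script that produced the numbers above. It uses only the first five series to keep it quick.

```r
library(forecast)
library(Mcomp)  # assumption: M3 series taken from the Mcomp package

# Hypothetical helper: mean out-of-sample MASE of auto.arima over a list of
# Mcomp series. Extra arguments (e.g. d = 1) are passed on to auto.arima().
mean_mase <- function(series_list, ...) {
  mase <- sapply(series_list, function(s) {
    fit <- auto.arima(s$x, ...)          # s$x: training data
    fc  <- forecast(fit, h = s$h)        # s$h: required forecast horizon
    accuracy(fc, s$xx)["Test set", "MASE"]  # s$xx: held-out future values
  })
  mean(mase)
}

# Small subset for illustration; replace with M3 for the full 3003 series.
some_series <- M3[1:5]

res <- c(default = mean_mase(some_series),
         d1      = mean_mase(some_series, d = 1))
res
```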
Which lag is most appropriate? Old or new? Which gives best ARIMA performance on M3/M4?
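To see how much the lag matters for this one series, here is a rough sketch that computes the KPSS statistic under two truncation-lag rules using urca::ur.kpss directly. The two rules shown are my assumption about which lags are being compared, and the names `rule_sqrt`/`rule_quarter` are just labels. The test is applied to the seasonally differenced series because, as I understand auto.arima, d is chosen after D has been applied.

```r
library(urca)  # ur.kpss(); urca is a dependency of forecast

log.AP <- log(AirPassengers)

# Seasonally differenced series (D = 1 already applied).
x <- diff(log.AP, lag = 12)
n <- length(x)

# Two candidate truncation-lag rules (assumed to be the ones in question).
lags <- c(rule_sqrt    = trunc(3 * sqrt(n) / 13),
          rule_quarter = trunc(4 * (n / 100)^0.25))

# KPSS level-stationarity statistic under each lag choice.
stats <- sapply(lags, function(l) ur.kpss(x, type = "mu", use.lag = l)@teststat)

# 5% critical value for the level test, read from the fitted object.
crit5 <- ur.kpss(x, type = "mu", use.lag = lags[[1]])@cval[1, "5pct"]

# Rejecting stationarity at 5% would point towards d = 1.
kpss_tab <- rbind(lag = lags, statistic = stats, reject_5pct = stats > crit5)
kpss_tab
```

If the two rules land on opposite sides of the critical value, that alone would explain the switch between d=1 and d=0 for this series.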