sryza / spark-timeseries

A library for time series analysis on Apache Spark
Apache License 2.0

Current Arima Implementation Analysis #128

Open namrataghadi opened 8 years ago

namrataghadi commented 8 years ago

Hello,

I am trying to use ARIMA for a forecasting project. The ARIMA implementation in spark-ts appears to have the following issues:

1] Input series: x = c(1.0,1.0,1.0,1.0)
Spark output: infinite loop. It goes into an infinite loop while calculating the AIC term.
R output: Intercept = 1 and AIC = Inf

2] Input series: x = c(1.0,2.0,3.0,4.0,5.0)
Spark output: ARIMA model -> (1,0,0), Intercept = 3, AR term = -0.99999
R output: ARIMA model -> (0,0,0), intercept -> 1.5000

3] Input series: x = c(1.0,2.0,31.0,41.0,23.0,5.0,24.0,31.0)
Spark output: ARIMA model -> (2,0,0), Intercept = 33.262567082534765, AR1 term = 0.21581672255253825, AR2 term = -0.6967797042410527
R output: ARIMA model -> (0,0,0)

Also, there is no support for external regressors like there is for ARIMA in R. For example, in R I am able to do auto.arima(ts_x, xreg=ts_y).

Please advise.

sryza commented 8 years ago

Hi @namrataghadi . Thanks for reporting these issues. The auto ARIMA implementation in spark-ts is new, and thus pretty immature compared to R's offering. One limitation is that right now the spark-ts implementation only approximates likelihood through conditional sum of squares, while I believe R computes it directly.
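To make the conditional-sum-of-squares idea concrete, here is a minimal sketch in plain Python. It is not the spark-ts code: the AR(1) form, the intercept parameterization, and the function names are assumptions chosen for illustration. It also shows why a constant series such as the one in case 1] is awkward: the residual variance is zero, so the Gaussian log-likelihood (and any AIC built on it) is not finite.

```python
import math

def ar1_css(series, intercept, phi):
    """Conditional sum of squares for x_t = intercept + phi * x_{t-1} + e_t.

    Conditions on the first observation, so only n - 1 residuals are used.
    """
    residuals = [
        series[t] - intercept - phi * series[t - 1]
        for t in range(1, len(series))
    ]
    return sum(e * e for e in residuals)

def css_log_likelihood(css, n_residuals):
    """Gaussian log-likelihood with sigma^2 replaced by css / n_residuals."""
    sigma2 = css / n_residuals
    if sigma2 <= 0.0:
        # A perfect fit has zero residual variance, so log(sigma2) is
        # undefined and the log-likelihood diverges to +infinity.
        return math.inf
    return -0.5 * n_residuals * (math.log(2.0 * math.pi * sigma2) + 1.0)

# Constant series from case 1]: every residual is exactly zero.
constant = [1.0, 1.0, 1.0, 1.0]
css = ar1_css(constant, intercept=1.0, phi=0.0)
loglik = css_log_likelihood(css, len(constant) - 1)
```

An exact-likelihood fit also accounts for the distribution of the first observations, which this conditional version ignores; on very short series like the ones reported above, that difference can matter.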

Curious - have you tried fitting models of the same orders (i.e. without auto) and seeing whether they spit out the same coefficients?

Regarding multiple regressors, there's an early implementation of that here: https://github.com/sryza/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/models/RegressionARIMA.scala but it's not yet integrated with the autofitting functionality.

abafo22 commented 5 years ago

I am using sparkts to run an ARIMA (autofit) model. I want to perform the Ljung–Box test on the training data to check how well my model fits. The Ljung–Box test is applied to the residuals of a fitted ARIMA model, not to the original series.

The Box-Jenkins approach to time series (ARIMA estimation) is about fitting the data until the residuals are white noise. That's how you know that you have built an appropriate model. The test gives a numerical check of whether the residuals still show autocorrelation, that is, whether they can be distinguished from white noise.

In sparkts, the Ljung–Box test is a method (function) of the TimeSeriesStatisticalTests class. It takes the residuals of the model and the maximum lag as its parameters, and returns the Q statistic and the p-value.

lbtest(residuals: Vector, maxLag: Int): (Double, Double)
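For reference, the statistic behind a function with this shape can be sketched in plain Python. This is a library-independent illustration, not the sparkts implementation: the helper names are hypothetical, and using `max_lag` as the chi-square degrees of freedom is an assumption (some references subtract the number of fitted ARMA parameters when the input is model residuals).

```python
import math

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0.0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return num / denom

def chi2_sf(x, df):
    """Upper tail P(X > x) of a chi-square distribution with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function.
    """
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while abs(term) > 1e-15 * abs(total) and k < 10000:
        term *= z / (a + k)
        total += term
        k += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def ljung_box(residuals, max_lag):
    """Return (Q, p_value) with Q = n (n + 2) * sum_{k=1..max_lag} r_k^2 / (n - k)."""
    n = len(residuals)
    q = n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )
    return q, chi2_sf(q, max_lag)

# Strongly alternating "residuals" are clearly not white noise.
q, p = ljung_box([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=1)
```

A small p-value, as in this alternating example, means the residuals are still autocorrelated and the model has left structure unexplained.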

My question is: how can I calculate the residuals from the autofit ARIMA model in sparkts?
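One library-independent way to obtain residuals is to replay the ARMA recursion by hand once the fitted order and coefficients are known. The sketch below is in plain Python and rests on assumptions, not on the sparkts API: the function names are hypothetical, the intercept is taken to be the constant term of the recursion, MA terms are added with a plus sign in the model, and pre-sample errors are set to zero as in a conditional-sum-of-squares fit. If the library uses a different convention, the signs or the intercept need adjusting.

```python
def difference(series, d):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out

def arma_residuals(y, intercept, ar, ma):
    """Residuals of y_t = c + sum_i ar[i] * y_{t-1-i} + sum_j ma[j] * e_{t-1-j} + e_t.

    The first len(ar) observations are used only as starting values, and
    errors before the first computed residual are treated as zero.
    """
    p = len(ar)
    errors = [0.0] * len(y)
    for t in range(p, len(y)):
        ar_part = sum(ar[i] * y[t - 1 - i] for i in range(p))
        ma_part = sum(
            ma[j] * errors[t - 1 - j]
            for j in range(len(ma))
            if t - 1 - j >= 0
        )
        errors[t] = y[t] - intercept - ar_part - ma_part
    return errors[p:]

# Illustrative values only: an ARIMA(1,1,0) with made-up coefficients.
series = [1.0, 2.0, 4.0, 7.0, 11.0]
diffed = difference(series, 1)  # [1.0, 2.0, 3.0, 4.0]
resid = arma_residuals(diffed, intercept=1.0, ar=[0.5], ma=[])
```

The resulting list is what a Ljung–Box test expects as its first argument, with the differencing order d applied to the training series before the recursion.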