Open cjnolet opened 9 years ago
I agree, this would be a great addition. Any chance this is something you'd be interested in contributing @cjnolet ?
Honestly, I wouldn't mind helping out and my company wouldn't have a problem letting me spend a few hours on it. I was talking to one of the Spark guys about time series data today. If we tie this API to dataframes, we may be able to design a completely generic solution and eventually contribute it directly to Spark. Is there any interest in that?
Awesome!
If it seems like it's a good fit, I'm not opposed to working on and possibly contributing back tight dataframe integration. My opinion is that the more stats-y time series modeling aspects of this project should probably stay outside of Spark, as it's already pretty bloated.
I am really interested in implementing ARIMA so I can try to help here where possible.
Hey @dsdinter, glad to hear about the interest! I know that @josepablocam is currently working on this. Perhaps you can find a way to split up tasks?
@dsdinter I jotted out a draft for a non-seasonal arima, but nothing complete enough worth sharing yet, so more than happy to coordinate something jointly. Were you thinking about seasonal/non-seasonal?
Hi @josepablocam & @sryza, I am actually looking at the non-seasonal one, similar to the Madlib ARIMA implementation (i.e. to forecast time series values): http://doc.madlib.net/master/group__grp__arima.html
@dsdinter sorry about the delayed reply. I'm planning on cleaning up the sketch I currently have for the non-seasonal arima tomorrow and will share with you and see how best to go forward.
Hi @josepablocam, no worries, looking forward to looking at your sketch and seeing where and how I can help. We could maybe each focus on one of the sections of ARIMA, i.e. AR vs MA terms (divide and conquer).
Actually there is already an AR module in the current package, maybe one should focus on the Differencing section and the other on the MA then.
@dsdinter I'll share what I have by EOD. I'm trying to get the parameters fitted with CSS to match up with the ones in R's stats::arima. I'll post regardless of success though. I think they might currently be off because a) I'm using a different optimization method (Commons Math3 BOBYQAOptimizer), and b) I'm using a different initial guess for the parameters.
@dsdinter I've pushed what I currently have for ARIMA. Current parameter fitting is done using conditional sum of squares, with the math3 BOBYQA optimizer (so no derivative provided). I think a lot of this needs to be reworked but wanted to avoid delaying sharing. You can see what's there so far on the arima branch of my fork. I quickly compared to what results from this stackoverflow question. Seems differences stem from initialization of parameters (along with the optimization method). I'll probably work on cleaning this up and then adding exogenous variables at some point this week.
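For readers following along, here is a minimal sketch of what a conditional sum of squares (CSS) objective looks like for an ARMA(1,1) with intercept. It is not the code on the arima branch: the function names `simulate_arma11` and `css_arma11` are hypothetical, the pre-sample error is assumed to be zero, and no optimizer is included (the thread uses BOBYQA, which only needs objective values like the one returned here).

```python
import random


def simulate_arma11(c, phi, theta, n, seed=0):
    """Simulate y_t = c + phi*y_{t-1} + theta*e_{t-1} + e_t (hypothetical helper)."""
    rng = random.Random(seed)
    y = [c / (1.0 - phi)]  # start at the unconditional mean
    noise = []
    prev_err = 0.0  # assumption: pre-sample error is zero
    for _ in range(1, n):
        err = rng.gauss(0.0, 1.0)
        y.append(c + phi * y[-1] + theta * prev_err + err)
        noise.append(err)
        prev_err = err
    return y, noise


def css_arma11(params, y):
    """Conditional sum of squares, conditioning on y[0] and a zero pre-sample error."""
    c, phi, theta = params
    prev_err = 0.0
    total = 0.0
    for t in range(1, len(y)):
        # One-step-ahead residual implied by the candidate parameters.
        err = y[t] - c - phi * y[t - 1] - theta * prev_err
        total += err * err
        prev_err = err
    return total


true_params = (1.0, 0.5, 0.3)
y, noise = simulate_arma11(*true_params, n=500)
css_true = css_arma11(true_params, y)
css_naive = css_arma11((0.0, 0.0, 0.0), y)
```

A derivative-free optimizer would minimize `css_arma11` over `(c, phi, theta)`. Different starting points or optimizers can land on slightly different estimates, which fits the discrepancies against R described above.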
@dsdinter @sryza I reworked what I currently have for the non-seasonal arima. Thinking about it a bit more, I'm not entirely sure exogenous variables should be added to this implementation, since it doesn't seem in keeping with the rest of the models in there so far (which are all functions of endogenous variables).
On another note, I've been comparing the parameter estimates vs R's arima, and results seem fine (as long as R's call uses "CSS" as well). The largest deviations tend to be in the intercept term. I'm going to take a closer look to see how R is initializing that.
@josepablocam apologies for the delay, I will be looking at this over the weekend as I have been quite busy at work.
Thanks for sharing!
@dsdinter no worries, I've been changing a lot of it, so actually probably best that you haven't taken a look yet
Not sure if you had the chance but I have been looking at how ARIMA got implemented in Madlib: https://github.com/madlib/madlib https://github.com/madlib/madlib/blob/master/src/modules/tsa/arima.cpp
It's C++ but the mathematical approach is there anyway.
@dsdinter @sryza I've pushed what I have so far to my repo at https://github.com/josepablocam/spark-timeseries/tree/arima I also added some tests. I've left removeTimeDependentEffects commented out, since what I was doing doesn't seem right to me, but wasn't clear what the right approach was. I've left the commented out code so you can see what I was doing though.
Tests included:
1. Fitting a time series generated by R's stats::arima.sim should result in parameters relatively close to those used to generate the series.
2. Sampling from a model and then fitting the sampled series should result in a similar model.
3. Fitting an ARIMA(p, d, q) to a series X should be equivalent to fitting an ARMA(p, q) to an X that has been differenced at order d.
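The third test relies on order-d differencing. As a rough illustration only, the sketch below shows that step and its inverse in plain Python. The names `difference` and `undifference` are hypothetical and are not the spark-timeseries API.

```python
def difference(series, d=1):
    """Apply first differencing d times; each pass shortens the series by one."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def undifference(diffed, heads):
    """Invert `difference`.

    heads[k] is the first value of the series after k differencing passes,
    for k = 0 .. d-1, so len(heads) is the differencing order d.
    """
    out = list(diffed)
    for head in reversed(heads):
        restored = [head]
        for step in out:
            restored.append(restored[-1] + step)  # cumulative sum from the head
        out = restored
    return out


series = [t * t for t in range(8)]  # a quadratic trend
twice_diffed = difference(series, d=2)
heads = [series[0], difference(series, d=1)[0]]
roundtrip = undifference(twice_diffed, heads)
```

Under this reading, an ARIMA(p, d, q) fit on `series` would fit an ARMA(p, q) to `difference(series, d)`, and forecasts on the differenced scale would be mapped back with something like `undifference`.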
Any more test suggestions are of course welcomed.
I will go through the madlib code this week. I haven't gone through it yet.
Mind submitting a pull request for the branch? It's fine if it's not in a final state, it will just be easier to comment on.
Done. Labeled as WIP: https://github.com/cloudera/spark-timeseries/pull/40
Is there any implementation going on for "Seasonal" ARIMA?
Hi @SupunS, there isn't currently anyone working on seasonal ARIMA.
Is the ARIMA model available for Java?
Hi @sryza, I am looking for a way to use a DataFrame with ARIMA. Any suggestions?
@SupunS, do you know if we will still be implementing s-ARIMA?
Please add ARIMAX to the Spark libraries, please please!
It would be massively useful to have a seasonal ARIMA model for time series analysis.