sryza / spark-timeseries

A library for time series analysis on Apache Spark
Apache License 2.0

Create a TimeSeries-RDD #12

Closed kamir closed 9 years ago

kamir commented 9 years ago

Based on the idea of the TimeSeries Bucket, I suggest defining a "TimeSeriesRDD" for which several core functions are well defined, such as smoothing, filtering, "filling the gap", FFT, etc. The TimeSeriesRDD would abstract away the storage details and offer ways to project, aggregate, or even expand the dataset.

In some cases we need event data; in other cases the spectrum is required. Contextual normalization using the Time Resolved Relevance Index is an example of integrating additional structural information into the time series analysis procedure. Data wrangling is often not trivial, so it seems useful to have a set of primitive transformations already available as part of a specialized RDD.
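As a rough illustration of what such primitive transformations could look like on a single series, here is a minimal sketch in plain Java. The class `SeriesPrimitives` and both method names are hypothetical and are not part of spark-timeseries. It assumes a series is a `double[]` in which `NaN` marks a missing observation.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of spark-timeseries. NaN marks a missing observation. */
final class SeriesPrimitives {
    private SeriesPrimitives() {}

    /** "Filling the gap": linear interpolation of interior NaN runs; leading and trailing NaNs are left untouched. */
    static double[] fillLinear(double[] values) {
        double[] out = Arrays.copyOf(values, values.length);
        int prev = -1; // index of the last non-missing value seen so far
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                continue;
            }
            if (prev >= 0 && i - prev > 1) {
                double step = (out[i] - out[prev]) / (i - prev);
                for (int j = prev + 1; j < i; j++) {
                    out[j] = out[prev] + step * (j - prev);
                }
            }
            prev = i;
        }
        return out;
    }

    /** Smoothing: trailing moving average. The first (window - 1) outputs are NaN. Assumes the input has no NaNs. */
    static double[] movingAverage(double[] values, int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        double[] out = new double[values.length];
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            if (i >= window) {
                sum -= values[i - window]; // drop the value that left the window
            }
            out[i] = (i >= window - 1) ? sum / window : Double.NaN;
        }
        return out;
    }
}
```

A typical use would be to call `fillLinear` first and then pass the result to `movingAverage`, since the smoothing step here does not handle missing values itself.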

sryza commented 9 years ago

Is there a reference for the "TimeSeries Bucket" you're referring to?

I'm wondering whether we need a special RDD for this or just a special class. In general, the time series that show up in finance aren't large enough that they need to be distributed (observations taken hourly for 20 years come out to tens of thousands of elements). Of course we may want RDDs that contain many of these time series. Could a time series RDD confer any benefits that a time series class couldn't?
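To make the size estimate concrete, the following sketch works through the arithmetic. The figures of 252 trading days per year and 7 hourly observations per trading day are assumptions chosen for illustration and do not come from the comment above; the round-the-clock case is shown for comparison. The class name `SeriesSizeEstimate` is hypothetical.

```java
/** Hypothetical back-of-the-envelope estimate of how large a single series gets. */
final class SeriesSizeEstimate {
    private SeriesSizeEstimate() {}

    /** Number of observations in one series. */
    static long observations(int years, int daysPerYear, int observationsPerDay) {
        return (long) years * daysPerYear * observationsPerDay;
    }

    /** Size in bytes when each observation is stored as one 8-byte double. */
    static long bytesAsDoubles(long observations) {
        return observations * Double.BYTES;
    }

    public static void main(String[] args) {
        // Assumed: 252 trading days per year, 7 hourly observations per trading day.
        long tradingHours = observations(20, 252, 7);
        // Assumed: observations taken every hour of every day.
        long roundTheClock = observations(20, 365, 24);

        System.out.println(tradingHours);                  // prints 35280
        System.out.println(roundTheClock);                 // prints 175200
        System.out.println(bytesAsDoubles(roundTheClock)); // prints 1401600
    }
}
```

Under either assumption a single series stays well below a few megabytes, which fits comfortably in memory on one machine.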

cjnolet commented 9 years ago

IMO, you may want to provide utilities instead of a dedicated RDD. For instance, my time series analysis algorithms run on windowed DStreams, not RDDs.

It's funny I found this project because I am actually working on something very similar for time series analysis. Ultimately, my goal is to blur the line between streaming and batch.

kamir commented 9 years ago

I have been thinking about the problem again, especially about how to answer: "Could a time series RDD confer any benefits that a time series class couldn't?" My (current) answer is twofold: (A) there is no need for an additional RDD as long as the time series are defined by a time series class. This class can also hold metadata derived from the raw data, and as long as the datasets are not too large we can keep everything in one class. (B) Depending on the analysis type, however, it might be helpful to define specific RDDs just to handle metadata and derived data appropriately.

For now, I conclude, no need for an additional RDD.

kamir commented 9 years ago

In the meantime the project was renamed, cool. I did not notice this until I came back to the project. And the TimeSeriesRDD is also available now. This is great and makes my previous comment obsolete.

I think we can close the issue.

sryza commented 9 years ago

So I actually ended up going ahead and implementing a TimeSeriesRDD: https://github.com/cloudera/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/TimeSeriesRDD.scala. I haven't yet added any of the analysis functions you mentioned, but they could definitely be useful.

kamir commented 9 years ago

Great! I have been cleaning up the "Hadoop.TS" code and preparing it for a pull request on a fresh fork of the spark-timeseries project.

The first things I will get ready are two random time series generators and a simple "TimeSeriesBucket-Viewer". To collect some real-world series on the fly I use the "StockDataLoader". Everything is in Java, not Scala, but I hope this is not an issue.

Have a great weekend! Cheers, Mirko



Umair044 commented 8 years ago

The example Stocks.scala is good. Have you implemented it in Java as well?

sryza commented 8 years ago

Hi @Umair044, here is the Java implementation: https://github.com/sryza/spark-ts-examples/blob/master/jvm/src/main/java/com/cloudera/tsexamples/JavaStocks.java

Umair044 commented 8 years ago

Hi Sandy,

Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
    at org.apache.spark.sql.types.NativeType.<init>(dataTypes.scala:335)
    at org.apache.spark.sql.types.StringType.<init>(dataTypes.scala:348)
    at org.apache.spark.sql.types.StringType$.<init>(dataTypes.scala:364)
    at org.apache.spark.sql.types.StringType$.<clinit>(dataTypes.scala)

This exception occurs when I run your Java code. Which dependencies do I have to add to this project? I have added the following dependencies to pom.xml (groupId, artifactId, version):

org.apache.spark, spark-core_2.10, 1.0.0

org.apache.spark, spark-mllib_2.10, 1.0.2

org.apache.spark, spark-sql_2.11, 1.3.1

and manually added the following jar files to the Java project:

1- protobuf-java-2.5.0-spark.jar

2- scala-reflect-2.11.2.jar

3- sparkts-0.3.0-jar-with-dependencies.jar

4- spark-cassandra-connector-assembly-1.4.0-SNAPSHOT.jar

5- joda-time-2.0.jar


Umair044 commented 8 years ago

@sryza I need a little help from you.

sryza commented 8 years ago

@Umair044, it looks like you have artifacts with multiple versions of Scala (both 2.10 and 2.11), as well as multiple versions of Spark (1.0.0, 1.0.2, and 1.3.1). It's likely that one of these is causing the problem.
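The mismatch can be spotted mechanically from the artifact IDs, because Scala libraries carry their Scala binary version as a `_2.10` or `_2.11` suffix. The following is a minimal sketch of such a check; the class `DependencyCheck` and its methods are hypothetical helpers, not part of Maven or Spark, and the input is simply a map from artifact ID to version.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper that reports mixed Scala suffixes and mixed versions in a dependency list. */
final class DependencyCheck {
    private DependencyCheck() {}

    /** Returns the Scala binary version suffix of an artifact ID such as "spark-core_2.10", or null if there is none. */
    static String scalaSuffix(String artifactId) {
        int idx = artifactId.lastIndexOf('_');
        if (idx < 0 || idx == artifactId.length() - 1) {
            return null;
        }
        String suffix = artifactId.substring(idx + 1);
        return suffix.matches("\\d+\\.\\d+") ? suffix : null;
    }

    /** Distinct Scala binary versions found among the artifact IDs (the map keys). */
    static Set<String> scalaVersions(Map<String, String> artifactToVersion) {
        Set<String> found = new TreeSet<>();
        for (String artifactId : artifactToVersion.keySet()) {
            String suffix = scalaSuffix(artifactId);
            if (suffix != null) {
                found.add(suffix);
            }
        }
        return found;
    }

    /** Distinct versions declared for the artifacts (the map values). */
    static Set<String> declaredVersions(Map<String, String> artifactToVersion) {
        return new TreeSet<>(artifactToVersion.values());
    }
}
```

For the three Spark dependencies listed earlier in this thread, `scalaVersions` yields two entries and `declaredVersions` yields three. A consistent setup would yield exactly one entry from each.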

Umair044 commented 8 years ago

@sryza Which versions should I use with your code, so that I can study TimeSeriesRDD further?

Umair044 commented 8 years ago

@sryza Hello Sandy, I want to ask one thing: can I do data scaling/resampling with sparkts? If so, could you give me some pointers? Are there classes or methods for this in the sparkts library?
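For reference, the two operations asked about can be sketched in plain Java as follows. This is not the sparkts API, and this thread does not say whether sparkts provides equivalents. The class `ResampleSketch` and its methods are hypothetical, and "resampling" is taken here to mean downsampling by averaging fixed-size groups of consecutive observations.

```java
/** Hypothetical sketch of scaling and downsampling on a plain double[] series; not the sparkts API. */
final class ResampleSketch {
    private ResampleSketch() {}

    /** Downsamples by averaging consecutive groups of `factor` values; a shorter final group is averaged over its own length. */
    static double[] downsampleMean(double[] values, int factor) {
        if (factor <= 0) {
            throw new IllegalArgumentException("factor must be positive");
        }
        int buckets = (values.length + factor - 1) / factor;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int start = b * factor;
            int end = Math.min(start + factor, values.length);
            double sum = 0.0;
            for (int i = start; i < end; i++) {
                sum += values[i];
            }
            out[b] = sum / (end - start);
        }
        return out;
    }

    /** Min-max scaling to the range [0, 1]; a constant series maps to all zeros. */
    static double[] minMaxScale(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (range == 0.0) ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }
}
```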


sryza commented 7 years ago

It's preferable to ask questions like these on the mailing list, where anyone with relevant context might be able to chip in.