sryza / spark-timeseries

A library for time series analysis on Apache Spark
Apache License 2.0
1.19k stars, 424 forks

Documentation should include comprehensive examples of how to use the different constructs #34

Open cjnolet opened 9 years ago

cjnolet commented 9 years ago

I'm looking through the code to figure out how to use the DenseVector and the UniformTimeIndex, for instance, and I've had to parse quite a bit of code to figure them out.

I think it would be massively useful to establish a good user manual or some type of documentation module now, while the project is still young, rather than trying to do it later when there are many more algorithms.

sryza commented 9 years ago

Definitely agree that this is needed

dashrathc commented 9 years ago

Please add more examples and also suggest how to visualize the results. That would make the library easier to understand.

mattweyant commented 9 years ago

Agreed. Having some way to visualize would be a big win.

sryza commented 9 years ago

Thanks for the suggestions @dashrathc and @mattweyant. When you say "how to visualize it", are you talking about tools for visualizing time series, or visualizations of the different constructs to help illuminate the API?

dashrathc commented 9 years ago

@sryza thanks for the quick response.

I am using the library in Zeppelin, so let's say:

 // Remove serial correlations
 val iidRdd = slicedRdd.mapSeries(series => ar(series, 1).removeTimeDependentEffects(series))

Now, how can we visualize iidRdd? The output looks like this:

 Array[(String, breeze.linalg.Vector[Double])] = Array((yahooFinance-2.csvOpen,DenseVector(5.796518650687262, -0.18321398725336557, -0.18386777312624858, -0.24702155899913159, 0.31460672650369137, -0.43430363037483755, 0.0641708692551024, -0.06061120212060356, 0.0641708692551024, 0.31438879787939644, -0.028271558999131585, -0.02838052331127905, 0.0027605123765743755)))

So there should be some explanation about how to visualize the result.

sryza commented 9 years ago

Ah, yeah, it's a good question. There are a couple of challenges.

The first is that Scala doesn't have great visualization libraries in the way that languages like R and Python do. I've had some success with breeze-viz, and spark-timeseries has a couple of utility methods in EasyPlot that make it possible to display a plot with a single line of code. E.g. to look at the first series in iidRdd:

 import com.cloudera.sparkts.EasyPlot._
 ezplot(iidRdd.first._2)

Or to compare the first five series on a single plot:

 ezplot(iidRdd.take(5).map(_._2))

The second challenge is that, conceptually, visualizing large collections of time series can be difficult. Five time series on a single plot might be easy to grok, but a plot of a thousand would just look like a jumble. In these cases, it is usually necessary to distill the dataset down to some summary statistics, or to take cross-sections at particular moments.
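As a rough illustration of the "summary statistics" and "cross-section" ideas, here is a minimal sketch in plain Scala. It assumes the keyed series have already been brought to the driver as a `Seq[(String, Array[Double])]`. The names `SeriesSummary`, `Summary`, `summarize`, and `crossSection` are hypothetical and are not part of spark-timeseries.

```scala
// Hypothetical sketch: plain Scala collections stand in for a keyed collection
// of time series. Nothing here is part of the spark-timeseries API.
object SeriesSummary {
  final case class Summary(mean: Double, stdDev: Double, min: Double, max: Double)

  // Reduce one series to a few summary statistics (population standard deviation).
  def summarize(values: Array[Double]): Summary = {
    require(values.nonEmpty, "series must not be empty")
    val mean = values.sum / values.length
    val variance = values.map(v => math.pow(v - mean, 2)).sum / values.length
    Summary(mean, math.sqrt(variance), values.min, values.max)
  }

  // Cross-section: the value of every series at one position in the time index.
  // Series that are too short to have position i are skipped.
  def crossSection(series: Seq[(String, Array[Double])], i: Int): Seq[(String, Double)] =
    series.collect { case (key, values) if i >= 0 && i < values.length => key -> values(i) }

  def main(args: Array[String]): Unit = {
    // Made-up illustrative data, not results from any real dataset.
    val series = Seq(
      "a" -> Array(1.0, 2.0, 3.0),
      "b" -> Array(2.0, 4.0, 6.0)
    )
    series.foreach { case (key, values) => println(s"$key: ${summarize(values)}") }
    println(crossSection(series, 1))
  }
}
```

Either result is small enough to plot or print directly: one row per series for the summaries, or one value per series for a cross-section at a chosen instant.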

Lastly, I'm working on Python bindings for the project right now. So once those are available, Python users will be able to get data easily into tools like matplotlib.

dashrathc commented 9 years ago

@sryza Thanks for the quick reply and the example of how to visualize the result. I have tried it and it's working fine.

mattweyant commented 9 years ago

I was originally envisioning a way to visualize the time-series, but having some visualizations to illuminate the API would be extremely helpful. With respect to visualizing time-series, I've been playing with wisp and it seems like it has potential.