twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0
1k stars 184 forks source link

usage of merge of a summarizer #45

Closed soloman817 closed 6 years ago

soloman817 commented 6 years ago

I created an experimental summarizer to try to understand how it works: https://github.com/soloman817/flint/blob/feature/experiment/src/test/scala/com/twosigma/flint/timeseries/experiment/AccumulateSummarizerSpec.scala

There I print informations when the methods of a summarizer are called, such as add, merge, etc.

I found that, if I call TimeSeriesRDD.summarize(), then the subtract will not be called, and if I have multiple partitions, the merge will be called. which means, aggregation on all rows will run in parallel and eventually merged into final result.

But if I run the summarizer with TimeSeriesRDD.summarizeWindow(), then for the window aggregation, add and subtract will be called, but not merge. Which means, inside one window, it is not parallel.

Am I right? This knowlege will be very helpful for my implementation to our problem, which is an extension of that experimental code.

Thanks, Xiang.

icexelloss commented 6 years ago

Which means, inside one window, it is not parallel.

This is correct.

soloman817 commented 6 years ago

Thanks.