twosigma / flint

A Time Series Library for Apache Spark
Apache License 2.0
993 stars 184 forks

add more information in Summarizer fromV #49

Closed: soloman817 closed this issue 5 years ago

soloman817 commented 5 years ago

In this interface: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/timeseries/summarize/Summarizer.scala#L211

Sometimes it is also good to know which row is being rendered. Is it possible to have something like: def fromV(v: V, t: T): InternalRow?

icexelloss commented 5 years ago

fromV() should only be called once, after all rows are summarized, so there is no particular row associated with it. Can you elaborate on why you think it should be associated with an input row?

soloman817 commented 5 years ago

Hi, thanks for the reply. I think I have figured it out now: it is not needed. The reason is that in https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/function/window/SummarizeWindows.scala#L833 , the state is stored as key -> state, where the map key is constructed from the time and the key. So you can actually know which row you are rendering if you store it in the state when the state is created in the add function.

soloman817 commented 5 years ago

I forgot to mention: I was using the summarizer in a window summarization, so the render function is called for each row to generate the results for that row.
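
For readers following the thread, the workaround described above (keep the row in the state inside add, so that render can see it) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: `SimpleSummarizer`, `Row`, `SumState`, `SumResult` and `SumWithLastRow` are hypothetical names invented for this sketch, and the trait is a simplified stand-in, not Flint's actual `Summarizer` interface.

```scala
// Simplified stand-in for a summarizer (hypothetical; NOT Flint's actual trait).
// T = input row type, U = intermediate state, V = rendered output.
trait SimpleSummarizer[T, U, V] {
  def zero(): U
  def add(u: U, t: T): U
  def render(u: U): V
}

// Hypothetical input row: a timestamp plus a value.
final case class Row(time: Long, value: Double)

// The state keeps the running sum AND the most recent row passed to add.
final case class SumState(sum: Double, lastRow: Option[Row])

// The output carries the sum together with the time of the last row seen.
final case class SumResult(sum: Double, lastTime: Option[Long])

object SumWithLastRow extends SimpleSummarizer[Row, SumState, SumResult] {
  def zero(): SumState = SumState(0.0, None)

  // Remember the row here, so render does not need it as an extra argument.
  def add(u: SumState, t: Row): SumState =
    SumState(u.sum + t.value, Some(t))

  // render only receives the state, yet still knows which row came last.
  def render(u: SumState): SumResult =
    SumResult(u.sum, u.lastRow.map(_.time))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1L, 1.0), Row(2L, 2.5), Row(3L, 4.0))

    // Render once per row over a growing prefix of the input. This only
    // loosely imitates a window summarization where render runs for each row.
    val perRow = rows
      .scanLeft(SumWithLastRow.zero())(SumWithLastRow.add)
      .tail
      .map(SumWithLastRow.render)

    perRow.foreach(println)
  }
}
```

In this sketch the row travels inside the state, so a render-style function needs no additional row parameter. How this maps onto Flint's real state handling in SummarizeWindows is not shown here.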