Allow batch() option to not process empty windows?

quarks-edge / quarks

Quarks is an open source programming model and runtime for edge devices that enables you to analyze data and events at the device.

http://quarks-edge.github.io/

Apache License 2.0

123 stars 42 forks source link

Allow batch() option to not process empty windows? #75

Open ddebrunner opened 8 years ago

ddebrunner commented 8 years ago

The current batch implementation will process empty windows when the window is the last N seconds.

I wonder if this is the desired or expected behaviour. I see batching as breaking a stream into batches of tuples, thus processing an empty one doesn't seem to make much sense.

Batch could have an option to allow processing of empty windows, defaulting to false, or just not support it.

Throughts?

dlaboss commented 8 years ago

Doesn't feel like it's worth the addition... though maybe that's just because I haven't had to live with it as it is :-)

I imagine some uses would need to know that nothing was captured during the last N sec window and react as appropriate for the app (e.g., log, publish, generate a special tuple).

Seems clear/natural to expect the batcher is called every N sec with whatever tuples arrived during that time, hence the list may be empty.

Seems simple enough for uses that don't care about that case, if it doesn't fallout naturally from their processing, to just make the check. e.g.,

TStream s = s.last(10, SECONDS).batch(
                                      tuples -> { if (tuples.isEmpty()) return null;
                                      ... process the batch
                                      };

In a common case where "process the batch" involves a math3 analytic (e.g., SUM), maybe we define those analytics to return null if the input list is empty thereby eliminating having to write an explicit check?

ddebrunner commented 8 years ago

I somewhat agree with @dlaboss , but while it's simple to have the isEmpty() it doesn't help reduce cpu cost of potentially many schedules of a task that ends up doing nothing.

Maybe it's a YAGNI, and maybe it's more than just a boolean, but potentially a mode, e.g.:

Always process empty partition
Only process the first in a sequence of empty partition.
Never process empty partition
...

Maybe also it's a property of the window, so it applies to any window operation. If it's empty then do nothing for any window operation. This might also be easier to add in the future without breaking compatibility or having to have multiple overloads of batch().

vdogaru commented 8 years ago

I agree that implementing this as an empty partition policy is easy to add without breaking compatibility, but I think of handling empty windows as an attribute of the aggregate function rather than of the window.

Assume I want to support multiple aggregate functions triggered by the same window. To do that without extending Quarks, I have to create identical windows for each aggregate function. If this becomes unacceptable (e.g. high memory consumption if windows are large), and the alternative is to update the current window implementation to support a list of aggregates per window, then a per-window policy would then apply to all the aggregates.

ddebrunner commented 8 years ago

@vdogaru ?> To do that without extending Quarks, I have to create identical windows for each aggregate function.

Are you saying that is the case now, or if the empty partition policy was per window?

vdogaru commented 8 years ago

I am saying that adding an empty partition policy per window would force all the aggregates triggered by that window to share the window's policy.

ddebrunner commented 8 years ago

Right, but that doesn't force each aggregation to have its own window. Only ones with different policies.

wmarshall484 commented 8 years ago

that doesn't force each aggregation to have its own window

Currently, each time TWindow.aggregate is called, it creates a new Aggregate<T,U,K> oplet which contains its own Window. So in that sense, each aggregation does have its own window.

ddebrunner commented 8 years ago

@wmarshall484 A mere implementation detail, from the api they use a single window.

The implementation could be changed to have a single oplet for a window.

ddebrunner commented 8 years ago

Another situation that may not matter for Quarks being on the edge, but when there are thousands of partitions processing empty partitions can add up.