lsst-sqre / kafka-aggregator

A Kafka aggregator based on the Faust Python Stream Processing library
https://kafka-aggregator.lsst.io

Support for SlidingWindow #119

Open fonty422 opened 3 years ago

fonty422 commented 3 years ago

Faust appears to support a sliding window. Any chance this package currently supports a sliding window and whether there are any limitations on how large that window can get in either time-size or record size before it no longer performs?

afausti commented 3 years ago

That would be an interesting addition to enable this package to compute moving averages, for example. Did you see any documentation on sliding windows in Faust? The only options I found are hopping and tumbling windows.

As for the performance, Faust leverages Kafka's parallelism model. In principle, if you need a large sliding window or process large messages you can increase the number of topic partitions and run more Faust workers.
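To make the hopping vs. sliding distinction concrete, here is a minimal plain-Python sketch (not the Faust API) of how a timestamp maps to windows under each policy. The window arithmetic is illustrative; a hopping window with a small `step` relative to `size` approximates a sliding window, at the cost of each record landing in `size / step` overlapping buckets.

```python
def tumbling_window(ts, size):
    """Return the single tumbling window [start, start + size) containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, step):
    """Return every hopping window [start, start + size) that contains ts.
    Windows start at multiples of `step`, so each record falls into
    size / step overlapping windows."""
    # Smallest window start that still covers ts: start > ts - size.
    first = ((ts - size) // step + 1) * step
    windows = []
    start = first
    while start <= ts:
        windows.append((start, start + size))
        start += step
    return windows

# A record at t=7 is in exactly one tumbling window of size 10,
# but in size/step = 5 overlapping hopping windows (size=10, step=2).
print(tumbling_window(7, 10))
print(hopping_windows(7, 10, 2))
```

With `step` much smaller than `size`, each hopping bucket's aggregate is close to a true sliding-window aggregate, which is the usual workaround when only hopping/tumbling windows are available.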

fonty422 commented 3 years ago

Yeah, there's a sliding window class, and I have an open issue on the faust-streaming repo that links to several classes and modules that might provide access. I just can't get any of them to work as I would expect. I think in general people want some basic aggregates - count, mean, min, max, sum - over a continuously moving window. In my mind, the basic approach is to apply a calculation/function when a record enters the window, then apply the opposite function to remove its contribution when it exits. Normal windows know when something leaves the window.
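The enter/exit idea above can be sketched in plain Python (independent of Faust): keep a running sum and count, adding a value's contribution when it enters the window and subtracting it back off when it ages out.

```python
from collections import deque

class SlidingMean:
    """Time-based sliding-window mean in O(1) per update.
    On entry a value is added to the running sum; on exit the inverse
    operation (subtraction) removes it again."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.buf = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0
        self.count = 0

    def add(self, ts, value):
        self.buf.append((ts, value))
        self.total += value          # forward operation on entry
        self.count += 1
        self._evict(ts)

    def _evict(self, now):
        # Drop records older than the window and undo their contribution.
        while self.buf and self.buf[0][0] <= now - self.window:
            _, old = self.buf.popleft()
            self.total -= old        # inverse operation on exit
            self.count -= 1

    def mean(self, now):
        self._evict(now)
        return self.total / self.count if self.count else None

agg = SlidingMean(window_seconds=60)
agg.add(0, 10.0)
agg.add(30, 20.0)
print(agg.mean(now=30))   # 15.0 -- both records still in the window
print(agg.mean(now=70))   # 20.0 -- the t=0 record has expired
```

Note that sum, count, and mean have cheap inverses, so this works; min and max do not, and would need a different structure (e.g. a monotonic deque) rather than a simple subtract-on-exit.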

With my particular use case, the issue is that some keyed messages might sporadically produce thousands more messages than usual or than others, so a particular partition might become overloaded randomly and without warning. Considering the window is 3 months long, I wondered whether there's a topic size limit to consider - say 1 message per second from a misbehaving/troubled input, that's ~8M records over the window period. My understanding (and correct me if I'm wrong) is that a stream processor (aka agent) can be run by multiple workers, but each worker can only read from specific partitions, so if you need to process by key you have to ensure that a given key always ends up in the same partition. That makes aggregations painful when a single partition by chance ends up with a few million more records than all the others. If those aggregations are held in a table, then each worker can only see its own table and not those of others, meaning you can't split the workload across multiple workers.
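The hot-key problem described above follows directly from key-based partitioning. A small sketch (using `crc32` as a stand-in for Kafka's default murmur2-based partitioner, purely for illustration) shows why a single noisy key pins all of its load to one partition no matter how many workers you add:

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Kafka's default partitioner hashes the key (murmur2) modulo the
    # partition count; crc32 stands in here just to show the pinning.
    return zlib.crc32(key) % num_partitions

# Every record for a given key lands on the same partition...
assert partition_for(b"sensor-42") == partition_for(b"sensor-42")

# ...so a "hot" key piles all of its records onto one partition:
load = Counter()
for key, n_records in [(b"sensor-1", 10), (b"sensor-2", 10),
                       (b"sensor-42", 1_000_000)]:
    load[partition_for(key)] += n_records
print(load)  # one partition carries ~1M records regardless of worker count
```

Since exactly one worker owns that partition (and its table shard), adding workers doesn't help a single hot key; the usual mitigations are sub-keying (splitting a hot key into key-0..key-N and merging downstream) or a two-stage aggregation topology.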

If there's another way to do it, please let me know.