kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0
351 stars 15 forks source link

feat: Duration-based sliding windows #380

Open kerinin opened 1 year ago

kerinin commented 1 year ago

Currently, you must provide a fixed width for sliding windows, for example

Foo | count(window=sliding(10, ...))

It would be useful to have the ability to configure the window width with respect to time, rather than count, for example:

Foo | count(window=preceding(minutes(10)))

The behavior of this window would be to produce a value at the time of each input. The value would be computed by applying the aggregation to all inputs that arrived within the preceding time duration.

This type of window has shown up in a number of existing feature definitions.

bjchambers commented 1 year ago

This makes sense. I think we need to think about how exactly it fits with other ways of expressing functions -- we know we want to revamp things there, so ideally it all "fits together" in a reasonable way.

One issue with this kind of window is that it requires a dynamic amount of storage. "10 minute windows sliding every minute" always requires storage for 10 partial aggregates. Because this kind of window suggests event-level granularity, it needs to store all of the events within those 10 minutes, which could be large.

Another option we've looked at was being able to parameterize this with the granularity. So we could say "10 minutes sliding every event", but making that extreme granularity explicit.

jordanrfrazier commented 1 year ago

We do have the ability to specify the window width with respect to time, but in a very limited fashion.

Foo | count(window=since(hourly()))

Is a (hopping) hourly window.

The example stated, as Ben says, requires two axis: the slide (per event) and the window width (every 10 minutes).

There have been proposals to adopt language more similar to what Microsoft uses here: https://learn.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics

which would let us say:

let my_window=hopping(window_unit=minute, window_width=10, slide_unit=events, slide_duration=1)
Foo | count(window=my_window)

This is much more explicit in parameterizing the window duration from the slide (or step, hop, etc).

kerinin commented 1 year ago

The example of since(hourly()) doesn't specify the window width with respect to time - each window's width is different. This example specifies the window's leading edge with respect to time. The proposal here is to provide a fixed-width window whose trailing edge aligns with each event.

I think the second example describes the logic we want, but I don't love the naming or arguments. Whenever possible, I think we should try to emphasize the implicit "now" that timelines give us - IDK if "preceding" is the right name, but I like that it anchors the window's definition at a point in time. WRT arguments, one of the strengths of the syntax is that it's concise and readable - I think we should try to avoid functions with a ton of arguments. For example, hopping(minute, 10, events, 1) is confusing to me - it only works with the parameter names in there, and then it ends up being really long. I think we're better off providing a handful of functions with a small number of arguments (ideally a single argument) over one function to rule them all via a bunch of argument-based configurations.