kaskada-ai / kaskada

Modern, open-source event-processing
https://kaskada.io/
Apache License 2.0

feat: improve aggregations api #160

Open jbellis opened 1 year ago

jbellis commented 1 year ago

It looks like there's a ton of boilerplate involved in creating an aggregation function. aggregate is substantially identical across first_string, last_string, and top_string; evaluate is identical across even more functions; and aggregate_since has duplication both across different functions and with respect to aggregate in the same function.

Additionally, it's not obvious why many of the functions have similar implementations for both X and two_stacks_X.
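
To make the duplication concrete, here is a minimal sketch of how a shared aggregate loop could be factored out behind a trait. Every name in it (Select, First, Last, and this signature of aggregate) is hypothetical and not taken from the kaskada codebase; the real functions operate on Arrow arrays, which this sketch replaces with plain slices.

```rust
// Hypothetical sketch: none of these names come from the kaskada codebase.

/// Decides how a new input value updates the accumulated value.
trait Select {
    fn update(current: &mut Option<String>, input: &str);
}

/// Keeps the first non-null value ever seen.
struct First;

impl Select for First {
    #[inline]
    fn update(current: &mut Option<String>, input: &str) {
        if current.is_none() {
            *current = Some(input.to_owned());
        }
    }
}

/// Keeps the most recent non-null value.
struct Last;

impl Select for Last {
    #[inline]
    fn update(current: &mut Option<String>, input: &str) {
        *current = Some(input.to_owned());
    }
}

/// One shared loop. Monomorphization compiles a separate copy for each `S`,
/// so `S::update` can be inlined into each copy.
fn aggregate<S: Select>(inputs: &[Option<&str>]) -> Vec<Option<String>> {
    let mut state: Option<String> = None;
    let mut out = Vec::with_capacity(inputs.len());
    for &input in inputs {
        // Null inputs leave the accumulated value unchanged.
        if let Some(value) = input {
            S::update(&mut state, value);
        }
        out.push(state.clone());
    }
    out
}
```

Calling `aggregate::<First>(&inputs)` or `aggregate::<Last>(&inputs)` then selects the behavior at compile time, so only the `update` body differs between the two functions.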

bjchambers commented 1 year ago

FWIW: This relates at least partially to specialization and efficiency of the inner loops. I suspect there are ways to use generic parameters and still get specialization, but some of the boilerplate exists because it gets inlined away and produces simpler inner loops. For instance, aggregate_since has conditionals that aggregate doesn't. I believe we've seen the loop get auto-vectorized when the conditional is absent, but not when it is present.

I think we absolutely should revisit this to see whether there are ways to reduce boilerplate and duplication. But we should also benchmark (and possibly look at the generated assembly for some of the critical loops) to make sure we don't regress the potential for those loops to be vectorized.
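
One way the "generic parameters that still specialize" idea might look is a const generic flag, sketched below. This is an illustration under assumptions, not the project's code: running_sum and both signatures are hypothetical, the inputs are plain slices rather than Arrow arrays, and "since" is assumed to mean that the accumulator resets after a tick.

```rust
// Hypothetical sketch; assumes "since" means the accumulator resets after a tick.

/// Shared loop body. `SINCE` is a compile-time constant, so each
/// instantiation is compiled as its own function.
fn running_sum<const SINCE: bool>(inputs: &[i64], ticks: &[bool]) -> Vec<i64> {
    let mut acc = 0i64;
    let mut out = Vec::with_capacity(inputs.len());
    for (i, &x) in inputs.iter().enumerate() {
        acc += x;
        out.push(acc);
        // In the `SINCE = false` instantiation this condition is constant
        // false, so `ticks` is never read and the optimizer can drop the branch.
        if SINCE && ticks[i] {
            acc = 0;
        }
    }
    out
}

/// Plain aggregation: no reset conditional in the compiled loop.
fn aggregate(inputs: &[i64]) -> Vec<i64> {
    running_sum::<false>(inputs, &[])
}

/// Aggregation that resets after each tick.
fn aggregate_since(inputs: &[i64], ticks: &[bool]) -> Vec<i64> {
    assert_eq!(inputs.len(), ticks.len());
    running_sum::<true>(inputs, ticks)
}
```

Whether the `false` instantiation actually vectorizes as well as a hand-written loop is exactly what the benchmarks and assembly inspection would need to confirm; the sketch only shows that the source-level duplication can be removed without keeping the conditional in both compiled loops.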