redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

Feature request: window buffer #2963

Closed by danthegoodman1 2 hours ago

danthegoodman1 commented 4 hours ago

Processing one record at a time for AI inference limits rpcn to a small number of simple use cases.

Being able to window records by time and/or length would unlock far more use cases for AI inference. Combined with #2962 it could be used for custom inference.

For example, you could have a window of 30 records or 30 seconds and run continual sentiment analysis over a stream of data to see how it changes over time. A single record is not enough context to perform accurate sentiment analysis of a larger conversation/discussion.

Crucially, you need something like a group_by option that would allow the windowing to be grouped by some JSON path so that you could do things per-discussion, per-tenant, etc. Otherwise all context would be merged together, which would not be very useful when contexts are dynamic (e.g. new discussions being created and abandoned).

The window could have a TTL so that it doesn't sit there forever consuming memory. There's no need to persist it, because it can easily be rebuilt on reboot if the data still exists.
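
To make the request concrete, here is a rough Python sketch of the behaviour described above: a per-key window bounded by record count and record age, with idle keys dropped after a TTL. It is only an illustration of the idea, not how Redpanda Connect implements windowing. The class name GroupedWindow and the parameters max_len, max_age and ttl are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class GroupedWindow:
    """Per-key sliding window bounded by record count and record age.

    Hypothetical illustration only; names and parameters are assumptions.
    """

    def __init__(self, max_len=30, max_age=30.0, ttl=300.0):
        self.max_len = max_len    # keep at most this many records per key
        self.max_age = max_age    # drop records this many seconds old or older
        self.ttl = ttl            # drop a whole key after this much idle time
        self._windows = defaultdict(deque)  # key -> deque of (ts, record)
        self._last_seen = {}                # key -> timestamp of newest record

    def add(self, key, ts, record):
        """Add a record and return the current window for its key."""
        win = self._windows[key]
        win.append((ts, record))
        self._last_seen[key] = ts
        # Enforce the length bound.
        while len(win) > self.max_len:
            win.popleft()
        # Enforce the age bound relative to the newest record.
        while win and ts - win[0][0] >= self.max_age:
            win.popleft()
        return [r for _, r in win]

    def evict_idle(self, now):
        """Drop keys with no new records for longer than the TTL."""
        stale = [k for k, seen in self._last_seen.items() if now - seen > self.ttl]
        for k in stale:
            del self._windows[k]
            del self._last_seen[k]
        return stale
```

Each call to add returns the window for that key only, so a per-discussion or per-tenant model call never sees records from another group, and evict_idle can be run periodically to free memory for abandoned discussions.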

rockwotj commented 3 hours ago

Hi @danthegoodman1 does this guide satisfy your requirements or does it fall short somewhere?

https://docs.redpanda.com/redpanda-connect/configuration/windowed_processing/

danthegoodman1 commented 2 hours ago

I didn’t happen to see that, my mistake. But that solution seems to be fixed tumbling windows, when really what I want is sliding windows (emit on every new record), unless I’m understanding the section on creating windows wrong?
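
For anyone unsure of the distinction being drawn here, a minimal Python sketch of the two emission patterns, using count-based windows for simplicity. The function names tumbling and sliding are made up for illustration and say nothing about how Redpanda Connect implements either.

```python
from collections import deque


def tumbling(records, size):
    """Emit non-overlapping batches, one per `size` records."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == size:
            yield list(batch)
            batch.clear()
    # A trailing partial batch is not emitted in this sketch.


def sliding(records, size):
    """Emit the last `size` records on every new record."""
    win = deque(maxlen=size)  # the oldest record falls off automatically
    for r in records:
        win.append(r)
        yield list(win)


if __name__ == "__main__":
    print(list(tumbling([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4]]
    print(list(sliding([1, 2, 3, 4], 2)))      # [[1], [1, 2], [2, 3], [3, 4]]
```

With tumbling windows each record appears in exactly one emitted batch, whereas with sliding windows every new record triggers an emission that overlaps the previous one.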

rockwotj commented 2 hours ago

You can do sliding windows too: https://docs.redpanda.com/redpanda-connect/components/buffers/system_window/

danthegoodman1 commented 2 hours ago

Seems that probably does it!