pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com
Other
2.84k stars 98 forks source link

Add lead / lag / first_value / last_value - like reducers. #18

Open ilyanoskov opened 3 months ago

ilyanoskov commented 3 months ago

I tried to implement some functionality that would give me the previous seen value inside a window and I have realised that it is currently quite cumbersome to do it in Pathway.

I think this feature will be useful for anyone trying to do not-so-trivial aggregations.

ilyanoskov commented 3 months ago

To add more context, I was looking for a way to do an operation that is similar to diff - take a previous row value, manipulate it, and then multiply by the next row value. I defined a window with hop=1, duration=1 but was only able to get those values after an additional join - which I found cumbersome.

dxtrous commented 3 months ago

This is a great point. Actually, diff is a particularly unpleasant (slow and memory-inefficient) operation to perform in a streaming or distributed system in its full generality - and may cause unexpected performance bottlenecks. At the same time, it can be extremely useful. Whether or not it should be present as a keyword here is a recurring topic among repo maintainers.

If by any chance you are in the lucky place that your data has a column with sequential row numbers (seq = 1, 2, 3,...), compute seq_next = seq+1 and join the table with itself on table_copy1.seq_next = table_copy2.seq. It's not beautiful but it's (fairly) fast.

Next best case, if you can localize all the adjacent rows inside a small window, a reducer running a UDF over this window is a good idea.

In the most general case, you can use the sort syntax to achieve the diff like here: https://pathway.com/developers/showcases/event_stream_processing_time_between_occurrences. We are working to optimize this approach for speed, but even then, it will always be a bit unwieldy for streaming systems at scale (especially with partitioned data).

dxtrous commented 3 months ago

Comment: I'm bookmarking https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions to have a terminology reference for Topic for future participants of the discussion (taken from the non-streaming window world).