Open ilyanoskov opened 3 months ago
To add more context, I was looking for a way to do an operation that is similar to diff
- take a previous row value, manipulate it, and then multiply by the next row value. I defined a window with hop=1, duration=1 but was only able to get those values after an additional join - which I found cumbersome.
This is a great point. Actually, diff
is a particularly unpleasant (slow and memory-inefficient) operation to perform in a streaming or distributed system in its full generality - and may cause unexpected performance bottlenecks. At the same time, it can be extremely useful. Whether or not it should be present as a keyword here is a recurring topic among repo maintainers.
If by any chance you are in the lucky place that your data has a column with sequential row numbers (seq = 1, 2, 3,...
), compute seq_next = seq+1
and join the table with itself on table_copy1.seq_next = table_copy2.seq
. It's not beautiful but it's (fairly) fast.
Next best case, if you can localize all the adjacent rows inside a small window, a reducer running a UDF over this window is a good idea.
In the most general case, you can use the sort
syntax to achieve the diff like here: https://pathway.com/developers/showcases/event_stream_processing_time_between_occurrences. We are working to optimize this approach for speed, but even then, it will always be a bit unwieldy for streaming systems at scale (especially with partitioned data).
Comment: I'm bookmarking https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions to have a terminology reference for Topic for future participants of the discussion (taken from the non-streaming window world).
I tried to implement some functionality that would give me the previous seen value inside a window and I have realised that it is currently quite cumbersome to do it in Pathway.
I think this feature will be useful for anyone trying to do not-so-trivial aggregations.