pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com
Other
2.84k stars 98 forks source link

Add syntax to be able to reduce over multiple windows #19

Open ilyanoskov opened 3 months ago

ilyanoskov commented 3 months ago

I wonder if it's possible to have some additional syntax to make working with many windows possible?

For example, with ClickHouse SQL it's possible to work across many windows like this:

WITH sales AS (
    -- Your dataset source here
)
SELECT
    date,
    salesperson_id,
    region,
    amount,
    product_id,
    ROW_NUMBER() OVER w_region_amount AS row_number_region,
    RANK() OVER w_salesperson_amount AS rank_salesperson,
    DENSE_RANK() OVER w_product_amount AS dense_rank_product,
    SUM(amount) OVER w_region AS sum_sales_region,
    AVG(amount) OVER w_region AS avg_sales_region,
    MAX(amount) OVER w_salesperson AS max_sales_salesperson,
    MIN(amount) OVER w_product AS min_sales_product,
    LEAD(amount, 1) OVER w_salesperson_date AS lead_amount,
    LAG(amount, 1) OVER w_salesperson_date AS lag_amount,
    NTILE(10) OVER w_global_amount AS decile_rank_by_amount,
    FIRST_VALUE(salesperson_id) OVER w_region_amount AS top_salesperson_region,
    LAST_VALUE(salesperson_id) OVER w_region_amount_rows AS last_salesperson_region,
    COUNT(*) OVER w_region AS count_sales_region,
    PERCENT_RANK() OVER w_region_amount AS percent_rank_region,
    CUME_DIST() OVER w_region_amount AS cume_dist_region
FROM sales
WINDOW
    w_region AS (PARTITION BY region),
    w_salesperson AS (PARTITION BY salesperson_id),
    w_product AS (PARTITION BY product_id),
    w_region_amount AS (PARTITION BY region ORDER BY amount DESC),
    w_salesperson_amount AS (PARTITION BY salesperson_id ORDER BY amount DESC),
    w_product_amount AS (PARTITION BY product_id ORDER BY amount DESC),
    w_salesperson_date AS (PARTITION BY salesperson_id ORDER BY date),
    w_global_amount AS (ORDER BY amount DESC),
    w_region_amount_rows AS (PARTITION BY region ORDER BY amount DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY region, amount DESC;

With Pathway, I found myself having to define each window as a separate table, and then joining them back together. The resulting code was quite verbose.

Unless I am missing something and it's possible to do it succinctly with Pathway?

dxtrous commented 3 months ago

Defining each window as a separate table (with groupby) is probably a good idea. As for joining it all back, it is convenient to use ix_ref - here is an example in the SQL-window spirit to compare the salary of an employee to the average salary in their position & department: https://pathway.com/developers/user-guide/data-transformation/indexing-grouped-tables/#multi-values-indexing. Calling in @izulin - maybe you know of something cleaner. Either way, it would help to explain that this is the recommended way to implement SQL-windows (not to be confused with streaming windows).

ilyanoskov commented 3 months ago

Thanks a lot @dxtrous, will give it a try!

izulin commented 3 months ago

With Pathway, I found myself having to define each window as a separate table, and then joining them back together. The resulting code was quite verbose.

Unless I am missing something and it's possible to do it succinctly with Pathway?

Hi Ilya,

could you post an example of a verbose/unwieldy pathway code here? This could be a great starting point on how to extend syntax and/or tutorials.