pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.9k stars 1.92k forks source link

SQLContext window function misinterpretation when combining PARTITION BY and ORDER BY #14635

Open tobackes opened 8 months ago

tobackes commented 8 months ago

Checks

Reproducible example

import polars as pl
import sqlite3

QUERY = "SELECT business_id, year, month, AVG(avg_stars) OVER (PARTITION BY business_id ORDER BY year,month) AS avg_stars_so_far FROM business_monthly_review_counts ORDER BY business_id, year, month"

con = sqlite3.connect('test.db')
cur = con.cursor()

lf_table = pl.LazyFrame( cur.execute("SELECT * FROM business_monthly_review_counts"),
                         schema = [ 'business_id', ('year',pl.Int16), ('month',pl.Int8), 'freq', ('avg_stars',pl.Float32) ]
                       )

con_table  = pl.SQLContext(business_monthly_review_counts=lf_table)
lf_query_1 = con_table        .execute( QUERY )
lf_query_2 = pl.LazyFrame( cur.execute( QUERY ),
                           schema = [ 'business_id', ('year',pl.Int16), ('month',pl.Int8), ('avg_stars_so_far',pl.Float32) ]
                         )

minimal_example = { 'groupcol':['a','a','a','b','b','b'], 'ordercol':[1,2,3,4,5,6], 'aggcol':[1,1,1,1,1,1] }
lf_min          = pl.LazyFrame(minimal_example)
con_min         = pl.SQLContext(frame=lf_min)
lf_min_query    = con_min.execute("SELECT groupcol, ordercol, SUM(aggcol) OVER (PARTITION BY groupcol ORDER BY ordercol) AS aggsum FROM frame")

print(pl.__version__)
print(lf_query_1[:20].collect())
print(lf_query_2[:20].collect())
print(lf_min_query   .collect())

Log output

0.20.10
shape: (20, 4)
┌────────────────────────┬──────┬───────┬──────────────────┐
│ business_id            ┆ year ┆ month ┆ avg_stars_so_far │
│ ---                    ┆ ---  ┆ ---   ┆ ---              │
│ str                    ┆ i16  ┆ i8    ┆ f32              │
╞════════════════════════╪══════╪═══════╪══════════════════╡
│ --6MefnULPED_I942VcFNA ┆ 2008 ┆ 8     ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2010 ┆ 11    ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2010 ┆ 12    ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2011 ┆ 1     ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2011 ┆ 5     ┆ 3.122222         │
│ …                      ┆ …    ┆ …     ┆ …                │
│ --6MefnULPED_I942VcFNA ┆ 2015 ┆ 9     ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2015 ┆ 12    ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 4     ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 5     ┆ 3.122222         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 8     ┆ 3.122222         │
└────────────────────────┴──────┴───────┴──────────────────┘
shape: (20, 4)
┌────────────────────────┬──────┬───────┬──────────────────┐
│ business_id            ┆ year ┆ month ┆ avg_stars_so_far │
│ ---                    ┆ ---  ┆ ---   ┆ ---              │
│ str                    ┆ i16  ┆ i8    ┆ f32              │
╞════════════════════════╪══════╪═══════╪══════════════════╡
│ --6MefnULPED_I942VcFNA ┆ 2008 ┆ 8     ┆ 5.0              │
│ --6MefnULPED_I942VcFNA ┆ 2010 ┆ 11    ┆ 4.5              │
│ --6MefnULPED_I942VcFNA ┆ 2010 ┆ 12    ┆ 3.5              │
│ --6MefnULPED_I942VcFNA ┆ 2011 ┆ 1     ┆ 3.375            │
│ --6MefnULPED_I942VcFNA ┆ 2011 ┆ 5     ┆ 3.3              │
│ …                      ┆ …    ┆ …     ┆ …                │
│ --6MefnULPED_I942VcFNA ┆ 2015 ┆ 9     ┆ 3.40625          │
│ --6MefnULPED_I942VcFNA ┆ 2015 ┆ 12    ┆ 3.323529         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 4     ┆ 3.416667         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 5     ┆ 3.289474         │
│ --6MefnULPED_I942VcFNA ┆ 2016 ┆ 8     ┆ 3.375            │
└────────────────────────┴──────┴───────┴──────────────────┘
shape: (6, 3)
┌──────────┬──────────┬────────┐
│ groupcol ┆ ordercol ┆ aggsum │
│ ---      ┆ ---      ┆ ---    │
│ str      ┆ i64      ┆ i64    │
╞══════════╪══════════╪════════╡
│ a        ┆ 1        ┆ 3      │
│ a        ┆ 2        ┆ 3      │
│ a        ┆ 3        ┆ 3      │
│ b        ┆ 4        ┆ 3      │
│ b        ┆ 5        ┆ 3      │
│ b        ┆ 6        ┆ 3      │
└──────────┴──────────┴────────┘

Issue description

There seems to be a bug in the execution of window functions in polars SQLContext.

This is a combination of PARTITION BY and ORDER BY.

(Totally inappropriate and unqualified question: Are you sure you cannot use some existing SQL engine somehow? I saw there were already previous bugs in interpreting window function queries.)

Expected behavior

The expected behavior is shown in the second table in the output. The problematic behavior is shown in the first table.

In the minimal example, the sum should be over the previous row values until the new group stars and then begin from 1.

That means it should be [1,2,3,1,2,3]

Installed versions

``` --------Version info--------- Polars: 0.20.10 Index type: UInt32 Platform: Linux-6.5.0-10022-tuxedo-x86_64-with-glibc2.35 Python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fsspec: gevent: hvplot: matplotlib: numpy: 1.26.3 openpyxl: pandas: 2.1.1 pyarrow: pydantic: pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: xlsxwriter: ```
alexander-beedie commented 6 months ago

Are you sure you cannot use some existing SQL engine somehow?

Our SQL interface translates directly to Polars calls, so clearly we cannot use a different one. If you want/need to use a different engine you can just go and use that engine - but we can't as we're translating a SQL query to Polars-native code ;)

I saw there were already previous bugs in interpreting window function queries.

Do you have some links to the related Issues? Could be helpful to link them together. If not, do feel free to submit them as new Issues or attach them to this one 👍