pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

ColumnNotFoundError when doing SQL `GROUP BY` on a column projection (function on column) #16258

Open jaihind213 opened 2 weeks ago

jaihind213 commented 2 weeks ago

Reproducible example

import polars as pl
from io import StringIO
my_csv = StringIO(
"""
"run_id","taxi_color","ride_time"
1,Red,2024-05-13 08:30:00
2,Yellow,2024-05-13 10:45:00
3,Green,2024-05-13 13:15:00
4,Blue,2024-05-13 16:20:00
5,Blue,2025-05-13 16:20:00
"""
)
df = pl.read_csv(my_csv, try_parse_dates=True)
ctx = pl.SQLContext(register_globals=True, eager_execution=False)
ctx.register("taxis", df)
query = """SELECT count(*) as total, 
EXTRACT(YEAR FROM ride_time) AS yr 
from taxis group by yr;
"""
sql_result = ctx.execute(query)
sql_result_collected = sql_result.collect()
print(sql_result_collected)

Log output

Traceback (most recent call last):
  File "/Users/vishnuch/work/gitcode/duckberg/polar2.py", line 20, in <module>
    sql_result = ctx.execute(query)
  File "/Users/vishnuch/work/utils/mamba/envs/duckberg/lib/python3.10/site-packages/polars/sql/context.py", line 269, in execute
    res = wrap_ldf(self._ctxt.execute(query))
polars.exceptions.ColumnNotFoundError: yr

Issue description

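When a `SELECT` applies a function to a column and the query then groups by the alias of that projection (here `yr`), `ctx.execute` raises `ColumnNotFoundError` for the alias instead of resolving it to the projected expression.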

Expected behavior

The query should return:

yr, total
2024, 4
2025, 1

PS: the same query works in PostgreSQL and DuckDB.
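For a quick comparison that needs no extra packages, the sketch below runs an equivalent query in SQLite through Python's standard `sqlite3` module. SQLite is a stand-in here, not one of the engines named above, and because it has no `EXTRACT`, `strftime('%Y', ...)` with a cast is swapped in for the year extraction. The `ORDER BY` is added only to make the output deterministic.

```python
import sqlite3

# Same five rows as the CSV in the reproducible example, timestamps kept as text.
rows = [
    (1, "Red", "2024-05-13 08:30:00"),
    (2, "Yellow", "2024-05-13 10:45:00"),
    (3, "Green", "2024-05-13 13:15:00"),
    (4, "Blue", "2024-05-13 16:20:00"),
    (5, "Blue", "2025-05-13 16:20:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxis (run_id INTEGER, taxi_color TEXT, ride_time TEXT)")
conn.executemany("INSERT INTO taxis VALUES (?, ?, ?)", rows)

# GROUP BY refers to the alias of the projected expression, as in the report.
query = """
SELECT count(*) AS total,
       CAST(strftime('%Y', ride_time) AS INTEGER) AS yr
FROM taxis
GROUP BY yr
ORDER BY yr;
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

SQLite resolves `yr` in `GROUP BY` to the projected expression, which is the behavior this report asks for in `SQLContext`.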

Installed versions

```
--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             macOS-12.6.4-arm64-arm-64bit
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```