pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.3k stars 1.86k forks source link

"WHERE true/false" breaks in SQLContext on DataFrame with null columns #18373

Closed cmadlener closed 1 week ago

cmadlener commented 3 weeks ago

Checks

Reproducible example

import polars as pl

def main():
    df = pl.DataFrame({"null_column": [None, None]})

    ctx = pl.SQLContext({"df": df}, eager=True)
    result = ctx.execute("SELECT * FROM df WHERE true")
    print(result)

if __name__ == "__main__":
    main()

Log output

Traceback (most recent call last):
  File "/home/christoph/polars-select-where-true/main.py", line 13, in <module>
    main()
  File "/home/christoph/polars-select-where-true/main.py", line 8, in main
    result = ctx.execute("SELECT * FROM df WHERE false")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/christoph/polars-select-where-true/.venv/lib/python3.11/site-packages/polars/sql/context.py", line 440, in execute
    return res.collect() if (eager or self._eager_execution) else res
           ^^^^^^^^^^^^^
  File "/home/christoph/polars-select-where-true/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ShapeError: filter's length: 1 differs from that of the series: 2

Issue description

In practice, I'm not actually using true and false as conditions, but statements like 'a' = 'a' (which are dynamically created). Everything works as expected if the empty columns have a string type.

Expected behavior

I get the entire DataFrame when the WHERE clause evaluates to true, and an empty DataFrame when it evaluates to false.

Installed versions

``` --------Version info--------- Polars: 1.5.0 Index type: UInt32 Platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35 Python: 3.11.3 (main, May 8 2023, 10:18:23) [GCC 11.3.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: great_tables: hvplot: matplotlib: nest_asyncio: numpy: 2.1.0 openpyxl: pandas: pyarrow: 17.0.0 pydantic: pyiceberg: sqlalchemy: torch: ```
alexander-beedie commented 1 week ago

@cmadlener: thanks for this one - have made an update, and it'll be supported in the upcoming 1.7.0 release.

cmadlener commented 1 week ago

Thank you, @alexander-beedie for the quick response and fix! Looking forward to using it!

cmadlener commented 6 days ago

I just had a chance to try the latest release. I have a follow-up question: Would it be feasible to recursively apply the shortcutting logic to SQLExpr::Nested? Right now, if you slightly modify the WHERE clause by wrapping it in parenthesis, you will get the same exception as before.

import polars as pl

df = pl.DataFrame({"null_column": [None, None]})

ctx = pl.SQLContext({"df": df}, eager=True)
result = ctx.execute("SELECT * FROM df WHERE (true)")

Another case (that may also be deemed practically not relevant) is if you have multiple trivial clauses, e.g. WHERE true AND 'a' = 'a'.

Looking at the phrasing in #18571 it might have been deliberate to only cover the "top-level" case. It makes sense to me that we don't want to more or less parse the whole expression to figure out the shortcutting. It makes me wonder, though, if the root cause here lies somewhere deeper in parse_sql_expr resp. SQLExprVisitor and should be fixed there.