pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Pyarrow pushdown affected by other filters #12501

Open edavisau opened 11 months ago

edavisau commented 11 months ago

Reproducible example

from datetime import datetime, timedelta
import polars as pl
import pyarrow.dataset as ds

base_datetime = datetime.now()

dataset = ds.dataset(pl.DataFrame([
    pl.Series(name="int_col", values=[1, 2, 3, 4], dtype=pl.Int8),
    pl.datetime_range(base_datetime, base_datetime + timedelta(hours=3), interval="1h", time_unit="us", eager=True).alias("date_col"),
]).to_arrow())
print(
    "Dataset\n",
    dataset.head(5)
)

# Dataset
# pyarrow.Table
# int_col: int8
# date_col: timestamp[us]
# ----
# int_col: [[1,2,3,4]]
# date_col: [[2023-11-16 12:04:26.237219,2023-11-16 13:04:26.237219,2023-11-16 14:04:26.237219,2023-11-16 15:04:26.237219]]

print(
    "\nExample 1\n",
    pl.scan_pyarrow_dataset(dataset)
    .filter(pl.col("int_col").gt(2))
    .explain(optimized=True)
)

# Example 1
# 
#  PYTHON SCAN 
#  PROJECT */2 COLUMNS
#  SELECTION: (pa.compute.field('int_col') > 2)

print(
    "\nExample 2\n",
    pl.scan_pyarrow_dataset(dataset)
    .filter(pl.col("int_col").gt(2))
    .filter(pl.col("date_col").eq(base_datetime))
    .explain()
)

# Example 2
# FILTER [([(col("int_col")) > (2)]) & ([(col("date_col")) == (2023-11-16 12:04:26.237219)])] FROM
#
#  PYTHON SCAN 
#  PROJECT */2 COLUMNS

Log output

No response

Issue description

When a filter that the pyarrow pushdown supports is combined with a second filter that it does not support, the first filter is no longer pushed down either. As Example 2 shows, both filters end up in a FILTER node above the scan.

(I was not sure whether this should be classified as a bug or an enhancement.)

Expected behavior

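In the second example above, the supported filter should still be pushed into the scan, leaving only the unsupported one in the FILTER node. We should see something like: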

Example 2
 FILTER [(col("date_col")) == (2023-11-16 12:09:59.661913)] FROM

  PYTHON SCAN 
  PROJECT */2 COLUMNS
  SELECTION: (pa.compute.field('int_col') > 2)
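The expected plan amounts to splitting an AND-ed predicate into the conjuncts the scan can evaluate and a residual filter applied afterwards. The sketch below illustrates that idea in plain Python only. It is not how polars implements its optimizer, and `Predicate`, `split_conjuncts`, `scan`, and `run` are hypothetical names introduced for illustration. The split is valid because the filters are combined with AND, so each conjunct can be applied independently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Predicate:
    """One conjunct of a filter (hypothetical type, for illustration only)."""
    name: str
    test: Callable[[dict], bool]
    pushable: bool  # assumption: we already know whether the scan supports it


def split_conjuncts(predicates):
    """Partition AND-ed predicates into (pushed to the scan, residual filter)."""
    pushed = [p for p in predicates if p.pushable]
    residual = [p for p in predicates if not p.pushable]
    return pushed, residual


def scan(rows, pushed):
    """Stand-in for the scan: returns only rows passing every pushed predicate."""
    return [r for r in rows if all(p.test(r) for p in pushed)]


def run(rows, predicates):
    """Push down what is supported, then apply the rest above the scan."""
    pushed, residual = split_conjuncts(predicates)
    scanned = scan(rows, pushed)
    result = [r for r in scanned if all(p.test(r) for p in residual)]
    return result, len(scanned)


# Toy data standing in for the dataset; "t1".."t4" replace the real timestamps.
rows = [{"int_col": i, "date_col": f"t{i}"} for i in range(1, 5)]
predicates = [
    Predicate("int_col > 2", lambda r: r["int_col"] > 2, pushable=True),
    Predicate("date_col == t3", lambda r: r["date_col"] == "t3", pushable=False),
]

result, rows_scanned = run(rows, predicates)
print(result, rows_scanned)
```

With partial pushdown the scan hands back two rows instead of four, and the unsupported comparison only has to run on those two.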

Installed versions

```
--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            Linux-5.15.0-60-generic-x86_64-with-glibc2.35
Python:              3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:         2.2.1
connectorx:
deltalake:
fsspec:              2023.1.0
gevent:
matplotlib:          3.7.0
numpy:               1.23.5
openpyxl:            3.1.1
pandas:              1.5.3
pyarrow:             14.0.1
pydantic:            1.10.5
pyiceberg:
pyxlsb:
sqlalchemy:          1.4.46
xlsx2csv:            0.8.1
xlsxwriter:
```
edavisau commented 11 months ago

Note that, until this filter pushdown is implemented in polars, a workaround is to push the filter down manually.

The simplest way is:

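lf = (
    # Apply the filter polars does not push down directly on the pyarrow dataset...
    pl.scan_pyarrow_dataset(
        dataset.filter(ds.field("date_col") == base_datetime)
    )
    # ...and let polars push down the filter it does support.
    .filter(pl.col("int_col").gt(2))
)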

This only works for filter expressions that can be expressed in pyarrow.