pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Parquet scanner doesn't do predicate pushdown for categoricals/enums #18868

Open deanm0000 opened 2 months ago

deanm0000 commented 2 months ago

Reproducible example

import polars as pl

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Categorical)
]).write_parquet('catpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    pl.scan_parquet('catpa.parquet').filter(pl.col('a') == 'a').collect()

Log output

parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.

Issue description

This is separate from, but closely related to, https://github.com/pola-rs/polars/issues/18867. Even with a file written by pyarrow, where the row-group statistics are correct, predicate pushdown doesn't work.
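For reference, the per-row-group statistics in the pyarrow-written file can be inspected directly; a minimal sketch using the file from the repro above:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile('catpa.parquet').metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    # With row_group_size=1 each group holds a single value, so min == max
    # and any group not matching the predicate could be skipped.
    print(i, stats.min, stats.max)
```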

If I explicitly make the right-hand side a Categorical, I get no verbose message at all, so I can't tell whether the pushdown is silently working or not:

with pl.Config(verbose=True):
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())

Even with a StringCache there is still no verbose output:

with pl.Config(verbose=True), pl.StringCache():
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())
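For comparison, the same query against a plain String column is expected to prune row groups based on the statistics, which suggests the problem is specific to the Categorical/Enum logical types. A sketch, with catpa_str.parquet as a hypothetical file name:

```python
import polars as pl

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.String)
]).write_parquet('catpa_str.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # Here the verbose log should report that row groups are skipped,
    # rather than "statistics not sufficient for predicate".
    pl.scan_parquet('catpa_str.parquet').filter(pl.col('a') == 'a').collect()
```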

Expected behavior

Predicate pushdown (row-group pruning) should work for Categorical columns, as it does for String columns.

Installed versions

```
--------Version info---------
Polars:               1.7.1
Index type:           UInt32
Platform:             Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx            0.3.2
deltalake             0.18.2
fastexcel
fsspec                2024.3.1
gevent
great_tables
matplotlib
nest_asyncio          1.6.0
numpy                 1.26.4
openpyxl
pandas                2.2.2
pyarrow               17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```
aberres commented 2 months ago

The same happens with enum columns.
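A minimal Enum variant of the repro above, for completeness (a sketch; enumpa.parquet is a hypothetical file name):

```python
import polars as pl

pl.DataFrame([
    pl.Series('a', ['a', 'b', 'c'], pl.Enum(['a', 'b', 'c']))
]).write_parquet('enumpa.parquet', row_group_size=1, use_pyarrow=True)

with pl.Config(verbose=True):
    # Same outcome as with Categorical: every row group is read.
    pl.scan_parquet('enumpa.parquet').filter(pl.col('a') == 'a').collect()
```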

aberres commented 11 hours ago

Some numbers from a real-life example:

[Screenshot: timing comparison of the same filter query against a String column vs. a Categorical column]

What I am reading is a ~100 MB Parquet file representing a ~30 GB in-memory frame. The filter matches no rows.

In the first case the "Assumption" column was written as a String column, in the second case as a Categorical column. The data resides in a Google Cloud bucket in Singapore, so I/O is costly.

As the numbers show, the missing pushdown has a dramatic effect.
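One possible mitigation until this is fixed, assuming pushdown works for plain String columns as sketched earlier (untested here): keep the column as String on disk and cast after filtering, so the predicate can still be pushed into the scan. data_str.parquet and the filter value are hypothetical:

```python
import polars as pl

lf = (
    pl.scan_parquet('data_str.parquet')
    .filter(pl.col('Assumption') == 'some value')  # pushed down on String stats
    .with_columns(pl.col('Assumption').cast(pl.Categorical))
)
print(lf.collect())
```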