pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Not clear if Parquet statistics are used when filter applied #16740

Open braaannigan opened 2 months ago

braaannigan commented 2 months ago

Reproducible example

```python
from datetime import datetime
import polars as pl

start = datetime(2000, 1, 1)
stop = datetime(2002, 1, 1, 5)
df_long = (
    pl.DataFrame(
        {
            "datetime": pl.datetime_range(start, stop, interval="1m", eager=True)
        }
    )
    .join(
        pl.DataFrame(
            {
                "grp": [f"grp_{i:02}" for i in range(10)]
            }
        ),
        how="cross",
    )
    .with_columns(
        [pl.lit(1.0).alias(f"col_{j:02}") for j in range(10)]
    )
)

# Write the files, once without and once with statistics
long_parquet_path_no_stats = "data_long.parquet"
df_long.write_parquet(long_parquet_path_no_stats, statistics=False)
long_statistics_parquet_path = "data_long_statistics.parquet"
df_long.write_parquet(long_statistics_parquet_path, statistics=True)
```

Time with no statistics:

```python
%%timeit -n1 -r3
(
    pl.scan_parquet(long_parquet_path_no_stats)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)
```

Time with statistics:

```python
%%timeit -n1 -r3
(
    pl.scan_parquet(long_statistics_parquet_path)
    .filter(pl.col("datetime") < datetime(2000, 3, 1))
    .collect()
)
```
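For readers running this outside a Jupyter notebook, the `%%timeit` cells above can be approximated with a small stdlib timer (the helper name here is illustrative, not part of the issue's code):

```python
import time

def best_of(fn, repeats=3):
    """Run fn `repeats` times and return the best wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

# Usage sketch, assuming the scan/filter query above is wrapped in a function:
# query = lambda: (
#     pl.scan_parquet(long_statistics_parquet_path)
#     .filter(pl.col("datetime") < datetime(2000, 3, 1))
#     .collect()
# )
# print(f"best of 3: {best_of(query):.3f}s")

# Self-contained demo with a trivial workload:
elapsed = best_of(lambda: sum(range(100_000)))
print(elapsed >= 0.0)  # → True
```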

Log output

parquet file must be read, statistics not sufficient for predicate.

This message appears once for each row group.
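For anyone reproducing this: the per-row-group message above comes from Polars' verbose logging, which can be enabled via the `POLARS_VERBOSE` environment variable (or `pl.Config.set_verbose(True)` from Python) before running the query:

```python
import os

# Must be set before the query runs for the per-row-group
# "statistics not sufficient for predicate" messages to appear.
os.environ["POLARS_VERBOSE"] = "1"

print(os.environ["POLARS_VERBOSE"])  # → 1
```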

The stats themselves appear reasonable - here are the first row groups:

```python
import pyarrow.parquet as pq

pqmeta = pq.read_metadata(long_statistics_parquet_path)
for i in range(pqmeta.num_row_groups):
    stats = pqmeta.row_group(i).column(0).statistics  # column 0 is "datetime"
    print(f"rg{i} min = {stats.min} max = {stats.max}")
```

```
rg0 min = 2000-01-01 00:00:00 max = 2000-01-19 06:43:00
rg1 min = 2000-01-19 06:43:00 max = 2000-02-06 13:26:00
rg2 min = 2000-02-06 13:27:00 max = 2000-02-24 20:10:00
rg3 min = 2000-02-24 20:10:00 max = 2000-03-14 02:53:00
rg4 min = 2000-03-14 02:54:00 max = 2000-04-01 09:37:00
rg5 min = 2000-04-01 09:37:00 max = 2000-04-19 16:20:00
rg6 min = 2000-04-19 16:21:00 max = 2000-05-07 23:04:00
rg7 min = 2000-05-07 23:04:00 max = 2000-05-26 05:47:00
```
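For context, row-group pruning with these statistics is just an interval check: for the predicate `datetime < 2000-03-01`, a row group can be skipped whenever its minimum is already at or past the cutoff, since no row inside it can match. A stdlib sketch over the min/max values printed above (the skip rule is the standard Parquet pruning logic, not Polars internals):

```python
from datetime import datetime

# (min, max) per row group, copied from the pyarrow output above
row_groups = [
    (datetime(2000, 1, 1, 0, 0), datetime(2000, 1, 19, 6, 43)),     # rg0
    (datetime(2000, 1, 19, 6, 43), datetime(2000, 2, 6, 13, 26)),   # rg1
    (datetime(2000, 2, 6, 13, 27), datetime(2000, 2, 24, 20, 10)),  # rg2
    (datetime(2000, 2, 24, 20, 10), datetime(2000, 3, 14, 2, 53)),  # rg3
    (datetime(2000, 3, 14, 2, 54), datetime(2000, 4, 1, 9, 37)),    # rg4
    (datetime(2000, 4, 1, 9, 37), datetime(2000, 4, 19, 16, 20)),   # rg5
    (datetime(2000, 4, 19, 16, 21), datetime(2000, 5, 7, 23, 4)),   # rg6
    (datetime(2000, 5, 7, 23, 4), datetime(2000, 5, 26, 5, 47)),    # rg7
]

cutoff = datetime(2000, 3, 1)

# For `col < cutoff`, a row group is prunable iff its minimum is >= cutoff.
skippable = [i for i, (lo, _hi) in enumerate(row_groups) if lo >= cutoff]
print(skippable)  # → [4, 5, 6, 7]
```

So four of the eight row groups shown could be skipped outright for this predicate, which is why the lack of a speedup is surprising.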

Issue description

I've been trying to demonstrate the effect of statistics in Parquet files, but I'm not finding any effect - the query takes the same amount of time when reading with and without stats. With help from @deanm0000 I've seen that the verbose logging "statistics not sufficient for predicate" appears for each row group.

No verbose output appears if we use a pl.datetime literal instead of a Python datetime, but the query is still no faster.

Any ideas what's happening?

Expected behavior

I would expect the statistics to be used resulting in a faster query.

P.S. - if I sort by the grp string column and do an equality filter, the verbose output shows that the statistics do get used and the query is 2x faster.
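For the equality case in the P.S., the analogous pruning rule is that a row group can be skipped when the literal falls outside its [min, max] range. A sketch with entirely hypothetical string bounds (assuming the sorted write packs each grp value into contiguous row groups):

```python
# Hypothetical (min, max) string stats per row group after sorting by "grp"
row_groups = [
    ("grp_00", "grp_02"),  # rg0
    ("grp_02", "grp_05"),  # rg1
    ("grp_05", "grp_07"),  # rg2
    ("grp_07", "grp_09"),  # rg3
]

target = "grp_08"

# For `col == target`, a row group is prunable iff target lies outside [min, max].
skippable = [i for i, (lo, hi) in enumerate(row_groups) if not (lo <= target <= hi)]
print(skippable)  # → [0, 1, 2]
```

This is consistent with the observed 2x speedup: with a sorted column, most row groups' ranges exclude the target value.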

Installed versions

```
--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             Linux-5.10.104-linuxkit-x86_64-with-glibc2.28
Python:               3.10.1 (main, Dec 21 2021, 09:50:13) [GCC 8.3.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:            0.10.4
fsspec:               2024.6.0
gevent:
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.30
torch:
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
```
deanm0000 commented 2 months ago

Just as a quick follow-up/elaboration: if the stats are on an int or bool column then it works as expected. I think the issue is that not all the dtypes are fully implemented.

kcajf commented 1 month ago

I am also seeing this in Polars 1.0.0 with a parquet dataset containing a column of type Timestamp(isAdjustedToUTC=false, timeUnit=milliseconds, is_from_converted_type=false, force_set_converted_type=false)