pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.41k stars 1.97k forks source link

Parquet writer statistics for categorical/enum values have overall min/max instead of row_group min/max #18867

Open deanm0000 opened 1 month ago

deanm0000 commented 1 month ago

Checks

Reproducible example

import polars as pl
import pyarrow.parquet as pq

pl.DataFrame([
    pl.Series('a',['a','b','c'], pl.Categorical)]).write_parquet('cat.parquet',row_group_size=1)
pq.ParquetFile('cat.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545f4680>
  has_min_max: True
  min: a
  max: c # This should be a
  null_count: 0
  distinct_count: None
  num_values: 1 # only 1 value, as expected
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Log output

no stderr

Issue description

When writing a parquet file with a Categorical or Enum type the row group statistics are the overall stats not just what is in the row group. Pyarrow writes the row group statistics correctly

pl.DataFrame([
    pl.Series('a',['a','b','c'], pl.Categorical)]).write_parquet('catpa.parquet',row_group_size=1, use_pyarrow=True)
pq.ParquetFile('catpa.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545b0a90>
  has_min_max: True
  min: a
  max: a # note a here
  null_count: 0
  distinct_count: None
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

Expected behavior

Polars's parquet writer should write the categorical/enum statistics according to the values not the overall categorical. Without this future predicate pushdowns won't be effective.

Installed versions

``` has_min_max: True min: a max: a null_count: 0 distinct_count: None num_values: 1 physical_type: BYTE_ARRAY logical_type: String converted_type (legacy): UTF8 ```
kszlim commented 1 month ago

Would be interesting to see if the Parquet PageIndex is being written appropriately too.