[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import pyarrow.parquet as pq
pl.DataFrame([
pl.Series('a',['a','b','c'], pl.Categorical)]).write_parquet('cat.parquet',row_group_size=1)
pq.ParquetFile('cat.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545f4680>
has_min_max: True
min: a
max: c # This should be a
null_count: 0
distinct_count: None
num_values: 1 # only 1 value, as expected
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
Log output
no stderr
Issue description
When writing a parquet file with a Categorical or Enum type the row group statistics are the overall stats not just what is in the row group. Pyarrow writes the row group statistics correctly
pl.DataFrame([
pl.Series('a',['a','b','c'], pl.Categorical)]).write_parquet('catpa.parquet',row_group_size=1, use_pyarrow=True)
pq.ParquetFile('catpa.parquet').metadata.row_group(0).column(0).statistics
<pyarrow._parquet.Statistics object at 0x7ff7545b0a90>
has_min_max: True
min: a
max: a # note a here
null_count: 0
distinct_count: None
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
Expected behavior
Polars's parquet writer should write the categorical/enum statistics according to the values not the overall categorical. Without this future predicate pushdowns won't be effective.
Checks
Reproducible example
Log output
Issue description
When writing a parquet file with a Categorical or Enum type the row group statistics are the overall stats not just what is in the row group. Pyarrow writes the row group statistics correctly
Expected behavior
Polars's parquet writer should write the categorical/enum statistics according to the values not the overall categorical. Without this future predicate pushdowns won't be effective.
Installed versions