pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

No statistics in Parquet file after setting statistics=True #13642

Closed braaannigan closed 10 months ago

braaannigan commented 10 months ago

Reproducible example

import polars as pl
# Create a df
filename_stats = "stats2.pq"
df = pl.DataFrame({"a":list(range(1_000_000))})
# Write to Parquet with statistics
df.write_parquet(filename_stats,statistics=True)
# Inspect the Parquet metadata and print it out
import pyarrow.parquet as pq

# Open the Parquet file written above
parquet_file = pq.ParquetFile(filename_stats)

# Print general file metadata (optional)
print("File Metadata:\n", parquet_file.metadata)

# Loop through each rowgroup
for i in range(parquet_file.num_row_groups):
    row_group = parquet_file.metadata.row_group(i)

    # Print rowgroup metadata
    print(f"\nRow Group {i} Metadata:")
    print(row_group)

    # Access statistics for each column in the rowgroup
    for j in range(row_group.num_columns):
        column = row_group.column(j)
        print(f"Column {j}: {column.path_in_schema}")
        print("  Statistics:")
        print("    Min:", column.statistics.min)
        print("    Max:", column.statistics.max)
        # Add additional statistics as needed

Log output

File Metadata:
 <pyarrow._parquet.FileMetaData object at 0x11bc2aa20>
  created_by: Polars
  num_columns: 1
  num_rows: 1000000
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 512

Row Group 0 Metadata:
<pyarrow._parquet.RowGroupMetaData object at 0x11bbfbd80>
  num_columns: 1
  num_rows: 333333
  total_byte_size: 2708509
Column 0: a
  Statistics:
    Min: None
    Max: None

Row Group 1 Metadata:
<pyarrow._parquet.RowGroupMetaData object at 0x117ac0a90>
  num_columns: 1
  num_rows: 333333
  total_byte_size: 2708509
Column 0: a
  Statistics:
    Min: None
    Max: None

Row Group 2 Metadata:
<pyarrow._parquet.RowGroupMetaData object at 0x11bbfbd80>
  num_columns: 1
  num_rows: 333334
  total_byte_size: 2708517
Column 0: a
  Statistics:
    Min: None
    Max: None

Issue description

No statistics have been written for any row groups in the file

Expected behavior

Statistics for column a in each row group

Installed versions

```
--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             macOS-14.2.1-x86_64-i386-64bit
Python:               3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:54) [Clang 13.0.1 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.0.0
connectorx:           0.3.1
deltalake:            0.9.0
fsspec:               2023.5.0
gevent:
hvplot:               0.9.1
matplotlib:           3.7.1
numpy:                1.24.3
openpyxl:
pandas:               2.0.1
pyarrow:              13.0.0
pydantic:             1.10.7
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.12
xlsx2csv:             0.8.1
xlsxwriter:           3.0.3
```

braaannigan commented 10 months ago

Update: I ran this with 0.19.3, 0.18.3, 0.17.3 and 0.16.3, but only 0.16.3 produced statistics. That version used the pyarrow Parquet writer (note the "created_by: parquet-cpp-arrow version 13.0.0"). Here's the output:

File Metadata:
 <pyarrow._parquet.FileMetaData object at 0x1084bb150>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 1
  num_rows: 1000000
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 392

Row Group 0 Metadata:
<pyarrow._parquet.RowGroupMetaData object at 0x10a2cbc40>
  num_columns: 1
  num_rows: 1000000
  total_byte_size: 8279417
Column 0: a
  Statistics:
    Min: 0
    Max: 999999

ritchie46 commented 10 months ago

There are statistics. Pyarrow shows a deprecated version of parquet statistics.
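To tell "no statistics were written" apart from "statistics exist but pyarrow does not expose min/max", it may help to check `is_stats_set` on the column chunk and `has_min_max` on the statistics object before reading `min`/`max`. The sketch below is only an illustration: `describe_stats` and `describe_file` are hypothetical helper names, and the demo objects are plain stand-ins rather than real pyarrow types. The log above printed `Min: None` rather than failing, which suggests the second case.

```python
from types import SimpleNamespace


def describe_stats(column):
    """Classify the statistics state of one column chunk.

    `column` is expected to look like pyarrow's ColumnChunkMetaData:
    it needs `is_stats_set` and `statistics` (with `has_min_max`, `min`, `max`).
    """
    if not column.is_stats_set:
        return "no statistics in footer"
    stats = column.statistics
    if stats is None or not stats.has_min_max:
        return "statistics present, min/max not exposed"
    return f"min={stats.min!r}, max={stats.max!r}"


def describe_file(path):
    """Yield (row_group_index, column_path, description) for a Parquet file."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    metadata = pq.ParquetFile(path).metadata
    for i in range(metadata.num_row_groups):
        row_group = metadata.row_group(i)
        for j in range(row_group.num_columns):
            column = row_group.column(j)
            yield i, column.path_in_schema, describe_stats(column)


# Stand-in objects (not real pyarrow types) covering the three cases.
missing = SimpleNamespace(is_stats_set=False, statistics=None)
hidden = SimpleNamespace(
    is_stats_set=True,
    statistics=SimpleNamespace(has_min_max=False, min=None, max=None),
)
full = SimpleNamespace(
    is_stats_set=True,
    statistics=SimpleNamespace(has_min_max=True, min=0, max=999999),
)

for name, column in [("missing", missing), ("hidden", hidden), ("full", full)]:
    print(name, "->", describe_stats(column))
```

With pyarrow installed, `for row in describe_file("stats2.pq"): print(row)` would run the same check against the file from the reproducible example.
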

deanm0000 commented 10 months ago

I brought this up before, here.

See this quote. Basically, the reason pyarrow doesn't use the stats as written by polars is that polars doesn't write the column_orders metadata, which is part of the Parquet spec. The spec says that if it is missing, readers shouldn't attempt to use the stats.

Here are code comments saying the same

This is a bug report to pyarrow calling them out for using the deprecated stats and me getting schooled (as the kids say)

In the absence of anyone else bringing this up or noticing, I've just been using pyarrow writer.
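The reader-side rule described in this comment can be sketched roughly as follows. This is a simplified model and not pyarrow's actual code: `ChunkStats` and `usable_min_max` are hypothetical names, only the `min_value`/`max_value` fields are modelled, and the handling of the deprecated `min`/`max` fields is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ChunkStats:
    # Hypothetical stand-in for the per-column-chunk statistics in the footer.
    # Only the non-deprecated fields are modelled here.
    min_value: Optional[bytes] = None
    max_value: Optional[bytes] = None


def usable_min_max(
    stats: ChunkStats, footer_has_column_orders: bool
) -> Optional[Tuple[bytes, bytes]]:
    """Return the (min, max) pair a cautious reader would trust, or None."""
    if stats.min_value is None or stats.max_value is None:
        return None  # nothing was written for this chunk
    if not footer_has_column_orders:
        # Without column_orders the sort order behind min_value/max_value is
        # not defined, so this sketch ignores them (assumed reader behaviour).
        return None
    return stats.min_value, stats.max_value


stats = ChunkStats(min_value=b"a", max_value=b"z")
print(usable_min_max(stats, footer_has_column_orders=False))
print(usable_min_max(stats, footer_has_column_orders=True))
```

Under this model, a file whose footer carries min/max values but no column_orders looks to the reader the same as a file with no usable statistics, which matches the `Min: None` / `Max: None` output reported above.
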

deanm0000 commented 10 months ago

@ritchie46

I might be on a wild goose chase here but in the parquet-tools dump of the pyarrow saved file, the column_orders shows up right after the created_by and it doesn't show up at all in the polars saved parquet. I decided to poke around polars-parquet and I found these lines in the polars parquet/metadata

https://github.com/pola-rs/polars/blob/b27fe9459a4b11e9dd267adba34c2a89b298306b/crates/polars-parquet/src/parquet/metadata/file_metadata.rs#L110-L129

So polars knows about column_orders but over here

https://github.com/pola-rs/polars/blob/b27fe9459a4b11e9dd267adba34c2a89b298306b/crates/polars-parquet/src/parquet/write/file.rs#L42-L55

and here

https://github.com/pola-rs/polars/blob/b27fe9459a4b11e9dd267adba34c2a89b298306b/crates/polars-parquet/src/parquet/write/file.rs#L94-L113

Should column_orders be after the created_by in those blocks?

Anyway I'm way over my skis here so hopefully you can take another look.

ritchie46 commented 10 months ago

Hmm.. Ok, you might be onto something @deanm0000. Also learning about parquet as we go. ;)