braaannigan closed this issue 10 months ago
Update: I ran with 0.19.3, 0.18.3, 0.17.3, and 0.16.3, but only 0.16.3 produced statistics. This used the pyarrow Parquet writer (note the "created_by: parquet-cpp-arrow version 13.0.0"). Here's the output:
File Metadata:
<pyarrow._parquet.FileMetaData object at 0x1084bb150>
created_by: parquet-cpp-arrow version 13.0.0
num_columns: 1
num_rows: 1000000
num_row_groups: 1
format_version: 2.6
serialized_size: 392
Row Group 0 Metadata:
<pyarrow._parquet.RowGroupMetaData object at 0x10a2cbc40>
num_columns: 1
num_rows: 1000000
total_byte_size: 8279417
Column 0: a
Statistics:
Min: 0
Max: 999999
There are statistics. Pyarrow shows a deprecated version of parquet statistics.
I brought this up before, here.
See this quote. Basically, the reason pyarrow doesn't use the stats as written by polars is that polars doesn't write the column_orders metadata, which is part of the parquet spec. The spec says that if it is missing, readers shouldn't attempt to use the stats.
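To make that rule concrete, here is a simplified model of the decision a reader makes. This is not pyarrow's or polars' actual code: `ColumnStats` and `usable_stats` are hypothetical names, and the only thing modelled is the rule described above, that min/max statistics are ignored when the file metadata carries no column_orders entry for the column.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ColumnStats:
    """Hypothetical stand-in for a column chunk's min/max statistics."""
    min_value: Optional[int]
    max_value: Optional[int]


def usable_stats(
    column_orders: Optional[Sequence[str]],
    column_index: int,
    stats: Optional[ColumnStats],
) -> Optional[ColumnStats]:
    """Return stats only if the file declares a sort order for this column."""
    if column_orders is None:
        # No column_orders in the footer: the ordering behind min/max is
        # undefined, so the reader behaves as if no statistics were written.
        return None
    if column_index >= len(column_orders):
        return None
    return stats


stats = ColumnStats(min_value=0, max_value=999999)

# Footer without column_orders: statistics are present but ignored.
print(usable_stats(None, 0, stats))

# Footer with a type-defined order for column 0: statistics are used.
print(usable_stats(["TYPE_ORDER"], 0, stats))
```

Under this model the statistics can be physically present in the file and still be reported as missing, which matches the symptom in this issue.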
Here are code comments saying the same
This is a bug report to pyarrow calling them out for using the deprecated stats and me getting schooled (as the kids say)
In the absence of anyone else bringing this up or noticing, I've just been using pyarrow writer.
@ritchie46
I might be on a wild goose chase here, but in the parquet-tools dump of the pyarrow-saved file, column_orders shows up right after created_by, and it doesn't show up at all in the polars-saved parquet file. I decided to poke around polars-parquet and found these lines in the polars parquet/metadata
So polars knows about column_orders but over here
and here
Should column_orders be after the created_by in those blocks?
Anyway I'm way over my skis here so hopefully you can take another look.
Hmm.. Ok, you might be onto something @deanm0000. Also learning about parquet as we go. ;)
Issue description
No statistics have been written for any row groups in the file
Expected behavior
Statistics for column a in each row group