CangyuanLi opened 2 weeks ago
I think this is related to #13092, where `sink_parquet` creates a large number of small row groups. Looking at the good file, the metadata is

```
<pyarrow._parquet.FileMetaData object at 0x7f80fc3c1cb0>
  created_by: Polars
  num_columns: 3002
  num_rows: 3759545
  num_row_groups: 14
  format_version: 2.6
  serialized_size: 3470050
```
After I run

```python
pl.scan_parquet(f).sink_parquet("test.parquet")
```

the metadata becomes

```
<pyarrow._parquet.FileMetaData object at 0x7f02ec2f2e30>
  created_by: Polars
  num_columns: 3002
  num_rows: 3759545
  num_row_groups: 3752
  format_version: 2.6
  serialized_size: 1048446714
```

and I see the poor performance and memory usage pop up.
Checks
Reproducible example
Unfortunately, the data cannot be shared. However, the file is fairly large at 3,604,869 rows and 3,508 columns, consisting of a mix of Int64 and String columns. When trying to view the columns of the LazyFrame as below, I get:
Log output
No response
Issue description
Curiously, on another similarly-sized parquet file which is 3,759,545 rows and 3,002 columns, I get
and negligible (maybe 100-200 MB) memory usage, which is more in line with what I was expecting. As far as I know, the parquet file with poor performance was written out using `sink_parquet` with default options, so I don't think there is any difference in compression ratio or algorithm compared to the "normal" parquet file.
I also decided to compare getting the first 500 rows with DuckDB, since I am not sure if getting the columns is a 1:1 operation (I believe DuckDB reads the parquet metadata; I am not entirely sure if Polars does this).
yields
while
yields
Expected behavior
I would expect running `.columns` on a LazyFrame to return almost instantly.

Installed versions