pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

polars.DataFrame.write_parquet() produces a larger parquet file than use_pyarrow=True, pandas, or pyarrow #16238

Closed mesner closed 2 weeks ago

mesner commented 2 weeks ago

Reproducible example

import polars as pl
import pandas as pd
import pyarrow.parquet as pq

# download yellow_tripdata_2024-01.parquet from below
# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

# 1. Polars native writer
df = pl.read_parquet("yellow_tripdata_2024-01.parquet")
df.write_parquet("yellow_tripdata_2024-01.polars.parquet", compression="zstd")

# 2. Polars, delegating the write to pyarrow
df.write_parquet("yellow_tripdata_2024-01.polars.pyarrow.parquet", compression="zstd", use_pyarrow=True)

# 3. pandas (uses pyarrow as its parquet engine when available)
pd.read_parquet("yellow_tripdata_2024-01.parquet").to_parquet("yellow_tripdata_2024-01.pandas.parquet", compression="zstd")

# 4. pyarrow directly
table = pq.read_table("yellow_tripdata_2024-01.parquet")
pq.write_table(table, "yellow_tripdata_2024-01.pyarrow.parquet", compression="zstd")

Log output

No response

Issue description

NOTE: I admit that this isn't a bug as the documentation makes no claim of equivalence of write_parquet to pyarrow.

I noticed that parquet files written with polars were much larger (sometimes 60% larger) than those written with pandas. After some exploration, I found that setting use_pyarrow=True produces file sizes similar to those from pandas and pyarrow, which is not surprising.

To demonstrate, I chose the yellow cab taxi dataset, for which the polars file is only ~25% larger, so the size overhead varies between datasets.

I've noticed a lot of variation in file size. For example, using Int32 instead of Int64 produces a larger file.

The directory listing after running the code follows (sizes in bytes). Note that yellow_tripdata_2024-01.polars.parquet is ~25% larger than the others.

49977253 May 14 17:33 yellow_tripdata_2024-01.pandas.parquet
49961641 May 14 17:24 yellow_tripdata_2024-01.parquet
63199943 May 14 17:40 yellow_tripdata_2024-01.polars.parquet
52190307 May 14 17:34 yellow_tripdata_2024-01.polars.pyarrow.parquet
49970699 May 14 17:38 yellow_tripdata_2024-01.pyarrow.parquet
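
For reference, the relative overhead can be worked out directly from the byte counts in the listing above. This is only a small arithmetic sketch; the dictionary keys and the `overhead_pct` helper are names made up for illustration.

```python
# Sizes in bytes, copied from the directory listing above.
sizes = {
    "pandas": 49_977_253,
    "original": 49_961_641,
    "polars": 63_199_943,
    "polars+pyarrow": 52_190_307,
    "pyarrow": 49_970_699,
}


def overhead_pct(size: int, baseline: int) -> float:
    """Percentage by which `size` exceeds `baseline`."""
    return (size / baseline - 1.0) * 100.0


# Compare the native polars output against every other file.
for name, size in sizes.items():
    if name != "polars":
        print(f"polars vs {name}: +{overhead_pct(sizes['polars'], size):.1f}%")
```

The native polars file comes out roughly 26% larger than the pandas, pyarrow, and original files, and roughly 21% larger than the polars file written with use_pyarrow=True.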

Expected behavior

One might expect the resulting parquet files to be similar in size regardless of which library writes them.

Installed versions

```
--------Version info---------
Polars: 0.20.26
Index type: UInt32
Platform: Windows-10-10.0.22631-SP0
Python: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:47:18) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager: 0.7.0
cloudpickle: 3.0.0
connectorx: 0.3.2
deltalake: 0.14.0
fastexcel:
fsspec: 2023.4.0
gevent: 23.9.1
hvplot:
matplotlib: 3.8.0
nest_asyncio: 1.5.8
numpy: 1.24.1
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 14.0.1
pydantic: 2.4.2
pyiceberg: 0.5.0
pyxlsb:
sqlalchemy: 2.0.22
torch: 2.1.0+cu121
xlsx2csv: 0.8.1
xlsxwriter: 3.1.8
```

owenprough-sift commented 2 weeks ago

Probably related to the discussion in #15959. And possibly a duplicate of #10680?

mesner commented 2 weeks ago

Yes, likely. Sorry, I didn't see #10680