[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
import pandas as pd
import pyarrow.parquet as pq
# download yellow_tripdata_2024-01.parquet from below
# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
# https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet
df = pl.read_parquet("yellow_tripdata_2024-01.parquet")
df.write_parquet("yellow_tripdata_2024-01.polars.parquet", compression="zstd")
# reuse the DataFrame loaded above; only the writer backend changes
df.write_parquet("yellow_tripdata_2024-01.polars.pyarrow.parquet", compression="zstd", use_pyarrow=True)
pd.read_parquet("yellow_tripdata_2024-01.parquet").to_parquet("yellow_tripdata_2024-01.pandas.parquet", compression="zstd")
# pq.read_table returns a pyarrow Table, so use a separate name from the Polars DataFrame
table = pq.read_table("yellow_tripdata_2024-01.parquet")
pq.write_table(table, "yellow_tripdata_2024-01.pyarrow.parquet", compression="zstd")
Log output
No response
Issue description
NOTE: I admit this isn't strictly a bug, since the documentation makes no claim that write_parquet produces output equivalent to pyarrow's.
I noticed that Parquet files written with Polars were much larger (sometimes 60% larger) than those written with pandas. After exploring a little, I found that setting use_pyarrow=True produces file sizes similar to those from pandas and pyarrow, which is not surprising.
I chose the yellow cab taxi dataset to demonstrate. Here the Polars-written file is only ~25% larger, so the size of the overhead is not consistent across datasets.
I've also noticed a lot of variation in file size. For example, using Int32 instead of Int64 produces a larger file.
The directory listing after running the code follows. Note that yellow_tripdata_2024-01.polars.parquet is ~25% larger than the others.
49977253 May 14 17:33 yellow_tripdata_2024-01.pandas.parquet
49961641 May 14 17:24 yellow_tripdata_2024-01.parquet
63199943 May 14 17:40 yellow_tripdata_2024-01.polars.parquet
52190307 May 14 17:34 yellow_tripdata_2024-01.polars.pyarrow.parquet
49970699 May 14 17:38 yellow_tripdata_2024-01.pyarrow.parquet
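To make the comparison concrete, here is a small sketch that turns the byte counts from the listing above into overhead percentages relative to the original download. The helper name `overhead_pct` is mine, chosen for illustration; the sizes are copied from the listing.

```python
# Sizes in bytes, copied from the directory listing above.
sizes = {
    "yellow_tripdata_2024-01.parquet": 49961641,  # original download
    "yellow_tripdata_2024-01.pandas.parquet": 49977253,
    "yellow_tripdata_2024-01.polars.parquet": 63199943,
    "yellow_tripdata_2024-01.polars.pyarrow.parquet": 52190307,
    "yellow_tripdata_2024-01.pyarrow.parquet": 49970699,
}

# Use the original download as the baseline.
baseline = sizes["yellow_tripdata_2024-01.parquet"]


def overhead_pct(size: int, baseline: int) -> float:
    """Return how much larger `size` is than `baseline`, in percent."""
    return (size - baseline) / baseline * 100


for name, size in sizes.items():
    print(f"{name}: {overhead_pct(size, baseline):+.2f}%")
```

Measured this way, the native Polars writer comes out roughly 26% above the original file, the use_pyarrow=True path roughly 4.5% above, and the pandas and pyarrow outputs well under 0.1% above.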
Expected behavior
One might expect the resulting Parquet files to be similar in size regardless of which writer produces them.
Installed versions