ion-elgreco opened this issue 5 months ago
I tried to reproduce with 100M rows, but after 2 minutes of generating the df I tapped out and did it again with just 10M. With 10M rows I got 2.9 s to save with polars and 3.0 s with pyarrow.
By using "zstd" as compression method i got this (with 10m rows) 4.85 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 8.6 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each) Where i set use_pyarrow=True for the first part
I'm also running into this. Writing a dataset of 50 million rows to disk (about 2 GB on disk) takes 1 minute with use_pyarrow=True and 6 minutes without the flag.
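Until the native writer catches up, one workaround in the same spirit as use_pyarrow=True (which hands the data to pyarrow's parquet writer) is to go through pyarrow explicitly. A sketch, assuming df is the polars DataFrame in question:

```python
import pyarrow.parquet as pq

# Convert the polars frame to an Arrow table (mostly zero-copy for common dtypes)
# and write it with pyarrow's parquet writer directly.
table = df.to_arrow()
pq.write_table(table, "out_pyarrow.parquet", compression="snappy")
```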
Checks
Reproducible example
df.write_parquet("test.parquet", compression='snappy')
takes 92 secondsdf.write_parquet("test2.parquet", compression='snappy', use_pyarrow=True)
takes 55 seconds.Log output
No response
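To make the reproducible example above self-contained, here is a sketch using a synthetic 50M-row stand-in DataFrame (the original data isn't shown in the issue, so sizes and timings are only indicative):

```python
import os
import time

import numpy as np
import polars as pl

# Synthetic stand-in for the original frame (not shown in the issue).
n = 50_000_000
df = pl.DataFrame({
    "id": np.arange(n),
    "value": np.random.rand(n),
    "group": np.random.randint(0, 100, n),
})

for path, use_pyarrow in [("test.parquet", False), ("test2.parquet", True)]:
    start = time.perf_counter()
    df.write_parquet(path, compression="snappy", use_pyarrow=use_pyarrow)
    elapsed = time.perf_counter() - start
    size_gb = os.path.getsize(path) / 1e9
    print(f"{path}: use_pyarrow={use_pyarrow}, {elapsed:.1f} s, {size_gb:.2f} GB on disk")
```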
Issue description
At work we saw one of our pipelines taking around 50 minutes to write a parquet file. The difference was huge compared to pyarrow, which took only about a minute and a half; see the logs below:
With polars (50 minutes):
With pyarrow (1.5 minutes):
Expected behavior
Write fast, like pyarrow does.
Installed versions