pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Polars parquet writer much slower than pyarrow parquet writer #15455

Open ion-elgreco opened 5 months ago

ion-elgreco commented 5 months ago

Reproducible example

```python
import numpy as np
import polars as pl

df = pl.DataFrame({
    "foo": np.random.randn(1, 100000000).reshape((100000000,)),
    "foo1": np.random.randn(1, 100000000).reshape((100000000,)),
    "foo2": np.random.randn(1, 100000000).reshape((100000000,)),
    "foo3": np.random.randn(1, 100000000).reshape((100000000,)),
})
df = df.with_columns(
    pl.col("foo").cast(pl.Utf8).alias("foo_str"),
    pl.col("foo").cast(pl.Utf8).alias("foo_str2"),
)
```

df.write_parquet("test.parquet", compression='snappy') takes 92 seconds

df.write_parquet("test2.parquet", compression='snappy', use_pyarrow=True) takes 55 seconds.

Log output

No response

Issue description

At work we saw one of our pipelines take around 50 minutes to write a parquet file. The difference was huge compared to pyarrow, which took only about a minute; see the timings below:

With polars: ~50 minutes [screenshot]

With pyarrow: ~1.5 minutes [screenshot]

Expected behavior

Write fast, like pyarrow does.

Installed versions

``` 0.20.10 ```
deanm0000 commented 5 months ago

I tried to reproduce with the 100m rows, but after 2 minutes of generating the DataFrame I tapped out and tried again with just 10m. With 10m rows, I got 2.9 s to save with polars and 3.0 s with pyarrow.
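
For anyone else trying the smaller case, a sketch of a reduced 10M-row reproduction with the same schema as the original example (the dict-comprehension construction and the `N` constant are just shorthand, not from the issue):

```python
# 10M-row variant of the original reproduction, to keep DataFrame generation fast.
import numpy as np
import polars as pl

N = 10_000_000  # assumption: the "10m" rows mentioned above
df = pl.DataFrame({name: np.random.randn(N) for name in ("foo", "foo1", "foo2", "foo3")})
df = df.with_columns(
    pl.col("foo").cast(pl.Utf8).alias("foo_str"),
    pl.col("foo").cast(pl.Utf8).alias("foo_str2"),
)
```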

Chuck321123 commented 5 months ago

By using "zstd" as compression method i got this (with 10m rows) 4.85 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 8.6 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each) Where i set use_pyarrow=True for the first part

RmStorm commented 3 months ago

I'm also running into this. Writing a dataset of 50 million rows (about 2 GB on disk) takes 1 minute with use_pyarrow=True and 6 minutes without the flag.