pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Performance degradation using Rust-native parquet reader from AWS S3 for a dataframe with 12,000 columns. #18443

Open vstolin opened 2 weeks ago

vstolin commented 2 weeks ago

Reproducible example

```python
import boto3
import os
import numpy as np
import polars as pl
import pyarrow.dataset as ds
import pyarrow.fs as fs
import pyarrow.parquet as pq

# create and store dataframe on AWS S3
large_num_cols = 12_000
df = pl.DataFrame({f"col_{i}": np.random.normal(loc=0.0, scale=1.0, size=24_000) for i in range(large_num_cols)})
os.environ["AWS_PROFILE"] = "my_profile"
pyfs = fs.S3FileSystem(retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10))

df.write_parquet(
    file="parquet-writers-test/test_large_df.pq",
    compression="lz4",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with Pyarrow – around 6 seconds
df1 = pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
)

# read parquet with Rust-native – around 50 seconds
session = boto3.Session(profile_name="my_profile")
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}
source = "s3://parquet-writers-test/test_large_df.pq"
df2 = pl.read_parquet(source, storage_options=storage_options)

# read scan parquet dataset with Pyarrow – around 6 seconds
pyds = ds.dataset(
    source="parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
df3 = pl.scan_pyarrow_dataset(pyds).collect()

# scan parquet with Rust-native – around 100 seconds
df4 = pl.scan_parquet(source, storage_options=storage_options).collect()
```

Log output

No response

Issue description

We observed a significant slowdown when reading a Parquet file with 12,000 columns from an AWS S3 bucket using the Rust-native Parquet reader, compared to the native PyArrow implementation:

- read_parquet (with PyArrow filesystem) – around 6 seconds
- read_parquet (Rust-native) – around 50 seconds
- scan_pyarrow_dataset (with PyArrow filesystem) – around 6 seconds
- scan_parquet (Rust-native) – around 100 seconds
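
For reference, a minimal sketch of how these timings can be measured, reusing `pyfs` and `storage_options` from the reproducible example above (the `timed` helper is illustrative only):

```python
import time

import polars as pl

def timed(label, fn):
    # Print the wall-clock duration of fn(); helper is for illustration only.
    start = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s")
    return out

# `pyfs` and `storage_options` are defined in the reproducible example above.
timed("read_parquet (PyArrow)", lambda: pl.read_parquet(
    source="parquet-writers-test/test_large_df.pq",
    use_pyarrow=True,
    pyarrow_options={"filesystem": pyfs},
))
timed("scan_parquet (Rust-native)", lambda: pl.scan_parquet(
    "s3://parquet-writers-test/test_large_df.pq",
    storage_options=storage_options,
).collect())
```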

Expected behavior

This would be less of an issue if scan_parquet accepted a PyArrow filesystem, as read_parquet and scan_pyarrow_dataset do.

Installed versions

```
--------Version info---------
Polars:               1.5.0
Index type:           UInt32
Platform:             Linux-4.18.0-513.18.1.el8_9.x86_64-x86_64-with-glibc2.28
Python:               3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:          2.2.1
connectorx:
deltalake:            0.18.2
fastexcel:
fsspec:               2024.6.1
gevent:
great_tables:         0.10.0
hvplot:               0.10.0
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.24.4
openpyxl:             3.1.5
pandas:               2.1.4
pyarrow:              16.1.0
pydantic:             2.8.2
pyiceberg:
sqlalchemy:           2.0.32
torch:                2.3.1.post100
xlsx2csv:
xlsxwriter:
```
cmdlineluser commented 2 weeks ago

1.6 was just released, which contains a fix for:

It sounds like it could be the same issue you're describing.

ritchie46 commented 2 weeks ago

This, and we will improve it even more, as there are still a few places where we do linear work that can be O(1).
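
(For intuition only, an illustrative Python sketch of the generic linear-vs-O(1) pattern being referred to, not Polars' actual internals: resolving each of 12,000 column names with a linear scan is quadratic overall, while a precomputed name-to-index map makes each lookup constant time.)

```python
# Illustrative sketch, not Polars internals: with n = 12_000 columns,
# resolving every column via list.index() does O(n) work per name,
# O(n^2) overall; a dict built once makes each lookup O(1).
names = [f"col_{i}" for i in range(12_000)]

def indices_linear(wanted):
    return [names.index(n) for n in wanted]      # O(n) per lookup

name_to_idx = {n: i for i, n in enumerate(names)}

def indices_hashed(wanted):
    return [name_to_idx[n] for n in wanted]      # O(1) per lookup
```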

vstolin commented 2 weeks ago

Hi @cmdlineluser, thanks for pointing out the existing issue and the new Polars version. Hi @ritchie46, as always, I very much appreciate your commitment to addressing issues promptly and keeping Polars best in class - it's really great to be part of this community!

I upgraded to version 1.6 and definitely see the improvement:

- read_parquet (Rust-native) – 30 seconds in Polars 1.6.0 versus 50 seconds in 1.5.0
- scan_parquet (Rust-native) – 30 seconds in Polars 1.6.0 versus 100 seconds in 1.5.0

I'm definitely looking forward to further improvements to the Rust-native reader to bring it in line with PyArrow, which is still faster.

@ritchie46, are there plans to make scan_parquet accept an optional PyArrow filesystem, or is it a design decision to support only the Rust-native reader?

Thank you!

ritchie46 commented 2 weeks ago

We do not plan to accept a pyarrow filesystem in our native readers. We do support pyarrow datasets as scan functions.
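
For example, lazy scanning through a PyArrow filesystem is already possible via scan_pyarrow_dataset (a sketch reusing the filesystem setup from the reproduction above):

```python
import polars as pl
import pyarrow.dataset as ds
import pyarrow.fs as fs

# Same S3 filesystem as in the reproduction above.
pyfs = fs.S3FileSystem(retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10))

pyds = ds.dataset(
    source="parquet-writers-test/test_large_df.pq",
    filesystem=pyfs,
    format="parquet",
)
lf = pl.scan_pyarrow_dataset(pyds)           # LazyFrame backed by the pyarrow dataset
df = lf.select("col_0", "col_1").collect()   # projection is pushed down to pyarrow
```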

The performance of very wide Parquet files will improve further with @nameexhaustion's upcoming schema unification and metadata supertype work. This issue is on our roadmap.