Open vstolin opened 2 weeks ago
1.6 was just released which contains a fix for:
It sounds like it could be the same issue you're describing.
This. And we will improve it even more, as there are still a few places where we are linear when we could be O(1).
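To illustrate why linear lookups hurt on very wide files, here is a minimal sketch using only the Python standard library. It is not Polars' internal code. The names `column_names`, `resolve_linear` and `resolve_hashed` are hypothetical and exist only for this illustration. Resolving every column by scanning a list costs O(n) per lookup, so the total is quadratic in the number of columns, while a hash map built once gives O(1) per lookup.

```python
import timeit

# Hypothetical wide schema: 12,000 column names, matching the width discussed in this issue.
column_names = [f"col_{i}" for i in range(12_000)]


def resolve_linear(names, wanted):
    """Find each wanted column with list.index: O(n) per lookup."""
    return [names.index(name) for name in wanted]


def resolve_hashed(names, wanted):
    """Build a name -> position map once, then do O(1) lookups."""
    position = {name: i for i, name in enumerate(names)}
    return [position[name] for name in wanted]


# Time a full resolution of all columns with each strategy (one run each).
linear_s = timeit.timeit(lambda: resolve_linear(column_names, column_names), number=1)
hashed_s = timeit.timeit(lambda: resolve_hashed(column_names, column_names), number=1)
print(f"linear: {linear_s:.3f}s, hashed: {hashed_s:.3f}s")
```

The absolute timings depend on the machine, but the gap between the two grows with the column count, which is why wide schemas expose this kind of cost.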
Hi @cmdlineluser, thanks for pointing out the existing issue and the new Polars version. Hi @ritchie46, as always, I very much appreciate your commitment to addressing issues promptly and keeping Polars best in class. It's really great to be part of this community!
I upgraded to version 1.6 and definitely see the improvement:
- read_parquet (with Rust-native): 30 seconds in Polars 1.6.0 versus 50 seconds in version 1.5.0
- scan_parquet (with Rust-native): 30 seconds in Polars 1.6.0 versus 100 seconds in version 1.5.0
I'm definitely looking forward to further improvements to the Rust-native reader to bring it in line with Pyarrow, which is still faster.
@ritchie46 are there plans for scan_parquet to accept an optional Pyarrow filesystem, or is it a design decision to support Rust-native only?
Thank you!
We do not plan to accept a pyarrow filesystem in our native readers. We do support pyarrow datasets as scan functions.
The performance of very wide parquet files will further improve with @nameexhaustion's upcoming schema unification and metadata `supertype`. This issue is on our roadmap.
Checks
Reproducible example
Log output
No response
Issue description
We observed a significant slowdown when reading a parquet file with 12,000 columns from an AWS S3 bucket using the Rust-native parquet reader, compared with the Pyarrow-based implementation:
- read_parquet (with Pyarrow filesystem): around 6 seconds
- read_parquet (with Rust-native): around 50 seconds
- scan_pyarrow_dataset (with Pyarrow filesystem): around 6 seconds
- scan_parquet (with Rust-native): 100 seconds
Expected behavior
This would be less of an issue if scan_parquet could use a Pyarrow filesystem, as read_parquet and scan_pyarrow_dataset can.
Installed versions