tikkanz opened this issue 1 year ago
Could you share the files/datasets?
For context the original 4 files are each ~34 million rows and ~1.4GB. I can't share them, but I can have a go at creating some synthetic data if needed.
Please. The important things are the schema, the number of rows and row groups, and the compression method.
@tikkanz can you try polars-0.19.13-rc1?
How do I go about testing it?
Is there a wheel I can download from somewhere, or do I need to compile from source? If so, do I need to choose the make build-release compilation option to ensure a good comparison?
pip install polars-0.19.13-rc1
I hope it is that simple, but when I try that I get the following:
(testvenv) prompt:~$ pip show polars
Name: polars
Version: 0.19.12
Summary: Blazingly fast DataFrame library
Home-page:
Author:
Author-email: Ritchie Vink <ritchie46@gmail.com>
License:
Location: /opt/markermodel/testvenv/lib/python3.11/site-packages
Requires:
Required-by:
(testvenv) prompt:~$ pip install polars-0.19.13-rc1
ERROR: Could not find a version that satisfies the requirement polars-0.19.13-rc1 (from versions: none)
ERROR: No matching distribution found for polars-0.19.13-rc1
What am I missing?
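For reference, pip does not treat "polars-0.19.13-rc1" as a version request; under PEP 440 the release candidate is part of the version specifier, and pip skips pre-releases unless asked for them explicitly. Assuming the rc was actually published to PyPI, the expected forms would be:

```shell
# Pin the exact release candidate (PEP 440 spells it "rc1", no dashes)
pip install polars==0.19.13rc1

# Or opt in to whatever pre-release is newest
pip install --pre polars
```

Without `==` or `--pre`, pip resolves only final releases, which is why the bare package-name form above reports "no matching distribution".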
@ritchie46 yesterday, prior to your message above, I tried compiling from source and I think I managed to get things working. Running the command with the release candidate is definitely an improvement. Timings now look like:
import polars as pl

print(pl.__version__)
df = (
    pl.scan_parquet(
        "s3://my-bucket/my/keys/File_prefix_*_suffix.parquet",
        storage_options={"region": "ap-southeast-2"},
    )
    .filter(pl.col("is_big"))
    .filter(pl.col("item_count") > 40)
    .select(pl.col("ID", "value", "value-accuracy", "item_count", "is_big", "is_wide"))
    .sort(by="ID")
)
print(df.explain(optimized=True))
df.profile()
0.19.13-rc.1
SORT BY [col("ID")]
FAST_PROJECT: [ID, value, value-accuracy, item_count, is_big, is_wide]
Parquet SCAN 4 files: first file: s3://my-bucket/my/keys/File_prefix_*_suffix.parquet
PROJECT 6/23 COLUMNS
SELECTION: [([(col("item_count")) > (40)]) & (col("is_big"))]
(shape: (15_808, 6)
...,
shape: (4, 3)
┌───────────────────────────────────┬─────────┬─────────┐
│ node ┆ start ┆ end │
│ --- ┆ --- ┆ --- │
│ str ┆ u64 ┆ u64 │
╞═══════════════════════════════════╪═════════╪═════════╡
│ optimization ┆ 0 ┆ 5 │
│ parquet(s3://my-bucket/my/keys/F… ┆ 5 ┆ 7374446 │
│ FAST_PROJECT: [ID, value, value-… ┆ 7374452 ┆ 7374463 │
│ sort(ID)                          ┆ 7374469 ┆ 7375459 │
└───────────────────────────────────┴─────────┴─────────┘)
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
0.19.11
0.19.12
Log output
No response
Issue description
Between 0.19.11 and 0.19.12 there were some changes to reading Parquet from S3. These changes alter the optimization plan slightly and increase the execution time of the operation above from ~6 seconds to ~25 seconds.
Expected behavior
There should not be a large increase in execution time between minor versions.
Installed versions