pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Performance regression scanning parquet from S3 #12100

Open tikkanz opened 1 year ago

tikkanz commented 1 year ago

Reproducible example

import polars as pl

df = (
    pl.scan_parquet(
        "s3://my-bucket/my/keys/File_prefix_*_suffix.parquet",
        storage_options={"region": "ap-southeast-2"},
    )
    .filter(pl.col("is_big")).filter(pl.col("item_count") > 40)
    .select(pl.col("ID", "value", "value-accuracy", "item_count", "is_big", "is_wide"))
    .sort(by="ID")
)
print(df.explain(optimized=True))
df.profile()

0.19.11

SORT BY [col("ID")]
  FAST_PROJECT: [ID, value, value-accuracy, item_count, is_big, is_wide]

      Parquet SCAN 4 files: first file: s3://my-bucket/my/keys/File_prefix_big_wide_results.parquet
      PROJECT 6/23 COLUMNS
      SELECTION: [(col("is_big")) & ([(col("item_count")) > (40)])]

(shape: (15_808, 6)
 ...,
 shape: (4, 3)
 ┌───────────────────────────────────┬─────────┬─────────┐
 │ node                              ┆ start   ┆ end     │
 │ ---                               ┆ ---     ┆ ---     │
 │ str                               ┆ u64     ┆ u64     │
 ╞═══════════════════════════════════╪═════════╪═════════╡
 │ optimization                      ┆ 0       ┆ 12      │
 │ parquet(s3://my-bucket/my/keys/F… ┆ 12      ┆ 7410059 │
 │ FAST_PROJECT: [ID, value, value-… ┆ 7410064 ┆ 7410098 │
 │ sort(ID)                          ┆ 7410109 ┆ 7413900 │
 └───────────────────────────────────┴─────────┴─────────┘)

0.19.12

SORT BY [col("ID")]
  FAST_PROJECT: [ID, value, value-accuracy, item_count, is_big, is_wide]

      Parquet SCAN 4 files: first file: s3://my-bucket/my/keys/File_prefix_big_wide_results.parquet
      PROJECT 6/23 COLUMNS
      SELECTION: [([(col("item_count")) > (40)]) & (col("is_big"))]

(shape: (15_808, 6)
...,
 shape: (4, 3)
 ┌───────────────────────────────────┬──────────┬──────────┐
 │ node                              ┆ start    ┆ end      │
 │ ---                               ┆ ---      ┆ ---      │
 │ str                               ┆ u64      ┆ u64      │
 ╞═══════════════════════════════════╪══════════╪══════════╡
 │ optimization                      ┆ 0        ┆ 6        │
 │ parquet(s3://my-bucket/my/keys/F… ┆ 6        ┆ 35188017 │
 │ FAST_PROJECT: [ID, value, value-… ┆ 35188023 ┆ 35188045 │
 │ sort(ID)                          ┆ 35188053 ┆ 35189145 │
 └───────────────────────────────────┴──────────┴──────────┘)

Log output

No response

Issue description

Between 0.19.11 and 0.19.12, something changed in how parquet is read from S3. The optimized plan differs only slightly (the two predicates in SELECTION are swapped), but the time to execute the operation above increases from ~6 seconds to ~25 seconds.
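
For the two runs profiled above, the gap can be read directly off the parquet scan node in each profile table. The following is a minimal sketch, assuming the `start` and `end` columns of `profile()` are in microseconds; the helper name `node_seconds` is made up for illustration, and wall-clock figures from other runs may differ.

```python
# (start, end) of the parquet scan node, copied from the two profile tables.
# Assumption: LazyFrame.profile() reports these values in microseconds.
SCAN_NODE = {
    "0.19.11": (12, 7_410_059),
    "0.19.12": (6, 35_188_017),
}


def node_seconds(start_us, end_us):
    """Convert a (start, end) pair in microseconds to a duration in seconds."""
    return (end_us - start_us) / 1_000_000


durations = {version: node_seconds(*span) for version, span in SCAN_NODE.items()}
slowdown = durations["0.19.12"] / durations["0.19.11"]

for version, seconds in durations.items():
    print(f"{version}: parquet scan took {seconds:.2f} s")
print(f"slowdown: {slowdown:.2f}x")
```

In both profiles the scan node accounts for nearly all of the elapsed time; the projection and sort nodes take only a few milliseconds.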

Expected behavior

There should not be a large increase in the time to execute.

Installed versions

```
--------Version info---------
Polars:              0.19.12
Index type:          UInt32
Platform:            Linux-5.15.0-1043-aws-x86_64-with-glibc2.31
Python:              3.11.5 (main, Aug 25 2023, 13:19:53) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 1 year ago

Could you share the files/datasets?

tikkanz commented 1 year ago

For context the original 4 files are each ~34 million rows and ~1.4GB. I can't share them, but I can have a go at creating some synthetic data if needed.

ritchie46 commented 1 year ago

Please. The important things are the schema, the number of rows and row groups, and the compression method.

ritchie46 commented 1 year ago

@tikkanz can you try polars-0.19.13-rc1?

tikkanz commented 1 year ago

How do I go about testing it? Is there a wheel I can download from somewhere, or do I need to compile from source? If so, do I need to choose the make build-release compilation option to ensure a good comparison?

ritchie46 commented 1 year ago

pip install polars-0.19.13-rc1

tikkanz commented 1 year ago

I hope it is that simple, but when I try that I get the following:

(testvenv) prompt:~$ pip show polars
Name: polars
Version: 0.19.12
Summary: Blazingly fast DataFrame library
Home-page:
Author:
Author-email: Ritchie Vink <ritchie46@gmail.com>
License:
Location: /opt/markermodel/testvenv/lib/python3.11/site-packages
Requires:
Required-by:
(testvenv) prompt:~$ pip install polars-0.19.13-rc1
ERROR: Could not find a version that satisfies the requirement polars-0.19.13-rc1 (from versions: none)
ERROR: No matching distribution found for polars-0.19.13-rc1

What am I missing?

tikkanz commented 1 year ago

@ritchie46 yesterday, prior to your message above, I tried compiling from source and I think I managed to get things working. Running the command with the release candidate is definitely an improvement. Timings now look like:

import polars as pl

print(pl.__version__)
df = (
    pl.scan_parquet(
        "s3://my-bucket/my/keys/File_prefix_*_suffix.parquet",
        storage_options={"region": "ap-southeast-2"},
    )
    .filter(pl.col("is_big")).filter(pl.col("item_count") > 40)
    .select(pl.col("ID", "value", "value-accuracy", "item_count", "is_big", "is_wide"))
    .sort(by="ID")
)
print(df.explain(optimized=True))
df.profile()
0.19.13-rc.1
SORT BY [col("ID")]
  FAST_PROJECT: [ID, value, value-accuracy, item_count, is_big, is_wide]

      Parquet SCAN 4 files: first file: s3://my-bucket/my/keys/File_prefix_*_suffix.parquet
      PROJECT 6/23 COLUMNS
      SELECTION: [([(col("item_count")) > (40)]) & (col("is_big"))]
(shape: (15_808, 6)
 ...,
 shape: (4, 3)
 ┌───────────────────────────────────┬─────────┬─────────┐
 │ node                              ┆ start   ┆ end     │
 │ ---                               ┆ ---     ┆ ---     │
 │ str                               ┆ u64     ┆ u64     │
 ╞═══════════════════════════════════╪═════════╪═════════╡
 │ optimization                      ┆ 0       ┆ 5       │
 │ parquet(s3://my-bucket/my/keys/F… ┆ 5       ┆ 7374446 │
 │ FAST_PROJECT: [ID, value, value-… ┆ 7374452 ┆ 7374463 │
 │ sort(ID)                          ┆ 7374469 ┆ 7375459 │
 └───────────────────────────────────┴─────────┴─────────┘)