pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

unexpected performance / crashes with arrow+remote parquet #9448

Open cboettig opened 1 year ago

cboettig commented 1 year ago


Issue description

I have a large-ish partitioned parquet dataset that I frequently access and subset using arrow. I understand that with Polars scan_pyarrow_dataset I should be able to leverage polars syntax here, but instead an operation that takes seconds on other systems just seems to stall out or crash here.

Reproducible example

import polars as pl
import pyarrow.dataset as ds
uri = 's3://bio230014-bucket01/neon4cast-drivers/noaa/gefs-v12/stage1/reference_datetime=2020-09-24?endpoint_override=https://sdsc.osn.xsede.org'

dataset = ds.dataset(uri)
df = pl.scan_pyarrow_dataset(dataset)
df.filter(pl.col("variable")=="TMP").collect()

Expected behavior

Here's what I think is very analogous R code that does the same operation using arrow+dplyr on the same data:

library(arrow)
library(dplyr)
uri <- 's3://bio230014-bucket01/neon4cast-drivers/noaa/gefs-v12/stage1/reference_datetime=2020-09-24?endpoint_override=https://sdsc.osn.xsede.org'

ex <- arrow::open_dataset(uri)
tmp <- ex |>
  filter(variable == "TMP") |>
  collect()

On my machine this takes roughly 12 seconds.

Installed versions

```
--------Version info---------
Polars: 0.18.3
Index type: UInt32
Platform: Linux-5.17.15-76051715-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0]
----Optional dependencies----
numpy: 1.24.3
pandas: 1.5.3
pyarrow: 12.0.0
connectorx:
deltalake:
fsspec: 2023.5.0
matplotlib: 3.7.1
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 1 year ago

How does this query run on the pyarrow dataset alone, if you pass that filter directly? If that also OOMs, the problem is not on our side.

cboettig commented 1 year ago

@ritchie46 Thanks for the quick reply. Just a note that the above is using a public bucket, if you have a python terminal handy I think the behavior should be reproducible. At the moment I can't get it to crash, it just seems to hang.

If I omit the filter I see the same behavior -- the command seems to just be stalled on the collect() call, nothing is happening -- no CPU use, no network transfer.

No OOM errors are involved when reading the entire dataset either. If I avoid any use of polars and just use the pyarrow dataset.to_table() method, things run as expected. Please let me know if I can provide any more information.

Thanks for all you do, polars is amazing.

ritchie46 commented 1 year ago

I see what happens. We pickle the pyarrow dataset, but somehow when we call pickle.loads on it nothing happens. This seems to be an error in pyarrow's pickle implementation.

ritchie46 commented 1 year ago

I refactored out the pickle, but it is still very slow on the pyarrow side.

_filter = pa.compute.field('variable') == 'TMP'
from_arrow(ds.to_table(columns=with_columns, filter=_filter))  # type: ignore[return-value]

never finishes on my side.