Open cboettig opened 1 year ago
How does this query run on pyarrow datasets if you pass that filter? If that also OOMs, it is not on our side.
@ritchie46 Thanks for the quick reply. Just a note that the above is using a public bucket, if you have a python terminal handy I think the behavior should be reproducible. At the moment I can't get it to crash, it just seems to hang.
If I omit the filter I see the same behavior -- the command seems to just be stalled on the `collect()` call, nothing is happening -- no CPU use, no network transfer.
No OOM errors are involved when reading the entire dataset either. If I avoid any use of polars and just use the pyarrow `dataset.to_table()` method, things run as expected. Please let me know if I can provide any more information.
Thanks for all you do, polars is amazing.
I see what happens. We pickle the pyarrow dataset, but somehow when we call `pickle.loads(dataset)` nothing happens. This seems to be an error in pyarrow's pickle implementation.
I refactored out the pickle, but still it is very slow on the pyarrow side.
_filter = pa.compute.field('variable') == 'TMP'
from_arrow(ds.to_table(columns=with_columns, filter=_filter))  # type: ignore[return-value]
never finishes on my side.
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
I have a large-ish partitioned parquet dataset that I frequently access and subset using arrow. I understand that with Polars' `scan_pyarrow_dataset` I should be able to leverage polars syntax here, but instead an operation that takes seconds on other systems just seems to stall out or crash here.
Reproducible example
Expected behavior
Here's what I think is very analogous R code that does the same operation using arrow+dplyr on the same data; on my machine this takes ~12 sec or so.
Installed versions