Allow streaming when scanning a PyArrow dataset - scan from cloud data lake

lucazanna commented 1 year ago

Problem description

I wish I could use the streaming and sink functionalities when scanning a PyArrow dataset, in order to scan a Parquet from Azure.

The streaming and sink functionalities do not seem to work when scanning a PyArrow dataset.

Example:

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds

# Scan a Parquet and sink with Polars [all streaming in Polars] -> works
pl.scan_parquet('data1.parquet').sink_parquet('data2.parquet')

# Scan a Parquet and sink with Arrow [all streaming in Arrow] -> works
ds = pa.dataset.dataset('data1.parquet', format='parquet')
pa.dataset.write_dataset(ds.scanner().to_reader(), 'data2.parquet', format="parquet")

# Scan a Pyarrow dataset, then collect and write with Polars -> works
pl.scan_pyarrow_dataset(ds).collect().write_parquet('data2.parquet')

# Scan a Pyarrow dataset, the sink with Polars -> does not work
pl.scan_pyarrow_dataset(ds).sink_parquet('data2.parquet')
# Error: 'sink_parquet not yet supported in standard engine. Use 'collect().write_parquet()'

ddutt commented 1 year ago

I'd very much like to add my voice in supporting this especially since adding hive-partitioned columns is not supported in scan_parquet or scan_csv methods

wjones127 commented 1 year ago

This might be a separate issue but seems related: Polars doesn't yet support predicate pushdown into datasets nor streaming output. It eagerly calls to_table() on them instead. Both DataFusion and DuckDB can query datasets lazily with predicate pushdown; it would be nice if Polars did too.

I did a write up about how some of this works and how deltalake integrates with this: https://docs.google.com/document/d/1XGg1pf9Nep9GHlSdvO65Ao1kyQ_Z_g55uyHuTYVyeT0/edit#

cc @chitralverma this might be interesting to you as well.

chitralverma commented 1 year ago

Thanks @wjones127 . I believe some work was done to allow limit, predicate and projection push downs to the pyarrow datasets. Not sure if this is complete though.

See https://github.com/pola-rs/polars/pull/6734

pola-rs / polars

Allow streaming when scanning a PyArrow dataset - scan from cloud data lake #7750

Problem description