Open lucazanna opened 1 year ago
I'd very much like to add my voice in support of this, especially since reading hive-partitioned columns is not supported by the `scan_parquet` or `scan_csv` methods.
This might be a separate issue, but it seems related: Polars doesn't yet support predicate pushdown into datasets or streaming output; it eagerly calls `to_table()`
on them instead. Both DataFusion and DuckDB can query datasets lazily with predicate pushdown; it would be nice if Polars could too.
I did a write up about how some of this works and how deltalake integrates with this: https://docs.google.com/document/d/1XGg1pf9Nep9GHlSdvO65Ao1kyQ_Z_g55uyHuTYVyeT0/edit#
cc @chitralverma this might be interesting to you as well.
Thanks @wjones127. I believe some work was done to allow limit, predicate, and projection pushdowns to pyarrow datasets, though I'm not sure whether it is complete.
Problem description
I wish I could use the streaming and sink functionality when scanning a PyArrow dataset, in order to stream Parquet data from Azure.
Currently, streaming and sinking do not seem to work when scanning a PyArrow dataset.
Example: