pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.87k stars 1.92k forks source link

Allow custom lazy scanning from python #4351

Open OneRaynyDay opened 2 years ago

OneRaynyDay commented 2 years ago

Describe your feature request

My understanding of the polars query planner is modest at best, so please correct me if I'm (ab)using the terms used below or if I have a conceptual misunderstanding:

pl.scan_parquet is great for creating a pl.LazyFrame from a large collection of parquet files and push down predicates/filters/projections/etc. Polars also support a fairly wide selection of reads/scans from other data sources. Is it possible to allow someone to pass in a python function that takes in predicates/filters/projections/etc and returns a pl.LazyFrame? We would consider it a part of a query plan and is essentially a backdoor to augmenting the query with extra optimizations, extra data source reads, extra asserts, etc.

If I'm not mistaken, this particular PythonScanExec actually allows you to call an arbitrary function PyBytes::new(py, &self.options.scan_fn) and get the output. It seems like it only takes in the columns but not the predicates/filters/projections/etc, and I'm not sure if this interface is exposed in python. Would we be able to build on top of this?

ritchie46 commented 2 years ago

Yes, you can use that node. This file is an example. There we create a scan_ds and a scan_parquet_fsspec function.

We push projections to the node and pass that as a with_columns: list[str] argument. The predicates are pushed down to after that node. I am not really sure how we can pass a predicate to the function itself and what benefit does it have.

What I mean by that is that predicates internally are represented as a virtual function that produces a boolean mask. Not something that's usable on the python side. And if we pass the predicate as expression then you have to apply that predicate in the function itself and that does not seem to have much benefit to us applying that predicate when that function is finished.

universalmind303 commented 2 years ago

This is already supported in Rust via AnonymousScan trait. This trait allows for predicate, projection & slice pushdowns. We could potentially look at exposing a python interface to interact with that trait?

Alternatively, You could even create your own extension library, with rust bindings that leverages the trait.

this is an example of such an extension. https://github.com/universalmind303/polars-mongo

OneRaynyDay commented 2 years ago

Yeah I think having a python interface would be awesome! Currently the effort to build rust at my company is nascent at best, so having a python fn drop-in would require the least amount of hoops to jump

tmct commented 1 month ago

This functionality through _scan_python_function looks very useful for defining custom sources of LazyFrames, I'll try it, thanks.

If it works well, it would be nice for this to become an official (non-private) extensibility point in the same way that register_lazyframe_namespace is - both feel like useful ways to try out a piece of functionality for fit in Python before writing a whole Rust plugin.

Does this method of defining a LazyFrame allow for a lazy schema definition, as can be used with the recent collect_schema()?

Thanks

cmdlineluser commented 1 month ago

I may be mistaken, but IO Plugins were recently added - do they solve this?

Relevant test that was added:

tmct commented 1 month ago

I didn't know about those! Will have to have a look, thanks

tmct commented 1 month ago

That does indeed look extremely promising, most notably that it's more of a public API. I'll use those. Thanks again

It looks like the "lazy schema" isn't ready yet - do you know if there is an issue to track that?

# TODO: make lazy via callable

tmct commented 1 month ago

And - sorry for my unfamiliarity - do you know what it means for the iterator to return multiple DataFrames? ~Is this for a scenario where scan_my_source returns multiple LazyFrames in a tuple?~

My best guess is that the DataFrames it returns would be vertically concatenated into the result... Not least because there is a batch_size arg. ...and maybe that's for streaming!

tmct commented 1 month ago

IO Plugins do indeed seem to be the way forward - thanks for the pointer! The seem to work well for my use case.

Couple of issues:

Therefore I've raised a new Issue about those, particularly the first: #18638