Closed ritchie46 closed 11 months ago
Pretty exciting!
Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2). (You'd think many expressions, including prebuilt ones, often have various non-series parameters.)
Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).
> Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2).
Indeed. Currently it only works for multiple expression arguments. Though many types can be represented as single element series. I'd welcome the ability to make non-series arguments easier, but I would still have to think a little bit about that.
> Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).
Only on series level. You could of course always accept a series of type struct to deal with dataframe-like inputs. Handling the parallelism yourself will lead to contention with the default polars runtime. If we were to solve this, we should also make the rayon threadpool work over FFI, which is currently out of scope for this functionality.
Ideally, I think we would have arguments that allow you to influence the parallelism strategy.
> Indeed. Currently it only works for multiple expression arguments. Though many types can be represented as single element series. I'd welcome the ability to make non-series arguments easier, but I would still have to think a little bit about that.
Just a random thought, one way would be to come up with a restricted set of allowed argument types safe to send across ffi and thread boundaries, like a json::Value-style enum of all valid scalars (iirc polars already has something similar) plus lists, dicts, nullable stuff etc, and then provide conversions on both pyo3 and rust sides. Along the lines of
// args: Arc<[Arg]>
enum Arg {
    Null,
    String(String),
    List(Arc<[Arg]>),
    ...
}
It could be implemented completely differently, of course.
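To make the idea slightly more concrete, here is a minimal, self-contained sketch in plain Rust with no Polars or pyo3 types. The `Bool`, `Int` and `Float` variants, the `From` conversions, and the `multiply_by` signature are all assumptions made up for illustration; nothing here is an existing Polars API.

```rust
use std::sync::Arc;

// Hypothetical restricted argument type: scalars plus nested lists.
// Only Null, String and List come from the sketch above; the other
// variants are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    String(String),
    List(Arc<[Arg]>),
}

// Conversions that a binding layer could use to build arguments.
impl From<i64> for Arg {
    fn from(v: i64) -> Self {
        Arg::Int(v)
    }
}

impl From<&str> for Arg {
    fn from(v: &str) -> Self {
        Arg::String(v.to_string())
    }
}

// Nullable values map `None` to `Arg::Null`.
impl<T: Into<Arg>> From<Option<T>> for Arg {
    fn from(v: Option<T>) -> Self {
        v.map_or(Arg::Null, Into::into)
    }
}

// Compile-time check that a type can be shared across threads.
fn assert_send_sync<T: Send + Sync>() {}

// Hypothetical expression body: data values plus non-series arguments.
// It validates the argument list by pattern matching on the enum.
fn multiply_by(values: &[i64], args: &[Arg]) -> Result<Vec<i64>, String> {
    match args {
        [Arg::Int(factor)] => Ok(values.iter().map(|v| v * factor).collect()),
        other => Err(format!("expected one integer argument, got {other:?}")),
    }
}
```

With this shape, `.multiply_by(2)` on the Python side would end up as something like `multiply_by(values, &[Arg::Int(2)])` on the Rust side, and a wrong argument type is reported as an error rather than a crash.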
This allows support for polars plugins. These are expressions exposed in a separate shared library and dynamically linked into the main polars library.
This means we or third parties can create our own expressions, and they will run on our engine without Python interference, so they are not blocked by the GIL.
We can therefore keep polars leaner and maybe add support for polars-distance, polars-geo, polars-ml, etc. Those can then have specialized expressions and don't have to worry as much about code bloat, as they can be optionally installed.
The idea is that you define an expression in another Rust crate with a proc_macro, polars_expr.
That macro can have the following attributes:
output_type -> to define the output type of that expression
type_func -> to define a function that computes the output type based on input types.
Here is an example of a String conversion expression that converts any string to pig latin:
On the python side this expression can then be registered under a namespace:
Compile/ship and then it is ready to use:
See the full example here: https://github.com/pola-rs/pyo3-polars/tree/plugin/example/derive_expression
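For a sense of the per-value logic that such a pig latin expression would wrap, here is a minimal sketch in plain Rust, without the polars_expr macro or any Polars types. The function names and the simplified rule (move the first character to the end and append "ay") are assumptions for illustration, not the code from the linked example.

```rust
// Simplified pig latin: move the first character to the end and append "ay".
// An empty string stays empty.
fn pig_latin(word: &str) -> String {
    let mut chars = word.chars();
    match chars.next() {
        // `chars.as_str()` is the remainder after the first character.
        Some(first) => format!("{}{}ay", chars.as_str(), first),
        None => String::new(),
    }
}

// Stand-in for a nullable string column: apply the kernel element-wise
// and keep missing values missing.
fn pig_latin_column(values: &[Option<&str>]) -> Vec<Option<String>> {
    values.iter().map(|v| v.map(pig_latin)).collect()
}
```

A real plugin would apply the same kind of function to a string column inside the macro-annotated expression and declare a string output type.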