pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License
238 stars 39 forks

feat: expression plugins #26

Closed ritchie46 closed 11 months ago

ritchie46 commented 12 months ago

This adds support for Polars plugins: expressions that are compiled in a separate shared library and dynamically linked into the main Polars library.

This means we or third parties can create their own expressions, and they will run on our engine without Python interference, so they are not blocked by the GIL.

We can therefore keep Polars leaner and maybe add support for a polars-distance, polars-geo, polars-ml, etc. Those can then have specialized expressions and don't have to worry as much about code bloat, as they can be optionally installed.

The idea is that you define an expression in another Rust crate with a proc_macro polars_expr.

That macro can have the following attributes:

Here is an example of a String conversion expression that converts any string to pig latin:

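use polars::prelude::*;
use pyo3_polars::derive::polars_expr;
use std::fmt::Write;

// Write the pig latin form of `value` into `output`; an empty string writes nothing.
fn pig_latin_str(value: &str, output: &mut String) {
    if let Some(first_char) = value.chars().next() {
        // Slice at the first character's byte length so a multi-byte character cannot panic.
        write!(output, "{}{}ay", &value[first_char.len_utf8()..], first_char).unwrap()
    }
}

#[polars_expr(output_type=Utf8)]
fn pig_latinnify(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].utf8()?;
    let out: Utf8Chunked = ca.apply_to_buffer(pig_latin_str);
    Ok(out.into_series())
}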
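
To check what the per-value kernel produces without building the plugin, here is a minimal std-only sketch. `pig_latin_all` is a hypothetical helper, not part of polars or pyo3-polars; it only imitates the buffer-reuse pattern of `apply_to_buffer` on plain string slices.

```rust
use std::fmt::Write;

/// The kernel from above, sliced at the first character's byte length so that
/// a multi-byte first character cannot split a UTF-8 sequence.
fn pig_latin_str(value: &str, output: &mut String) {
    if let Some(first_char) = value.chars().next() {
        write!(output, "{}{}ay", &value[first_char.len_utf8()..], first_char).unwrap()
    }
}

/// Hypothetical stand-in for `apply_to_buffer`: reuse one scratch buffer
/// across all values and keep an owned copy of each result.
fn pig_latin_all(values: &[&str]) -> Vec<String> {
    let mut buf = String::new();
    values
        .iter()
        .map(|v| {
            buf.clear();
            pig_latin_str(v, &mut buf);
            buf.clone()
        })
        .collect()
}
```

Calling `pig_latin_all(&["Richard", "Alice", "Bob"])` yields `ichardRay`, `liceAay` and `obBay`; an empty string stays empty because the kernel writes nothing for it.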

On the python side this expression can then be registered under a namespace:

import polars as pl
from polars.utils.udfs import _get_shared_lib_location

lib = _get_shared_lib_location(__file__)

@pl.api.register_expr_namespace("language")
class Language:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def pig_latinnify(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib,
            symbol="pig_latinnify",
            is_elementwise=True,
        )

Compile/ship and then it is ready to use:

import polars as pl
from expression_lib import Language

df = pl.DataFrame({
    "names": ["Richard", "Alice", "Bob"],
})

out = df.with_columns(
    pig_latin=pl.col("names").language.pig_latinnify()
)

See the full example here: https://github.com/pola-rs/pyo3-polars/tree/plugin/example/derive_expression

aldanor commented 12 months ago

Pretty exciting!

> Here is an example of a String conversion expression that converts any string to pig latin

Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2).

(You'd think many expressions, including prebuilt ones, often have various non-series parameters.)

Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).

ritchie46 commented 11 months ago

> Question 1: (apologies in advance if the questions are stupid) it's just not immediately clear from the diff... how would you pass extra arguments to those functions? E.g. you have a compiled expression multiply_by which you'd want to call like .multiply_by(1) or .multiply_by(2).

Indeed. Currently it only works for multiple expression arguments, though many types can be represented as a single-element Series. I'd welcome a way to make non-Series arguments easier, but I still have to think about that a bit.
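
As a rough model of that workaround, assuming a plain `Vec<i64>` stands in for a Series (nothing below is polars API, and `multiply_by` is only the hypothetical expression from the question), the scalar arrives as a second input of length one and is broadcast over the first:

```rust
/// Model of a plugin body that receives every argument as a column.
/// `inputs[0]` holds the values, `inputs[1]` is the scalar wrapped in a
/// single-element column.
fn multiply_by(inputs: &[Vec<i64>]) -> Result<Vec<i64>, String> {
    let (values, factor) = match inputs {
        [values, factor] => (values, factor),
        _ => return Err(format!("expected 2 inputs, got {}", inputs.len())),
    };
    // The scalar must be exactly one element; anything else is a caller error.
    let factor = match factor.as_slice() {
        [f] => *f,
        other => return Err(format!("factor must have length 1, got {}", other.len())),
    };
    // Broadcast the single factor over every value.
    Ok(values.iter().map(|v| v * factor).collect())
}
```

The cost of this approach is that the length check happens at run time inside every expression, which is part of why dedicated non-Series arguments would be nicer.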

> Question 2: is it the plan to only allow this on series level, or frame level as well? (since some computations you can do much faster internally, handling parallelization yourself if it's problem-specific, as opposed to having polars runtime sort it out).

Only on the Series level. You could of course always accept a Series of type struct to deal with DataFrame-like inputs. Handling the parallelism yourself will lead to contention with the default Polars runtime. To solve that, we would also have to make the rayon thread pool work over FFI, which is currently out of scope for this functionality.

Ideally, I think we would have arguments that allow you to influence the parallelism strategy.

aldanor commented 11 months ago

> Indeed. Currently it only works for multiple expression arguments. Though many types can be represented as single element series. I'd welcome the ability to make non-series arguments easier, but I would still have to think a little bit about that.

Just a random thought, one way would be to come up with a restricted set of allowed argument types safe to send across ffi and thread boundaries, like a json::Value-style enum of all valid scalars (iirc polars already has something similar) plus lists, dicts, nullable stuff etc, and then provide conversions on both pyo3 and rust sides. Along the lines of

// args: Arc<[Arg]>

enum Arg {
    Null,
    String(String),
    List(Arc<[Arg]>),
    // ...
}

It could be implemented completely differently, of course.
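
To make the idea a bit more concrete, here is one possible std-only sketch of such an enum with Rust-side conversions. The `Bool`, `Int`, `Float` and `Dict` variants, the `From` impls, `assert_send_sync` and the `multiply_by` consumer are all assumptions for illustration; the pyo3 side of the conversion is not shown.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Restricted set of argument values; only `Null`, `String` and `List`
/// come from the sketch above, the other variants are assumed.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    String(String),
    List(Arc<[Arg]>),
    Dict(Arc<BTreeMap<String, Arg>>),
}

impl From<i64> for Arg {
    fn from(v: i64) -> Self {
        Arg::Int(v)
    }
}

impl From<&str> for Arg {
    fn from(v: &str) -> Self {
        Arg::String(v.to_owned())
    }
}

/// Nullable values: `None` maps to `Arg::Null`.
impl<T: Into<Arg>> From<Option<T>> for Arg {
    fn from(v: Option<T>) -> Self {
        v.map_or(Arg::Null, Into::into)
    }
}

/// Lists convert element by element into a shared slice.
impl<T: Into<Arg>> From<Vec<T>> for Arg {
    fn from(v: Vec<T>) -> Self {
        Arg::List(v.into_iter().map(Into::into).collect())
    }
}

/// Compiles only if `T` can cross thread boundaries.
fn assert_send_sync<T: Send + Sync>() {}

/// Hypothetical consumer: expects exactly one integer argument.
fn multiply_by(values: &[i64], args: &[Arg]) -> Result<Vec<i64>, String> {
    match args {
        [Arg::Int(factor)] => Ok(values.iter().map(|v| v * factor).collect()),
        other => Err(format!("expected one integer argument, got {other:?}")),
    }
}
```

Because every variant owns its data or shares it through `Arc`, `assert_send_sync::<Arg>()` compiles, which is the thread-safety property mentioned above. Whether the type is also safe to pass across an FFI boundary depends on the ABI chosen and is not addressed here.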