pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License
232 stars 38 forks source link

expression plugin function with multiple columns as input #49

Closed Macfly closed 5 months ago

Macfly commented 8 months ago

I've looked as the example but the haversine one is a bit confusing. It looks like the python example gives 5 parameters to the Rust function so the last one is ignored as the functions only takes 4 parameters.

What is the best way to implement a function that takes multiple columns as parameters and still use all the parallelism and optimization from Polars?

oscar6echo commented 8 months ago

Title

I concur.

The definition of the column is haversine=pl.col("floats").dist.haversine("floats", "floats", "floats", "floats") with sample column floats [5.6, -1245.8, 242.224].
And the unpacking of the data is:

#[polars_expr(output_type_func=haversine_output)]
fn haversine(inputs: &[Series]) -> PolarsResult<Series> {
    let out = match inputs[0].dtype() {
        DataType::Float32 => {
            let start_lat = inputs[0].f32().unwrap();
            let start_long = inputs[1].f32().unwrap();
            let end_lat = inputs[2].f32().unwrap();
            let end_long = inputs[3].f32().unwrap();
// etc

So from the other examples, we understand that inputs[0] is the column on which the plugin function is applied, and inputs[1, 2, 3] are the first 3 column names, and the last one is not used ! If this is correct, this is a bit misleading, isn't it ?

The pig_latinnify and pig_latinnify_with_parallelism make sense, work as expected (I cannot judge the parallelism) and are didactic demos, and so is the kwargs example.

But I would argue a a potentially very generic sample function with several columns as input and several columns as ouput would be more helpful, and, in terms of API it would quite readable to have the inputs and outputs as pl.struct.

Here is the python shell, super generic:

@pl.api.register_expr_namespace("demo")
class Demo:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def demo_calc(self) -> pl.Expr:
        return self._expr._register_plugin(
            lib=lib,
            symbol="demo_calc",
            is_elementwise=True,
            kwargs={},
        )

A vaguely worded rust part would be:

// struct input column
struct FuncStructIn {
    a_float: bool,
    a_int: i64,
    a_string: String,
}

// struct output column
struct FuncStructOut {
    a_valid: bool,
    a_one: f64,
    a_two: f64,
}

// output type is a struct with schema FuncStructOut
#[polars_expr(output_type=FuncStructOut)]
fn demo_calc(inputs: &[Series], kwargs: PigLatinKwargs) -> PolarsResult<Series> {

    // unpack struct to individual variables, with all the relevant error checking
    // obviously this does not work but you get the idea
    let s_in: FuncStructIn = inputs[0].struct();

    let a_float : &ChunkedArray<T>= =    s_in.a_float;
    let a_int = s_in.a_int;
    let a_string = s_in.a_string;

    // impl_demo_calc contains the actuall processing
    let s_out: FuncStructOut = impl_demo_calc(a_float, a_int, a_string)

    // returns result
    // again syntax is super approximative
    Ok(s_out.into_series())
}

// approx syntax
pub(super) fn impl_demo_calc(
    a_float: &ChunkedArray<Float64Type>,
    a_int: &ChunkedArray<Int64Type>,
    a_string: &ChunkedArray<Utf8Type>,
) -> PolarsResult<ChunkedArray<FuncStructOut>> {
    // can be anything but return a FuncStructOut shaped struct
    // unless compute error due to bad input
}

// once the this is done, in a second stage, probably a _with parallelism version can be derived.

Hopefully the intention is clear.

Unfortunately, I don't know enough about polars-rs, and to be honest rust in general, at the moment.
The unpacking/re packing in particular is obscure to me..

If there is no serious drawback to this approach - eg. perf cost creating the input struct - then, once the unpacking/repacking and error management is demonstrated it can become a useful boilerplate to start from for many use cases/people.

Whether maintainers are willing to improve this part of the doc or not, I would be grateful for pointers to continue exploring.

Macfly commented 6 months ago

looks like the dev has been done: https://github.com/pola-rs/pyo3-polars/commit/870da1e00ce2024eddab2fd724cc9b605716e2a9

But I'm still not sure how we can pass multiple columns to a function.

oscar6echo commented 6 months ago

looks like the dev has been done: https://github.com/pola-rs/pyo3-polars/commit/870da1e00ce2024eddab2fd724cc9b605716e2a9

I think this solves a use case brought forward by polars extension plugin author Ion Koutsouris in this issue https://github.com/pola-rs/pyo3-polars/issues/58.

In the meantime I've learnt some Rust and done some research and existing comunity plugins are the best information sources. There I could find exemples with input as a single column of type list or struct, and output as struct. Aggregating several columns into a col of type struct is easy (and cheap in lazy mode, it seems). Same for unpacking a struct into several columns.

I am not far from testing it and if confirmed, then my naive question above will be answered.

MarcoGorelli commented 6 months ago

Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/

I think you're right about the haversine example though, will make a pr to clear it up

Macfly commented 6 months ago

Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/

I think you're right about the haversine example though, will make a pr to clear it up

yes thanks a lot. One more thing that is not clear. it looks like "pig_latinnify_with_paralellism" is not used anywhere, just defined. When do we need to use a library like Rayon to speed things up as one of the advantage of using shared library plugins is to have "Parallelism and optimizations (are) managed by the default polars runtime"?

MarcoGorelli commented 6 months ago

it looks like "pig_latinnify_with_paralellism" is not used anywhere, just defined

I have the same question :)

oscar6echo commented 6 months ago

Does this help? https://marcogorelli.github.io/polars-plugins-tutorial/sum/

YES it does !
I read through it all and implemented along the way - not done the suggested extra work yet.
But it is a very valuable help :pray:

For somebody intermediate in rust, beginner in rust polarrs, proficient in python polars, the following sections, in particular, got my attention: