pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License

Sharing reference in pyclass #7

Open sdrap opened 1 year ago

sdrap commented 1 year ago

Thanks for the great work with polars as well as the pyo3 exposition of dataframe/series.

I have the following problem (disclaimer: I am still only a couple of hours away from "hello Rust").

I want a rust struct containing several dataframes accessible in python with polars (and eventually zero copy to pandas with arrow).

I don't quite understand how to pass a shared reference in that case.

Here is a minimal example.

[dependencies]
pyo3 = "0.18.1"
polars = { version = "0.27", default_features = false }
polars-core = { version = "0.27", default_features = false }
thiserror = "1"
arrow2 = "0.16"
pyo3-polars = "0.2.0"

use pyo3::prelude::*;
use pyo3_polars::PyDataFrame;

#[pyclass]
struct Container {
    #[pyo3(get)]
    df: PyDataFrame,
}

#[pymethods]
impl Container {
    #[new]
    fn new(somedf: PyDataFrame) -> Self {
        let df = somedf.into();
        Self { df }
    }
}

#[pymodule]
fn mylib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<Container>()?;
    Ok(())
}

Compilation is OK. Now in Python:

import mylib  # dumb module name
import polars as pl

df = pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
mycontainer = mylib.Container(df)

print(mycontainer.df)  # prints df

a = mycontainer.df
a[0, 'A'] = 5

print(a)               # the first row of col 'A' of a is now 5
print(mycontainer.df)  # is still df

I thought that everything would be zero-copy or just passing of references so that any manipulation on the rust or python side would be on the same reference object.

I might be a bit naive (there is something about using some Py<..> for classes in pyo3, but I don't understand how it fits within pyo3-polars).

Many thanks and sorry for a beginner kind of question.

aldanor commented 10 months ago

It's not really a pyo3-polars problem.

Everything is zero-copy indeed... until you start mutating things like you try to.

In general, you want to avoid doing a[0, 'A'] = 5 since it may result in copying the entire 'A' column.

// See PySeries::set_at_idx().
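To make the "zero-copy until you mutate" point concrete, here is a toy model in plain Python (no polars or pyo3 involved). It assumes that the #[pyo3(get)] getter hands back a clone of the field, i.e. a fresh wrapper around the same column buffers, and that a write replaces the affected column rather than editing the shared buffer. The Frame and Container classes and the set method are hypothetical stand-ins for illustration, not the actual polars or pyo3-polars implementation.

```python
class Frame:
    """Toy stand-in for a DataFrame: column name -> immutable tuple of values."""

    def __init__(self, columns):
        # New dict (a new wrapper), but the column tuples themselves are shared.
        self.columns = dict(columns)

    def set(self, row, name, value):
        # Copy-on-write: build a new column and rebind it in this frame only.
        old = self.columns[name]
        self.columns[name] = old[:row] + (value,) + old[row + 1:]


class Container:
    """Toy stand-in for the #[pyclass] holding a frame."""

    def __init__(self, frame):
        self._frame = frame

    @property
    def df(self):
        # Assumed getter behaviour: return a clone, not the stored object.
        return Frame(self._frame.columns)


c = Container(Frame({"A": (1, 2, 3), "B": (4, 5, 6)}))

a = c.df
# Before any write, the clone shares its column buffers with the container.
shared_before = a.columns["A"] is c.df.columns["A"]

a.set(0, "A", 5)  # only column "A" of `a` is copied and replaced

print(shared_before)        # True
print(a.columns["A"])       # (5, 2, 3)
print(c.df.columns["A"])    # (1, 2, 3)
```

Under these assumptions the behaviour in the question follows directly: `a` ends up with its own copy of column A, while the container still holds the original one. To change what the container holds, the new frame would have to be passed back to the Rust side explicitly, for example through a setter.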

To have array data that can be mutated in place from both the Rust and Python sides, you can use numpy wrappers, if that fits the use case.
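For contrast, this is roughly what "mutable from both sides" looks like when two handles really do point at the same memory. The sketch uses only the standard library (array and memoryview) to stay dependency-free. numpy arrays expose their memory through the same buffer protocol, but the Rust-side numpy wrappers mentioned above are not shown here, and the names view_a and view_b are just illustrative.

```python
from array import array

# One owned, mutable buffer of 64-bit signed integers.
column = array("q", [1, 2, 3])

# Two independent views over the same memory (no copy is made).
view_a = memoryview(column)
view_b = memoryview(column)

# A write through one view...
view_a[0] = 5

# ...is visible through the other view and through the owner.
print(view_b[0])  # 5
print(column[0])  # 5
```

Unlike the DataFrame case, nothing here is copied on write, so every holder of a view sees the change. This is the sharing behaviour the original question expected.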