pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Expose Python interface for other rust applications #1325

Closed jg2562 closed 1 year ago

jg2562 commented 2 years ago

Currently the python-rust interface lives within py-polars and is only published to PyPI. It would be helpful for other applications that need to pass dataframes over that interface to have access to the PyO3 wrapper type.

Is there any way to get access to the wrapper type so that a DataFrame can be returned to Python using pyo3?

ritchie46 commented 2 years ago

Hi @jg2562 what would you like to do, so that I have a bit more of an understanding what is possible.

jg2562 commented 2 years ago

Hi @ritchie46, thanks for the reply. We are working on an application where the core is written in rust. We use Python to call functions in rust (as most of the legacy code is written in Python) and we also use python for quick proofs of concept before finalizing them in rust.

For a more concrete example, we are using serde on a struct containing a DataFrame, combined with zstd, to create a compressed version of our data (which is non-homogeneous in terms of data types). Since rust is loading the data, we currently need to unpack the data from the dataframe into structs which can be passed back to Python.

I was wondering if there was a way to expose the Python interface as a rust library to allow for us to simply pass the DataFrame to Python directly. It seems like other libraries that are written in rust for Python that want to build off of polars will also run into this issue, so it could help them too!

ritchie46 commented 2 years ago

The easiest thing to do is using arrow and pyarrow to communicate the memory. Then those arrow arrays can be used to create polars dataframes/series in python polars as well as rust polars.

This will mostly be zero copy. Here is the code polars uses to communicate between pyarrow/rust-arrow: https://github.com/pola-rs/polars/tree/master/py-polars/src/arrow_interop

jg2562 commented 2 years ago

Thank you so much! I will definitely look into that. Just out of curiosity, is there something that makes exposing the interface difficult?

ritchie46 commented 2 years ago

Just out of curiosity, is there something that makes exposing the interface difficult?

Well.. TBH, I don't really know what exposing the interface means? Do you mean compiling rust against python polars?

Or interact with a precompiled rust binary? Or using rust polars and send a dataframe to a python polars process?

jg2562 commented 2 years ago

That's fair, it's pretty vague. I was imagining the last one of having rust polars and sending a dataframe to the python polars processes when I said exposing the interface.

ritchie46 commented 2 years ago

I was imagining the last one of having rust polars and sending a dataframe to the python polars processes when I said exposing the interface.

In that case you should use pyo3 and some copy pasting of the code snippets I referenced. That should work!

jg2562 commented 2 years ago

Hey @ritchie46! I ended up working on a different project for a bit but I finally got around to making a small example. I was able to get the snippets to work, so at least I can better show an example of what I was thinking and why I was wondering if the PyDataFrame could be exposed.

Here is the repo. The use case would be running example.py, but you can see that there was a lot of scripting just to emulate passing the dataframe back and forth across the FFI boundary. Lemme know what you think, and thank you so much for the direction and help!

MarcoLugo commented 2 years ago

Not sure if this is related. I am looking to reuse PyDataFrame in my own library built with pyo3. Is the arrow conversion as @jg2562 did the best way to do it or is there something easier/more direct? Thank you.

I would like to do something like this:

use pyo3::prelude::*;

#[pyfunction]
fn read_my_format() -> PyResult<PyDataFrame> {
    Ok(read_my_format_into_polars_df("my_file"))
}

#[pymodule]
fn my_lib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(read_my_format, m)?)?;
    Ok(())
}

jg2562 commented 2 years ago

@MarcoLugo, after a recent update the repo that I posted breaks if you try to use some data types (DateTime64 for example). I think it would still be valuable to have access to the PyDataFrame if that's doable, since it would be properly tied to the library rather than a hack on top of it. However, I really do not know how difficult this is, so we should consult more with @ritchie46 since he would know much more.

gunjunlee commented 1 year ago

@ritchie46 I wrote some code that converts a rust dataframe to a python polars dataframe

pub fn rust_dataframe_to_py_dataframe(dataframe: &mut DataFrame) -> PyResult<PyObject> {
    let dataframe = dataframe.rechunk();

    let gil = Python::acquire_gil();
    let py = gil.python();

    let names = dataframe.get_column_names();

    let pyarrow = py.import("pyarrow")?;
    let polars = py.import("polars")?;
    let rbs: Vec<PyObject> = dataframe
        .iter_chunks()
        .map(|rb| to_py_rb(&rb, &names, py, pyarrow).unwrap())
        .collect::<Vec<PyObject>>();
    let rbs: PyObject = rbs.into_py(py);
    let rbs: &PyList = rbs.extract(py)?;
    let py_table = pyarrow.getattr("Table")?.call_method1("from_batches", (rbs, ))?;
    let py_df = polars.call_method1("from_arrow", (py_table, ))?;  // << This line takes much time
    Ok(py_df.to_object(py))
}

but this takes too much time.

I guess there is a much easier and faster way to convert a rust dataframe to a python dataframe, because the python dataframe is just a wrapper around the rust dataframe

But I don't know how to implement this. Could you help me?

If it were possible to import py-polars in rust, it would be easy to implement the idea above, but for some reason I cannot import py-polars even if I add it as a cargo dependency, for example:

[dependencies]
py-polars = { path = "polars/py-polars" }

cavenditti commented 1 year ago

Hello, I was casually looking into this and just wanted to share some insight with @gunjunlee. I'm no Rust expert, so this may be inaccurate. If so, please correct me 🙂

py-polars uses cdylib as crate-type (have a look at the linkage reference), which means it cannot be imported in other crates. That specific crate-type is required by PyO3, because it needs to build a dynamic library to end up in the Python wheel. I don't have enough understanding of PyO3 and CPython internals to tell you if (and how) it's possible to create some kind of interface to just write a Rust function returning a PyDataFrame from py-polars and make everything work.

I don't think there is any reasonable alternative to using arrow and pyarrow

jg2562 commented 1 year ago

I've seen this issue pop up a few times in the last few days (#4264, #4212, kinda #1830). I wanted to reopen discussion to talk about creating an api that is tied to polars development for people to link against. While the current example works and is very helpful, it is something that has to be reimplemented in every code base, making it not very ergonomic to use. It also isn't tied to development of polars since it's being reimplemented, so it falls out of sync and breaks during updates in different people's projects. @ritchie46 mentioned he was considering making an api in #4212 if he had time; if you would like help with creating it please let us know!

jmrgibson commented 1 year ago

The way I've done this for my projects is to split up the python content into multiple crates. For example, I have a py-interface rlib crate that contains the #[pyfunction], #[pyclass], etc. items, which can be used from other rust projects (and would be published to crates.io). Then I have a py-module cdylib crate that simply includes functions/classes from py-interface, and exports them to a #[pymodule].

In this case, we could keep py-polars as the cdylib and make a new (rlib) crate that contains the pyo3 type definitions. I can work on this if people think this is the right direction to go.

jg2562 commented 1 year ago

To me, that's exactly the right direction to go! Just separating them and allowing access to py-interface on crates.io would, I think, greatly help the rust community to use polars.

jmrgibson commented 1 year ago

@ritchie46 Do you think this is the correct approach?

jmrgibson commented 1 year ago

I'm working on this here: https://github.com/jmrgibson/polars/tree/user/jgibson/split_out_py_polars_as_rust_crate

It appears to work using the nightly compiler. Looks like newer polars relies on simd which is nightly only? I'll continue to investigate, I'd like to get this working on stable.

For example, the following code works:

use py_polars_core::PyDataFrame;
let time: Series = time_ns.into_iter().collect();
let df = DataFrame::new(
    vec![data.clone(), time]
);
let df = PyDataFrame {
    df
};
let args = (df,);
let res = Python::with_gil(|py| -> PyResult<DataFrame> {
     let res = pyfunc.call1(py, args)?;
     let pdf = res.extract::<PyDataFrame>(py)?;
     Ok(pdf.df)
});

ritchie46 commented 1 year ago

I don't think we should ship the python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

jmrgibson commented 1 year ago

I don't think we should ship the python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

I don't think I understand enough about pyo3 to figure out where the copying is happening in this case.

E.g. If I want to call a python function with a dataframe I create in rust, and get a dataframe back to rust:

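# module.py
import polars as pl

def manipulate_df(df: pl.DataFrame) -> pl.DataFrame:
    ...  # user writes manipulation function here

// main.rs (PyDataFrame comes from the py_polars_core crate in the branch above)
use polars::prelude::*;
use py_polars_core::PyDataFrame;
use pyo3::prelude::*;

fn main() -> PyResult<()> {
  let df = df!(
      "data" => [1.0, 2.0],
      "time" => [1.0, 2.0],
  )
  .unwrap();

  let modified_df = Python::with_gil(|py| -> PyResult<DataFrame> {
      let module = PyModule::import(py, "module")?;
      let pydf: PyDataFrame = df.into();
      let args = (pydf,);
      // look up the user's function on the imported module
      let result: PyDataFrame = module.getattr("manipulate_df")?.call1(args)?.extract()?;
      Ok(result.df)
  })?;
  Ok(())
}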

Based on the docs for Py::new, which is what the default #[pyclass] uses, this is creating a new object on the python heap. Does that mean the entire inner DataFrame is getting copied from the rust stack to the python heap?
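
One way to reason about this question: moving a Rust value into a heap allocation copies only the value's own fields, not the buffers those fields point to. The sketch below uses only the standard library and is not polars or pyo3 code. `Frame` is a hypothetical stand-in for a DataFrame, on the assumption that its columns sit behind reference-counted pointers, and `Box` stands in for the object allocated on the Python heap.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a DataFrame: a list of reference-counted columns.
struct Frame {
    columns: Vec<Arc<Vec<f64>>>,
}

/// Address of the first column's data buffer.
fn first_buffer_addr(frame: &Frame) -> usize {
    frame.columns[0].as_ptr() as usize
}

/// Moves a frame into a heap allocation (a stand-in for the Python heap)
/// and reports the column buffer address before and after the move.
fn move_to_heap(frame: Frame) -> (usize, usize, Box<Frame>) {
    let before = first_buffer_addr(&frame);
    // Only the Vec header (pointer, length, capacity) is moved into the box;
    // the column data itself stays where it was allocated.
    let boxed = Box::new(frame);
    let after = first_buffer_addr(&boxed);
    (before, after, boxed)
}

/// Builds a small two-column frame for the demonstration.
fn sample_frame() -> Frame {
    Frame {
        columns: vec![Arc::new(vec![1.0, 2.0]), Arc::new(vec![1.0, 2.0])],
    }
}
```

Calling `move_to_heap(sample_frame())` yields the same buffer address before and after the move, which suggests that wrapping a frame in a heap object moves a small handle rather than the column data. Whether the same holds for a real PyDataFrame depends on how that type is defined.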

AnatolyBuga commented 1 year ago

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back like you did here with the eager frame?

ritchie46 commented 1 year ago

@ritchie46, do you think it's possible to convert a LazyFrame from Python to Rust and back like you did here with the eager frame?

You'd need to serialize the query plan. This will copy data if you use df.lazy(). If you start your query with pl.scan_x then it won't.

kylebarron commented 1 year ago

I don't think we should ship the python interface for that. We could use Arrow's C interface for that. That is zero copy and much slimmer.

I think this is a good suggestion for making the python interface easier to use from third-party bindings. The example code in the python_rust_compiled_function directory only shows how to transfer a single Series through the C Data Interface. The C Data Interface doesn't define how to transfer an entire DataFrame per se, but you can do it by convention: treat the DataFrame as a struct whose fields are the columns you wish to move. Helper code for that would be useful to people who want to extend Polars but don't have a ton of Arrow experience.
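
To make the "struct of columns" convention a bit more concrete, here is a small, safe Rust model of the schema tree such a transfer would describe. It is a sketch only: `SchemaNode`, `dataframe_schema`, and the column names are made up for illustration, and a real transfer would populate the C `ArrowSchema`/`ArrowArray` structs (raw pointers plus a release callback) rather than these types. The format strings (`+s` for a struct, `g` for float64, `l` for int64) are the ones the Arrow C Data Interface specifies.

```rust
/// Simplified, safe model of one node in an Arrow C Data Interface schema tree.
/// (The real `ArrowSchema` is a C struct with raw pointers and a release callback.)
#[derive(Debug, Clone, PartialEq)]
struct SchemaNode {
    format: String,
    name: String,
    children: Vec<SchemaNode>,
}

/// Describes a whole DataFrame as one struct-typed node with one child per column.
/// `columns` holds (column name, Arrow format string) pairs.
fn dataframe_schema(columns: &[(&str, &str)]) -> SchemaNode {
    SchemaNode {
        // "+s" is the Arrow format string for a struct
        format: "+s".to_string(),
        name: String::new(),
        children: columns
            .iter()
            .map(|(name, format)| SchemaNode {
                format: format.to_string(),
                name: name.to_string(),
                children: Vec::new(),
            })
            .collect(),
    }
}
```

For instance, `dataframe_schema(&[("data", "g"), ("time", "l")])` gives a struct node with two children, one float64 column and one int64 column. The receiving side would then unpack the struct's children back into individual columns.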

ritchie46 commented 1 year ago

I have a setup of a crate that does this for you hidden behind pyo3 bindings. But haven't yet had the bandwidth/priority to finish this.

AnatolyBuga commented 1 year ago

I have a setup of a crate that does this for you hidden behind pyo3 bindings. But haven't yet had the bandwidth/priority to finish this.

@ritchie46 that would be really useful, especially for types beyond Series/DataFrame (like LazyFrame). I can try helping (although I am still a bit of a noob)

iskandr commented 1 year ago

I just want to echo that a succinct example of how to create a PyDataFrame in a new Rust project and pass it back into Python code would be very helpful to me and @andyjslee.

kylebarron commented 1 year ago

@ritchie46 mentioned on discord: https://github.com/pola-rs/pyo3-polars

ritchie46 commented 1 year ago

Yes, this is the way to go.

OliverEvans96 commented 4 months ago

Thanks, the pyo3-polars crate is exactly what I was looking for!