pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License

The first call to pyfunction returning PyDataFrame is slow #72

Open Androidown opened 5 months ago

Androidown commented 5 months ago

I found that the very first call to a pyfunction that returns a PyDataFrame takes about 100 ms longer than later calls.

Here is a minimal reproducible example:

extension


use pyo3::prelude::*;
use pyo3_polars::PyDataFrame;

#[pyfunction]
fn dup(pydf: PyDataFrame) -> PyDataFrame {
    let df = pydf.0;
    let new_df = df.vstack(&df.clone()).unwrap();
    PyDataFrame(new_df)
    // Python::with_gil(|py| into_py(&PyDataFrame(new_df), py))
}

#[pymodule]
#[pyo3(name = "test")]
fn py(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(dup, m)?)?;
    Ok(())
}

> ipython

In [1]: import test
   ...: import polars as pl
   ...: df = pl.DataFrame({'a': [1], 'b': [2]})

In [2]: %time test.dup(df)
Wall time: 110 ms
Out[2]:
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
│ 1   ┆ 2   │
└─────┴─────┘

In [3]: %time test.dup(df)
Wall time: 0 ns
Out[3]:
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
│ 1   ┆ 2   │
└─────┴─────┘


Is there anything I can do to eliminate this delay?

sdd commented 3 weeks ago

I see the same issue. It happens even when the function trivially returns an empty dataframe. I've created a minimal reproducible example here: https://github.com/sdd/py03-bug

❯ python main.py
shape: (0, 0)
┌┐
╞╡
└┘
time taken for first: 42.98 ms
shape: (0, 0)
┌┐
╞╡
└┘
time taken for second: 0.02 ms

sdd commented 3 weeks ago

I have a more complex module for another use case, and there the delay is 600 ms on the first call that returns a dataframe, even if the dataframe is empty.

sdd commented 3 weeks ago

Update: I think I've solved my own problem.

The Python script did not contain import polars, so pyo3-polars had to import polars itself the first time a dataframe was returned.

Once my script had import polars at the top, the first call was just as fast as any other.
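
For anyone who wants to check whether a lazy import explains a slow first call in their own setup, here is a minimal sketch of the effect. It uses only the standard library, and everything in it is an assumption for illustration: decimal is a stand-in for polars, and lazy_call and time_ms are hypothetical helpers, not part of pyo3-polars. The function imports its dependency on first use, which is roughly what seems to happen when a PyDataFrame is converted without polars already loaded.

```python
import importlib
import sys
import time

# Stand-in for "polars" so the sketch runs without third-party packages.
# Swap in "polars" to measure the real import cost on your machine.
STAND_IN = "decimal"


def lazy_call():
    """Hypothetical function that imports its dependency on first use."""
    module = importlib.import_module(STAND_IN)
    return module.__name__


def time_ms(fn):
    """Call fn once and return (result, elapsed wall time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


# The first call pays for the import, unless the module was already loaded.
first_result, first_ms = time_ms(lazy_call)
# The second call finds the module cached in sys.modules.
second_result, second_ms = time_ms(lazy_call)

print(f"time taken for first: {first_ms:.3f} ms")
print(f"time taken for second: {second_ms:.3f} ms")
```

If the stand-in module is already in sys.modules before the first call, both timings should come out small. That matches the observation above that importing polars at the top of the script removes the first-call delay.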