rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Feature request: Support Polars DataFrames #1319

Open vnijs opened 1 year ago

vnijs commented 1 year ago

I have been using Polars in Python and it is a wonderful, fast, DataFrame library for Python and Rust. There even seems to be work on creating R-bindings for polars as well (https://github.com/pola-rs/r-polars).

I use reticulate a lot in shiny apps and it would be great if reticulate could also support the Polars DataFrame format, at least in terms of being able to convert a Polars DataFrame to an R data.frame. Since polars is based on Arrow, I hope this may be possible.

Below is an example of what currently happens when using reticulate with a Polars DataFrame.

library(reticulate)

test_str <- '
import polars as pl

df = pl.DataFrame({
  "a": [1, 2, 3],
  "b": ["x", "y", "z"]
})
'

answer <- py_run_string(test_str)

# works
py$df

# reports false
is.data.frame(py$df)

# DataFrame looks good
py$df

# row indexing works
py$df[0]
py$df[2]
py$df[3]

# column indexing works
py$df["b"]
py$df["a"]

# not all indexing works: a bare `:` is not valid R syntax, so this line fails to parse
# py$df[2, :]

# reports an error
as.data.frame(py$df)

# Error in as.data.frame.default(py$df) :
#  cannot coerce class 'c("polars.internals.dataframe.frame.DataFrame", "python.builtin.object"' to a data.frame

OmarAshkar commented 1 year ago

+1 for this one. Maybe add an option() argument to choose between pandas and polars.

dfalbel commented 1 year ago

Just to make sure I understand the request correctly.

We could implement the py_to_r method for polars data frames. This means that whenever a python function called by reticulate returned a polars data frame, it would be converted into an R data frame. This is the same behavior as we have for pandas. Users can opt out by passing convert = FALSE when importing the module.

For example, if we implemented py_to_r for polars data frames, calling something like the code below would return an R data frame, while it currently returns a pointer to a Polars data frame.

polars <- reticulate::import("polars")
df <- polars$dataframe$DataFrame(data = list(
  hello = 1:5
))
df

To be fair, you can get an R data.frame pretty easily by doing:

df$to_pandas()

which will trigger py_to_r method for pandas data frames.

We could also add an option to the r_to_py dataframe method, so R dataframes get converted into polars data frames when cast to Python objects.

Is that what you are suggesting? I don't have strong feelings about either option. However, if we add py_to_r for polars data frames, it will be a potentially breaking change, as users might already be relying on the fact that polars data frames aren't automatically cast into R objects.

OmarAshkar commented 1 year ago

Yes, I am for an automatic py_to_r(). And a parameter should definitely be available for users.

dfalbel commented 1 year ago

@OmarAshkar, do you have an example of some usage where automatic conversion is much nicer than calling .to_pandas()? I'm leaning towards not implementing this in reticulate, as casting is a simple one-liner and it's probably going to be a breaking change for some users.

vnijs commented 1 year ago

@dfalbel Thanks for taking a look at this. What exactly would break? The fact that folks focusing on polars could remove steps in their work? If there are any breaks, I assume they would be quite happy about things being made simpler. It would definitely make writing tests for python/polars to be executed through reticulate much easier.

t-kalinowski commented 1 year ago

I think what @dfalbel is suggesting is that users likely have existing workflows where they expect polars dataframes not to convert eagerly to R dataframes (similar to how TensorFlow tensors don't automatically convert to R arrays, even when convert = TRUE).

The most minimal change I can think of that won't break existing workflows would be to add an as.data.frame.<polars-df> method, which could simply be as_r_value(x$to_pandas()). This would make as_tibble() work as well.

t-kalinowski commented 1 year ago

We can also add a [.<polars-df> method, to make missing axes more ergonomic. E.g., make py$df[2, ] equivalent to df[2, :] in Python.

Today, if you want to pass a Python : to [, that can be done (admittedly, not very ergonomically) like this:

bt <- import_builtins()
bt$slice(NULL)

for example

py$df[2, bt$slice(NULL)]
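For context on why bt$slice(NULL) works: in Python, a bare : inside [...] reaches __getitem__ as slice(None), and a comma-separated index arrives as a tuple. A minimal sketch in plain Python, using a hypothetical KeyEcho class (not part of polars or reticulate) that simply returns whatever key it receives:

```python
class KeyEcho:
    """Hypothetical helper for illustration: returns the key passed to []."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

# A bare ":" becomes slice(None); "2, :" becomes the tuple (2, slice(None)).
key = echo[2, :]
print(key)  # (2, slice(None, None, None))

# Passing the slice object explicitly produces the same key.
assert key == echo[2, slice(None)]
```

So passing slice(None) explicitly from R should hand the DataFrame the same key that df[2, :] produces in Python.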

t-kalinowski commented 1 year ago

The current version of reticulate brings slice support to [ and [<-. (Added in #1432).

This now works:

## slice a NumPy array
x <- np_array(array(1:64, c(4, 4, 4)))

# R expression | Python expression
# ------------ | -----------------
  x[0]         # x[0]
  x[, 0]       # x[:, 0]
  x[, , 0]     # x[:, :, 0]

  x[NA:2]      # x[:2]
  x[`:2`]      # x[:2]

  x[2:NA]      # x[2:]
  x[`2:`]      # x[2:]

  x[NA:NA:2]   # x[::2]
  x[`::2`]     # x[::2]

  x[1:3:2]     # x[1:3:2]
  x[`1:3:2`]   # x[1:3:2]

See ?py_get_item for examples.

The same syntax should work for Polars DataFrames.
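For reference, the Python expressions in the right-hand column are themselves shorthand for slice objects, with an omitted bound stored as None. A small sketch in plain Python, using a list as a stand-in for the array (the variable names are illustrative only):

```python
# A plain list stands in for the array; names here are illustrative only.
data = list(range(10))

# Each entry pairs the shorthand result with the explicit slice-object result.
pairs = {
    "[:2]": (data[:2], data[slice(None, 2)]),
    "[2:]": (data[2:], data[slice(2, None)]),
    "[::2]": (data[::2], data[slice(None, None, 2)]),
    "[1:3:2]": (data[1:3:2], data[slice(1, 3, 2)]),
}

for expr, (shorthand, explicit) in pairs.items():
    # Both spellings select the same elements.
    assert shorthand == explicit
    print(expr, shorthand)
```

This is presumably why NA appears in R forms such as x[NA:2]: it marks the bound that Python leaves as None.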

plasmak11 commented 10 months ago

Would love to see this as well!

tontief commented 2 months ago

What's the status on this? Currently, when using polars in Quarto with revealjs, rendering is terrible. Is it possible to pre-process everything and use to_pandas without showing that?

t-kalinowski commented 2 months ago

CC @cderv, do you have any thoughts about ☝🏻 ?