rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Metadata loss with Pandas Series #275

Open DavisVaughan opened 6 years ago

DavisVaughan commented 6 years ago

A pandas Series carries useful metadata, namely an index and a name. Currently both are lost because the conversion uses py_to_r(x$values), which converts only the underlying values. (Every Series has an index; the name is optional.)

library(reticulate)

pd <- import("pandas", convert = FALSE)
np <- import("numpy", convert = FALSE)

s <- pd$Series(np$random$randn(5L), index = list("a", "b", "c", "d", "e"))
s$name <- "hi"

s
#> a    0.205348
#> b    1.159546
#> c    0.720544
#> d    0.618997
#> e    0.325620
#> Name: hi, dtype: float64

py_to_r(s)
#> [1] 0.2053485 1.1595464 0.7205440 0.6189972 0.3256199

To keep this information, I see two options.

1) Treat a Series as a 1 column data frame.

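py_to_r.pandas.core.series.Series <- function(x) {

  disable_conversion_scope(x)

  # Series names are optional; fall back to "value" as the column name
  x_name <- py_to_r(x$name)
  if (is.null(x_name)) {
    x_name <- "value"
  }

  # Pass the name to to_frame() instead of assigning x$name, so the
  # caller's Series is not modified as a side effect
  py_to_r(x$to_frame(name = x_name))

}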

Resulting in:

> py_to_r.pandas.core.series.Series(s)
          hi
a -0.2501778
b  1.3094112
c  0.3495218
d  1.7427713
e -0.3065464

2) Treat the Series like a named vector.

py_to_r.pandas.core.series.Series <- function(x) {

  disable_conversion_scope(x)
  x_r <- py_to_r(x$values)
  index <- x$index

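  # add_index_rownames() is a proposed new helper (not yet implemented):
  # it would set names from the index and keep the index as an attribute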
  x_r <- add_index_rownames(x_r, index)

  x_r
}

Resulting in:

> py_to_r.pandas.core.series.Series(s)
         a          b          c          d          e 
-0.2501778  1.3094112  0.3495218  1.7427713 -0.3065464 
attr(,"pandas.index")
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
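
The helper used in option 2 is only proposed, not defined. A minimal sketch of what such a helper might do, written in plain R and assuming the index labels have already been converted to an R vector. The name set_index_names() and its arguments are hypothetical, chosen for illustration, and are not part of reticulate:

```r
# Hypothetical sketch; not part of reticulate's API.
# values : atomic vector already converted from the Series values
# labels : index labels, assumed already converted to an R vector
# index  : optional original index object, kept as an attribute
set_index_names <- function(values, labels, index = NULL) {
  stopifnot(length(values) == length(labels))
  names(values) <- as.character(labels)
  if (!is.null(index)) {
    attr(values, "pandas.index") <- index
  }
  values
}

x <- set_index_names(c(-0.25, 1.31, 0.35), c("a", "b", "c"))
x[["b"]]  # look up a value by its label
```

Keeping the original index as an attribute means that information which names() cannot represent is still available after conversion.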

kevinushey commented 6 years ago

Overall, I'm a fan of the second proposal. Given this:

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

https://pandas.pydata.org/pandas-docs/stable/dsintro.html

I think it's "close enough" to say that a Series object is an indexable array, and so setting the names (while preserving other metadata) is the best way forward. In other words, we should convert them roughly the same as we do NumPy arrays, while attempting to preserve relevant object metadata.

In addition, since I think it's common to have Series objects live as part of a Pandas DataFrame, it would be more natural to treat them as vectors rather than one-column DataFrames.
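
A small plain-R illustration of that last point, using made-up data: extracting a column from an R data frame already yields a bare vector, so a named vector is arguably the closest analogue of a labelled Series pulled out of a DataFrame.

```r
# Illustrative data only; the values and labels are made up.
df <- data.frame(hi = c(0.2, 1.1, 0.7), row.names = c("a", "b", "c"))

col <- df[["hi"]]                     # plain numeric vector, labels dropped
named <- setNames(col, rownames(df))  # same values, labels kept as names

is.atomic(col)
names(named)
```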