Metadata loss with Pandas Series

A pandas Series carries valuable metadata such as an index and a name. Currently all of it is lost because the conversion uses py_to_r(x$values). (Note that every Series has an index, while the name is optional.)

library(reticulate)

pd <- import("pandas", convert = FALSE)
np <- import("numpy", convert = FALSE)

s <- pd$Series(np$random$randn(5L), index = list('a', 'b', 'c', 'd', 'e'))
s$name <- "hi"

s
#> a    0.205348
#> b    1.159546
#> c    0.720544
#> d    0.618997
#> e    0.325620
#> Name: hi, dtype: float64

py_to_r(s)
#> [1] 0.2053485 1.1595464 0.7205440 0.6189972 0.3256199

If we want to keep this information, I see two options.

1) Treat a Series as a one-column data frame.

Unnamed Series have to be handled explicitly, but the name is kept whenever there is one.
The index is converted for free by the existing data frame index conversion.

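py_to_r.pandas.core.series.Series <- function(x) {

  disable_conversion_scope(x)

  x_name <- py_to_r(x$name)

  # Unnamed Series: supply a fallback column name via to_frame()
  # rather than assigning x$name, which would modify the caller's Series
  if (is.null(x_name)) {
    return(py_to_r(x$to_frame(name = "value")))
  }

  py_to_r(x$to_frame())

}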

Resulting in:

> py_to_r.pandas.core.series.Series(s)
          hi
a -0.2501778
b  1.3094112
c  0.3495218
d  1.7427713
e -0.3065464
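
In plain R terms, the first option maps the values, the index, and the name onto a one-column data frame. Here is a minimal sketch of that target shape, assuming the three pieces have already been converted to ordinary R objects. The function `series_to_frame` and its arguments are made up for illustration and are not reticulate code; only the "value" fallback comes from the proposal above.

```r
# Hypothetical illustration of option 1 using base R only.
# `values` and `labels` stand in for the already-converted Series values and index.
series_to_frame <- function(values, labels, name = NULL) {
  # Fall back to a default column name for unnamed Series
  if (is.null(name)) {
    name <- "value"
  }

  # Index labels become row names. R requires row names to be unique,
  # so this assumes the index has no duplicate labels.
  df <- data.frame(values, row.names = as.character(labels),
                   stringsAsFactors = FALSE)
  names(df) <- name
  df
}

named   <- series_to_frame(c(0.5, 1.5, 2.5), c("a", "b", "c"), name = "hi")
unnamed <- series_to_frame(c(0.5, 1.5, 2.5), c("a", "b", "c"))
```

One thing to keep in mind with this approach: a pandas index may contain duplicate labels, while data frame row names may not, so a real implementation would need to decide what to do in that case.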

2) Treat the Series like a named vector

There is no way to keep the name, unless it is stored as an attribute.
If the R equivalent of a Series is conceptually a vector, this option makes more sense.
The index conversion code would have to be extracted from py_to_r.pandas.core.frame.DataFrame into its own function for use here, but it should "just work" for Series too.

py_to_r.pandas.core.series.Series <- function(x) {

  disable_conversion_scope(x)
  x_r <- py_to_r(x$values)
  index <- x$index

  # New function, to be extracted from py_to_r.pandas.core.frame.DataFrame
  x_r <- add_index_rownames(x_r, index)

  x_r
}

Resulting in:

> py_to_r.pandas.core.series.Series(s)
         a          b          c          d          e 
-0.2501778  1.3094112  0.3495218  1.7427713 -0.3065464 
attr(,"pandas.index")
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
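
The `add_index_rownames()` helper used above does not exist yet. Here is a rough sketch of what the vector case might look like, assuming the index labels have already been converted to an R character vector. The function name `add_index_names`, its arguments, and the `pandas.name` attribute are all hypothetical and not part of reticulate.

```r
# Hypothetical helper: attach already-converted index labels (and, optionally,
# the Series name) to a plain R vector. Base R only.
add_index_names <- function(values, labels, name = NULL) {
  stopifnot(length(values) == length(labels))

  # Index labels become the names of the vector
  names(values) <- as.character(labels)

  # A vector has no natural slot for the Series name, so keep it as an attribute
  if (!is.null(name)) {
    attr(values, "pandas.name") <- name
  }

  values
}

x <- add_index_names(c(0.5, 1.5, 2.5), c("a", "b", "c"), name = "hi")
x[["b"]]                 # look up a value by its index label
attr(x, "pandas.name")   # recover the Series name
```

Storing the name as an attribute keeps the result an ordinary named vector, at the cost that many R operations drop custom attributes.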

Overall, I'm a fan of the second proposal. Given this:

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

https://pandas.pydata.org/pandas-docs/stable/dsintro.html

I think it's "close enough" to say that a Series object is an indexable array, and so setting the names (while preserving other metadata) is the best way forward. In other words, we should convert them roughly the same as we do NumPy arrays, while attempting to preserve relevant object metadata.

In addition, since I think it's common to have Series objects live as part of a Pandas DataFrame, it would be more natural to treat them as vectors rather than one-column DataFrames.

rstudio / reticulate

Metadata loss with Pandas Series #275