rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0
1.68k stars 327 forks source link

Should py_to_r() copy over attributes? #160

Closed DavisVaughan closed 6 years ago

DavisVaughan commented 6 years ago

I'm playing around with the pendulum python library, and wanted to write a wrapper for a pendulum Date that transformed it into an R Date.

I discovered py_to_r() and the underlying S3, py_to_r_wrapper and tried something like:

library(reticulate)
library(magrittr)
pend <- import("pendulum")

d1 <- pend$Date(2013L, 01L, 01L)

class(d1)
#> [1] "pendulum.date.Date"                       
#> [2] "pendulum.mixins.default.TranslatableMixin"
#> [3] "pendulum.mixins.default.FormattableMixing"
#> [4] "pendulum.mixins.default.TestableMixin"    
#> [5] "datetime.date"                            
#> [6] "python.builtin.object"

# Accessors
pend_year <- function(x) {
  x$year
}
pend_month <- function(x) {
  x$month
}
pend_day <- function(x) {
  x$day
}

# Wrapper for conversion
py_to_r_wrapper.pendulum.date.Date <- function(x) {
  chr_x <- paste(pend_year(x), pend_month(x), pend_day(x), sep = "-")
  as.Date(chr_x)
}

# Both of these nuke RStudio
# py_to_r(d1)
# pend$Date(2013L, 01L, 01L)

Which nearly works, but RStudio crashes. The reason is because of py_to_r() copying over all the attributes from the original python object. It replaces the R Date class with all of the python classes. Commenting out the attribute copying line removes the problem.

# py_to_r() copies over attributes from the original python object. intended?
py_to_r_2 <- function (x) 
{
  reticulate:::ensure_python_initialized()
  if (!inherits(x, "python.builtin.object")) 
    stop("Object to convert is not a Python object")
  x <- reticulate:::py_ref_to_r(x)
  wrapper <- reticulate:::py_to_r_wrapper(x)
  #attributes(wrapper) <- attributes(x)
  wrapper
}

py_to_r_2(d1)
#> [1] "2013-01-01"

I assume this is not intended? Or am I just doing something wrong?

Created on 2018-02-20 by the reprex package (v0.1.1.9000).

jjallaire commented 6 years ago

@kevinushey Added this code to support custom marshaling of Pandas data frames (and other object types, BTW dates are on our short term roadmap!).

Kevin, do you recall the purpose of the attributes(wrapper) <- attributes(x) line? Is this something that should be within the default py_to_r_wrapper implementation as opposed to used for all conversions?

DavisVaughan commented 6 years ago

Happy to hear Pandas data frames are on the horizon, it will be nice to not have to go through feather like I did with featherframe recently. Since a number of libraries use Pandas DataFrames and Series, this will be really useful.

Dates should be interesting, as Pandas doesn't have a "native" non-timezone aware "Date" class, and the datetime.date alternative is orders of magnitudes slower. See this discussion.

Thanks for the hard work.

kevinushey commented 6 years ago

I see this is the commit where we added it:

https://github.com/rstudio/reticulate/commit/9efee53fa95e99b799fc7728871389a79369a48b

My guess is that we did this to ensure some attributes that typically get dropped (e.g. names?) are restored on the copied object, but maybe that was misguided?

jjallaire commented 6 years ago

Actually we need the attributes preserved there. The purpose of py_to_r_wrapper is to allow customization of the wrapped object (as opposed to wholesale replacement of the default implementation). I think what @DavisVaughan is looking for might be contained in your pandas PR: https://github.com/rstudio/reticulate/pull/113. This PR makes py_to_r an S3 method which would allow wholesale replacement of the conversion for an object type.

DavisVaughan commented 6 years ago

Looks promising. Having py_to_r() as the generic feels more natural too.

# devtools::install_github("rstudio/reticulate", ref = devtools::github_pull("113"))

library(reticulate)
pend <- import("pendulum")

# Accessors
pend_year <- function(x) {
  x$year
}
pend_month <- function(x) {
  x$month
}
pend_day <- function(x) {
  x$day
}

# Wrapper for conversion
py_to_r.pendulum.date.Date <- function(x) {
  chr_x <- paste(pend_year(x), pend_month(x), pend_day(x), sep = "-")
  as.Date(chr_x)
}

pend$Date(2013L, 01L, 01L)
#> [1] "2013-01-01"

Created on 2018-02-21 by the reprex package (v0.1.1.9000).

jjallaire commented 6 years ago

We've just merged support for converting to and from Pandas data frames onto master. Would love it if anyone subscribed to this thread could test it out. You can install with:

devtools::install_github("rstudio/reticulate")
DavisVaughan commented 6 years ago

If you want me to move these comments to a more appropriate place let me know. Just a few bugs I'm running into immediately.

1) inconsistent (?) behavior when converting data.frames with 1 element

Related to this on the python side, not sure if its anything you guys can do: https://stackoverflow.com/questions/17839973/construct-pandas-dataframe-from-values-in-variables

library(reticulate)

ex <- data.frame(y = Sys.time())
r_to_py(ex)
#> Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: If using all scalar values, you must pass an index
#> 
#> Detailed traceback: 
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 900, in from_dict
#>     return cls(data, index=index, columns=columns, dtype=dtype)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 330, in __init__
#>     mgr = self._init_dict(data, index, columns, dtype=dtype)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 461, in _init_dict
#>     return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 6130, in _arrays_to_mgr
#>     index = extract_index(arrays)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 6169, in extract_index
#>     raise ValueError('If using all scalar values, you must pass'

ex2 <- data.frame(x = 1)
r_to_py(ex2)
#>      x
#> 0  1.0

ex3 <- data.frame(x = "hi")
r_to_py(ex3)
#> Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: Index(...) must be called with a collection of some kind, 'hi' was passed
#> 
#> Detailed traceback: 
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/categorical.py", line 281, in __init__
#>     dtype = CategoricalDtype(categories, ordered)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 154, in __init__
#>     self._finalize(categories, ordered, fastpath=False)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 181, in _finalize
#>     fastpath=fastpath)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 319, in _validate_categories
#>     categories = Index(categories, tupleize_cols=False)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 355, in __new__
#>     cls._scalar_data_error(data)
#>   File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 710, in _scalar_data_error
#>     repr(data)))

2) Conversion from datetime.date -> Date issue

library(reticulate)
library(lubridate)

ex <- data.frame(x = 4, y = Sys.Date())

ex_py <- r_to_py(ex)

ex_py$y$values
#> [datetime.date(2018, 3, 6)]

py_to_r(ex_py)
#> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class "c("datetime.date", "python.builtin.object")" to a data.frame

# Seems like py_ref_to_r didn't actually convert 
# the datetime.date over
item <- list(ex_py[["y"]]$as_matrix())
names(item) <- "y"
converted <- reticulate:::py_ref_to_r(dict(item))
class(converted$y[[1]])
#> [1] "datetime.date"         "python.builtin.object"

3) You're going to need datetime64[ns] support immediately

Since this has to do with py_ref_to_r(), I assume you'll need to support this from the C level Numpy API.

library(reticulate)

ex <- data.frame(x = 4, y = Sys.time())

y_py <- r_to_py(ex$y)
class(y_py)
#> [1] "datetime.datetime"     "datetime.date"         "python.builtin.object"

# The entire data frame
# Individually the y column is converted to datetime.date
# but as soon as it is further passed to pd$DataFrame it is converted
# over to datetime64[ns] by Pandas
ex_py <- r_to_py(ex)

ex_py$y
#> 0   2018-03-06 17:52:34.809640
#> Name: y, dtype: datetime64[ns]

py_to_r(ex_py)
#> Error in py_ref_to_r(dict(items)): Conversion from numpy array type 21 is not supported
kevinushey commented 6 years ago

@jjallaire is there a way to indicate that a length-one vector should be serialized as a vector rather than a scalar? (should reticulate have some helper functions for marking this? E.g. in RStudio we have .rs.scalar() to indicate that something should be unboxed as a scalar rather than a length-one vector; should we have a similar analog / escape-hatch in reticulate?)

kevinushey commented 6 years ago

@DavisVaughan thanks for testing and reporting these! These should mostly be fixed up now (except for the example with Sys.Date()).

Are you aware of whether there's an easy way to detect if a Pandas column is a 'Date' column? E.g. all I see is:

> ex_py$y
0    2018-03-06
Name: y, dtype: object

which unfortunately might imply that we have to independently verify that all of the entries within are datetime objects. Or should we be more proactive and just convert these to NumPy datetime objects rather than Python date objects?

jjallaire commented 6 years ago

Currently we do the automatic conversion to scalar for length 1 then tell users to use e.g. list(1) to indicate they want to preserve the vector/list treatment. I think defaulting to the scalar conversion is the only reasonable choice for end users (since the vast majority of function parameters end up being scalars) On Tue, Mar 6, 2018 at 4:01 PM Kevin Ushey notifications@github.com wrote:

@DavisVaughan https://github.com/davisvaughan thanks for testing and reporting these! These should mostly be fixed up now (except for the example with Sys.Date()).

Are you aware of whether there's an easy way to detect if a Pandas column is a 'Date' column? E.g. all I see is:

ex_py$y 0 2018-03-06 Name: y, dtype: object

which unfortunately might imply that we have to independently verify that all of the entries within are datetime objects. Or should we be more proactive and just convert these to NumPy datetime objects rather than Python date objects?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/rstudio/reticulate/issues/160#issuecomment-370973482, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGXx01jN_qX7nJBewPz2pMfhgth72iXks5tbyNagaJpZM4SMdrA .

DavisVaughan commented 6 years ago

@kevinushey, not as far as I know. It seems pretty difficult to extract any useful information besides that object dtype unless you dig into the individual elements. You seem to have the same problem with character vectors though (shown as objects on the python side), and py_ref_to_r() seems to handle that at the C++ level. Not sure if you can do a similar thing here, or if you want to try a different route.

library(reticulate)
ex <- data.frame(x = "hello", y = Sys.Date(), z = 1, stringsAsFactors = FALSE)
ex_py <- r_to_py(ex)

ex_py$x$dtype
#> object
ex_py$y$dtype
#> object
ex_py$z$dtype
#> float64

class(ex_py$y[[0]])
#> [1] "datetime.date"         "python.builtin.object"

As far as proactively converting Date columns inside a data.frame to NumPy datetime64[ns] columns inside a Pandas DF, I'm not against that. feather does this, as they didn't want to use datetime.date at all because of performance. It probably means you won't be able to go R->Pandas->R and retain the Date columns, but I'd be alright with that. Also, as noted earlier, using a POSIXct column in the data.frame is already forced over to datetime64[ns] so it's not like you're using datetime.datetime right now with that either.

DavisVaughan commented 6 years ago

Two more date based issues.

1) Dropping of fractional seconds

library(reticulate)

ex <- data.frame(y = Sys.time())

ex$y <- lubridate::with_tz(ex$y, "UTC") # Dont think about time zones right now
ex
#>                     y
#> 1 2018-03-07 16:39:15

options(digits.secs = 6) # show fractional seconds
ex
#>                            y
#> 1 2018-03-07 16:39:15.372265

ex_py <- r_to_py(ex)
py_to_r(ex_py) # fractional seconds are dropped
#>                     y
#> 1 2018-03-07 16:39:15

I think you could support them by changing the recent commit of:

else if (inherits(column, "POSIXt")) {
   np_array(as.numeric(column), dtype = "datetime64[s]")
}

to:

else if (inherits(column, "POSIXt")) {
   np_array(as.numeric(column) * 1e9, dtype = "datetime64[ns]")
}

2) Support for time zones (Pandas defaults to UTC)

This one is likely tricky

Update: I see in py_to_r.numpy.ndarray that you explicitly set the tz from Pandas to R to UTC. Update2: This looks difficult as it seems like datetime64[ns] objects themselves might not actually retain a timezone "attribute" and only print based on the user's locale? But it looks like if it was set as the actual index like DatetimeIndex then you can store a tz attribute on that.

library(reticulate)

ex <- data.frame(y = Sys.time())

ex
#>                     y
#> 1 2018-03-07 11:55:01

# I live on the East coast
# This is what R uses if I don't specify a tz
ex$y
#> [1] "2018-03-07 11:55:01 EST"

# Pandas defaults to UTC
ex_py <- r_to_py(ex)
ex_py
#>                     y
#> 0 2018-03-07 16:55:01

# And we get UTC back
ex_back_to_r <- py_to_r(ex_py)
ex_back_to_r
#>                     y
#> 1 2018-03-07 16:55:01

ex_back_to_r$y
#> [1] "2018-03-07 16:55:01 UTC"
kevinushey commented 6 years ago

If feather goes straight to datetime64[ns] then I think we can feel free to do that as well.

I'll take a look at these today as well -- thanks for the testing!

DavisVaughan commented 6 years ago

Here is your proof and reasoning of feather going straight to datetime64[ns] just so you can take a look (I posted this way earlier but easily missable). Thanks for the hard work! https://github.com/wesm/feather/issues/330

kevinushey commented 6 years ago

Thanks! We now try to use datetime64[ns] throughout rather than Python's own datetime class.

For timezones, I'm curious what your opinion on the 'right' behavior is. My understanding is that timezones are only used for printing and formatting -- the underlying representation of the time (a POSIX timestamp) stays the same.

Should we try to preserve the UTC timezone as used by Python? Or just drop the attribute and print with local time?

DavisVaughan commented 6 years ago

tl;dr) I'd go with preserving UTC when coming from Python.

You are correct with stating that the underlying representation stays the same, and timezones are basically metadata for printing here.

I can come up with 2 main problems that you'll have to deal with off the top of my head.

1) The user has not set a tz 2) The user has set a tz

In both cases, at some point in the roundtrip the tz won't be correct (unless their local time is UTC or that is what they set as the tz). However, if we were to pick the "worst of two evils" for either preserving the UTC tz on the way back from Python or dropping the attribute and using local time, I think I'd go with the more predictable former of preserving UTC. I think that is the best case because there is a chance that if we leave it up to local time then the user could see 3 different time zones on a round trip (see Problem 2 in the code below). I think that would be more frustrating then just documenting that UTC is used for Pandas and just making them deal with that.

These problems mainly stem from the fact that it seems like we can't pass over the tz info to Pandas. It just seems like the datetime64[ns] type doesn't "store" the tz unless it is a DatetimeIndex, but I could be wrong.

``` r library(reticulate) ###### Problem 1 - User has not set a tz (no tzone attribute) # The user first sees whatever their local time is, # then Python uses UTC, then you set UTC on the way back to R ex <- data.frame(y = Sys.time()) ex$y #> [1] "2018-03-07 17:21:29 EST" ex_py <- r_to_py(ex) ex_py #> y #> 0 2018-03-07 22:21:29.899784192 ex_back_to_r <- py_to_r(ex_py) ex_back_to_r$y #> [1] "2018-03-07 22:21:29 UTC" # Potential solution: Don't add the UTC tz # This would result in Python being "wrong" but the roundtrip # back to R staying correct attr(ex_back_to_r$y, "tzone") <- NULL ex_back_to_r #> y #> 1 2018-03-07 17:21:29 ###### Problem 2 - User has set a tz ex <- data.frame(y = Sys.time()) attr(ex$y, "tzone") <- "US/Alaska" ex$y #> [1] "2018-03-07 13:21:30 AKST" # There is no way to get the time zone over to Python ex_py <- r_to_py(ex) ex_py #> y #> 0 2018-03-07 22:21:30.692370944 ex_back_to_r <- py_to_r(ex_py) ex_back_to_r$y #> [1] "2018-03-07 22:21:30 UTC" # Potential solution: Don't add the UTC tz # This would result in 3 different time zones!! # R) US/Alaska, Python) UTC, R) Local time (America/New_York) attr(ex_back_to_r$y, "tzone") <- NULL ex_back_to_r #> y #> 1 2018-03-07 17:21:30 ```
jjallaire commented 6 years ago

@kevinushey Can we close this issue?

kevinushey commented 6 years ago

I think it can be closed. Thanks for all the feedback @DavisVaughan, please let us know if you bump into more issues!