Closed DavisVaughan closed 6 years ago
@kevinushey Added this code to support custom marshaling of Pandas data frames (and other object types, BTW dates are on our short term roadmap!).
Kevin, do you recall the purpose of the attributes(wrapper) <- attributes(x)
line? Is this something that should be within the default py_to_r_wrapper
implementation as opposed to used for all conversions?
Happy to hear Pandas data frames are on the horizon, it will be nice to not have to go through feather
like I did with featherframe
recently. Since a number of libraries use Pandas DataFrames and Series, this will be really useful.
Dates should be interesting, as Pandas doesn't have a "native" non-timezone aware "Date" class, and the datetime.date
alternative is orders of magnitudes slower. See this discussion.
Thanks for the hard work.
I see this is the commit where we added it:
https://github.com/rstudio/reticulate/commit/9efee53fa95e99b799fc7728871389a79369a48b
My guess is that we did this to ensure some attributes that typically get dropped (e.g. names?) are restored on the copied object, but maybe that was misguided?
Actually we need the attributes preserved there. The purpose of py_to_r_wrapper
is to allow customization of the wrapped object (as opposed to wholesale replacement of the default implementation). I think what @DavisVaughan is looking for might be contained in your pandas PR: https://github.com/rstudio/reticulate/pull/113. This PR makes py_to_r
an S3 method which would allow wholesale replacement of the conversion for an object type.
Looks promising. Having py_to_r()
as the generic feels more natural too.
# devtools::install_github("rstudio/reticulate", ref = devtools::github_pull("113"))
library(reticulate)
pend <- import("pendulum")
# Accessors
pend_year <- function(x) {
x$year
}
pend_month <- function(x) {
x$month
}
pend_day <- function(x) {
x$day
}
# Wrapper for conversion
py_to_r.pendulum.date.Date <- function(x) {
chr_x <- paste(pend_year(x), pend_month(x), pend_day(x), sep = "-")
as.Date(chr_x)
}
pend$Date(2013L, 01L, 01L)
#> [1] "2013-01-01"
Created on 2018-02-21 by the reprex package (v0.1.1.9000).
We've just merged support for converting to and from Pandas data frames onto master. Would love it if anyone subscribed to this thread could test it out. You can install with:
devtools::install_github("rstudio/reticulate")
If you want me to move these comments to a more appropriate place let me know. Just a few bugs I'm running into immediately.
1) inconsistent (?) behavior when converting data.frames with 1 element
Related to this on the python side, not sure if its anything you guys can do: https://stackoverflow.com/questions/17839973/construct-pandas-dataframe-from-values-in-variables
library(reticulate)
ex <- data.frame(y = Sys.time())
r_to_py(ex)
#> Error in py_call_impl(callable, dots$args, dots$keywords): ValueError: If using all scalar values, you must pass an index
#>
#> Detailed traceback:
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 900, in from_dict
#> return cls(data, index=index, columns=columns, dtype=dtype)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 330, in __init__
#> mgr = self._init_dict(data, index, columns, dtype=dtype)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 461, in _init_dict
#> return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 6130, in _arrays_to_mgr
#> index = extract_index(arrays)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 6169, in extract_index
#> raise ValueError('If using all scalar values, you must pass'
ex2 <- data.frame(x = 1)
r_to_py(ex2)
#> x
#> 0 1.0
ex3 <- data.frame(x = "hi")
r_to_py(ex3)
#> Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: Index(...) must be called with a collection of some kind, 'hi' was passed
#>
#> Detailed traceback:
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/categorical.py", line 281, in __init__
#> dtype = CategoricalDtype(categories, ordered)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 154, in __init__
#> self._finalize(categories, ordered, fastpath=False)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 181, in _finalize
#> fastpath=fastpath)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 319, in _validate_categories
#> categories = Index(categories, tupleize_cols=False)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 355, in __new__
#> cls._scalar_data_error(data)
#> File "/usr/local/opt/python3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 710, in _scalar_data_error
#> repr(data)))
2) Conversion from datetime.date -> Date issue
library(reticulate)
library(lubridate)
ex <- data.frame(x = 4, y = Sys.Date())
ex_py <- r_to_py(ex)
ex_py$y$values
#> [datetime.date(2018, 3, 6)]
py_to_r(ex_py)
#> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class "c("datetime.date", "python.builtin.object")" to a data.frame
# Seems like py_ref_to_r didn't actually convert
# the datetime.date over
item <- list(ex_py[["y"]]$as_matrix())
names(item) <- "y"
converted <- reticulate:::py_ref_to_r(dict(item))
class(converted$y[[1]])
#> [1] "datetime.date" "python.builtin.object"
3) You're going to need datetime64[ns]
support immediately
Since this has to do with py_ref_to_r()
, I assume you'll need to support this from the C level Numpy API.
library(reticulate)
ex <- data.frame(x = 4, y = Sys.time())
y_py <- r_to_py(ex$y)
class(y_py)
#> [1] "datetime.datetime" "datetime.date" "python.builtin.object"
# The entire data frame
# Individually the y column is converted to datetime.date
# but as soon as it is further passed to pd$DataFrame it is converted
# over to datetime64[ns] by Pandas
ex_py <- r_to_py(ex)
ex_py$y
#> 0 2018-03-06 17:52:34.809640
#> Name: y, dtype: datetime64[ns]
py_to_r(ex_py)
#> Error in py_ref_to_r(dict(items)): Conversion from numpy array type 21 is not supported
@jjallaire is there a way to indicate that a length-one vector should be serialized as a vector rather than a scalar? (should reticulate
have some helper functions for marking this? E.g. in RStudio we have .rs.scalar()
to indicate that something should be unboxed as a scalar rather than a length-one vector; should we have a similar analog / escape-hatch in reticulate?)
@DavisVaughan thanks for testing and reporting these! These should mostly be fixed up now (except for the example with Sys.Date()
).
Are you aware of whether there's an easy way to detect if a Pandas column is a 'Date' column? E.g. all I see is:
> ex_py$y
0 2018-03-06
Name: y, dtype: object
which unfortunately might imply that we have to independently verify that all of the entries within are datetime objects. Or should we be more proactive and just convert these to NumPy datetime
objects rather than Python date objects?
Currently we do the automatic conversion to scalar for length 1 then tell users to use e.g. list(1) to indicate they want to preserve the vector/list treatment. I think defaulting to the scalar conversion is the only reasonable choice for end users (since the vast majority of function parameters end up being scalars) On Tue, Mar 6, 2018 at 4:01 PM Kevin Ushey notifications@github.com wrote:
@DavisVaughan https://github.com/davisvaughan thanks for testing and reporting these! These should mostly be fixed up now (except for the example with Sys.Date()).
Are you aware of whether there's an easy way to detect if a Pandas column is a 'Date' column? E.g. all I see is:
ex_py$y 0 2018-03-06 Name: y, dtype: object
which unfortunately might imply that we have to independently verify that all of the entries within are datetime objects. Or should we be more proactive and just convert these to NumPy datetime objects rather than Python date objects?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/rstudio/reticulate/issues/160#issuecomment-370973482, or mute the thread https://github.com/notifications/unsubscribe-auth/AAGXx01jN_qX7nJBewPz2pMfhgth72iXks5tbyNagaJpZM4SMdrA .
@kevinushey, not as far as I know. It seems pretty difficult to extract any useful information besides that object
dtype unless you dig into the individual elements. You seem to have the same problem with character vectors though (shown as object
s on the python side), and py_ref_to_r()
seems to handle that at the C++ level. Not sure if you can do a similar thing here, or if you want to try a different route.
library(reticulate)
ex <- data.frame(x = "hello", y = Sys.Date(), z = 1, stringsAsFactors = FALSE)
ex_py <- r_to_py(ex)
ex_py$x$dtype
#> object
ex_py$y$dtype
#> object
ex_py$z$dtype
#> float64
class(ex_py$y[[0]])
#> [1] "datetime.date" "python.builtin.object"
As far as proactively converting Date
columns inside a data.frame to NumPy datetime64[ns]
columns inside a Pandas DF, I'm not against that. feather does this, as they didn't want to use datetime.date
at all because of performance. It probably means you won't be able to go R->Pandas->R and retain the Date
columns, but I'd be alright with that. Also, as noted earlier, using a POSIXct
column in the data.frame is already forced over to datetime64[ns]
so it's not like you're using datetime.datetime
right now with that either.
Two more date based issues.
1) Dropping of fractional seconds
library(reticulate)
ex <- data.frame(y = Sys.time())
ex$y <- lubridate::with_tz(ex$y, "UTC") # Dont think about time zones right now
ex
#> y
#> 1 2018-03-07 16:39:15
options(digits.secs = 6) # show fractional seconds
ex
#> y
#> 1 2018-03-07 16:39:15.372265
ex_py <- r_to_py(ex)
py_to_r(ex_py) # fractional seconds are dropped
#> y
#> 1 2018-03-07 16:39:15
I think you could support them by changing the recent commit of:
else if (inherits(column, "POSIXt")) {
np_array(as.numeric(column), dtype = "datetime64[s]")
}
to:
else if (inherits(column, "POSIXt")) {
np_array(as.numeric(column) * 1e9, dtype = "datetime64[ns]")
}
2) Support for time zones (Pandas defaults to UTC)
This one is likely tricky
Update: I see in py_to_r.numpy.ndarray
that you explicitly set the tz from Pandas to R to UTC
.
Update2: This looks difficult as it seems like datetime64[ns]
objects themselves might not actually retain a timezone "attribute" and only print based on the user's locale? But it looks like if it was set as the actual index like DatetimeIndex
then you can store a tz
attribute on that.
library(reticulate)
ex <- data.frame(y = Sys.time())
ex
#> y
#> 1 2018-03-07 11:55:01
# I live on the East coast
# This is what R uses if I don't specify a tz
ex$y
#> [1] "2018-03-07 11:55:01 EST"
# Pandas defaults to UTC
ex_py <- r_to_py(ex)
ex_py
#> y
#> 0 2018-03-07 16:55:01
# And we get UTC back
ex_back_to_r <- py_to_r(ex_py)
ex_back_to_r
#> y
#> 1 2018-03-07 16:55:01
ex_back_to_r$y
#> [1] "2018-03-07 16:55:01 UTC"
If feather
goes straight to datetime64[ns]
then I think we can feel free to do that as well.
I'll take a look at these today as well -- thanks for the testing!
Here is your proof and reasoning of feather
going straight to datetime64[ns]
just so you can take a look (I posted this way earlier but easily missable). Thanks for the hard work!
https://github.com/wesm/feather/issues/330
Thanks! We now try to use datetime64[ns]
throughout rather than Python's own datetime class.
For timezones, I'm curious what your opinion on the 'right' behavior is. My understanding is that timezones are only used for printing and formatting -- the underlying representation of the time (a POSIX timestamp) stays the same.
Should we try to preserve the UTC timezone as used by Python? Or just drop the attribute and print with local time?
tl;dr) I'd go with preserving UTC when coming from Python.
You are correct with stating that the underlying representation stays the same, and timezones are basically metadata for printing here.
I can come up with 2 main problems that you'll have to deal with off the top of my head.
1) The user has not set a tz 2) The user has set a tz
In both cases, at some point in the roundtrip the tz won't be correct (unless their local time is UTC or that is what they set as the tz). However, if we were to pick the "worst of two evils" for either preserving the UTC tz on the way back from Python or dropping the attribute and using local time, I think I'd go with the more predictable former of preserving UTC. I think that is the best case because there is a chance that if we leave it up to local time then the user could see 3 different time zones on a round trip (see Problem 2 in the code below). I think that would be more frustrating then just documenting that UTC is used for Pandas and just making them deal with that.
These problems mainly stem from the fact that it seems like we can't pass over the tz info to Pandas. It just seems like the datetime64[ns]
type doesn't "store" the tz unless it is a DatetimeIndex
, but I could be wrong.
@kevinushey Can we close this issue?
I think it can be closed. Thanks for all the feedback @DavisVaughan, please let us know if you bump into more issues!
I'm playing around with the
pendulum
python library, and wanted to write a wrapper for apendulum
Date
that transformed it into an RDate
.I discovered
py_to_r()
and the underlying S3,py_to_r_wrapper
and tried something like:Which nearly works, but RStudio crashes. The reason is because of
py_to_r()
copying over all the attributes from the original python object. It replaces the RDate
class with all of the python classes. Commenting out the attribute copying line removes the problem.I assume this is not intended? Or am I just doing something wrong?
Created on 2018-02-20 by the reprex package (v0.1.1.9000).