Open colearendt opened 5 years ago
Related to #265 ? Although a bit different because I am talking about native integer types, which differ in Python to R and thus do not have a one-to-one mapping (i.e. multiple Python integers map to the same R integer).
py_to_r()
uses a fully-qualified Python class name for dispatch, e.g.
library(reticulate)
py_run_string("x = 123")
main <- import_main(convert = FALSE)
class(main$x)
#> [1] "python.builtin.int" "python.builtin.object"
py_to_r.python.builtin.int <- function(x) {
message("hello!")
reticulate:::py_to_r.default(x)
}
py$x
#> hello!
#> [1] 123
Created on 2018-07-25 by the reprex package (v0.2.0).
That said, I'm not sure what the best solution here is. I think that whatever is done it should be stable and consistent (ie whether you get an R integer vector or a bit64
vector or whatever should be consistent and not depend on the values of the data you're transforming)
Maybe py_to_r()
could accept a list of converters, which we would dispatch to in preference to the built-in methods for conversion? E.g.
# user-defined conversion functions
converters <- list(python.builtin.int = function(x) { <makes a bit64 vector> })
# would call user-defined converter in preference to built-in one
py_to_r(py$x, converters = converters)
This way, at least the user could request a specific conversion type when the default doesn't suffice. Will have to think about this a bit more.
It looks like there is actually some inconsistency in how we handle integer type conversion. For scalars, integers are always converted to integers:
https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L529-L553
So for PyLong scalars (which are probably 32-bits on 32-bit platforms and 64-bits on 64-bit platforms?) we will always coerce to R 32 bit integers, which has potential for lossage.
However, for NumPy arrays we are careful to convert wider integer types to double:
https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L191-L238
So if we could know that data was a 64-bit integer we could explicitly force it to R double (as we already do for NumPy)
Was just on my way here to submit a similar issue. Adding in another example:
library(reticulate)
boto3 <- import("boto3")
session <- boto3$session$Session(profile_name = "x")
client <- session$client("athena")
res <- client$get_query_execution(QueryExecutionId = "x")
res$QueryExecution$Statistics$DataScannedInBytes
## 1814623667
vs
import boto3
session = boto3.session.Session(profile_name = "x")
client = session.client("athena")
res = client.get_query_execution(QueryExecutionId = "x")
res["QueryExecution"]["Statistics"]["DataScannedInBytes"]
## 104893838771
Caught it when I was trying to write a cost estimator. It's an int
in the dict
FWIW using a double is also dangerous since that only represents values up to 2^53 without loss of precision and that's definitely caused issues in the past (e.g. database keys overflowing storage).
From what I understand, Python has two types for representing integers:
int
, which is either a 32-bit or 64-bit signed integer type (depending on architecture),long
, which is an arbitrary precision Python integer-y type.E.g. on my 64bit machine:
>>> type(2 ** 62)
<type 'int'>
>>> type(2 ** 63)
<type 'long'>
I'm not sure what the right solution here is :-/ In theory converting Python int
s to R integer64
vectors would work but I'm not sure how well these function as stand-in replacement for native int
s (and I'm guessing there's lots of caveats in terms of what you can / can't do with integer64
vectors)
Aye.
A "for instance" is that ggplot2
hates integer64
's (I get bit by that alot with odbc
→dbplyr
query results from Athena.
But, now I'm rly reticent abt doing a great deal with reticulate
outside of numpy
"datasci" contexts without writing bridge modules for anything that returns numeric values that could cause this behavior (which really means I'd forego any use of R and just use Python to reduce complexity).
At the very least this current, "errant" behaviour (IMO) needs some top-level emphasis outside of a GH issue since I think quite a few folks assume it "just works" (like I had).
Honestly, the integer64
conversion should be fine. It's at least a proper/honest/accurate value then and any of us that work with "big numbers" already have our workarounds for when/how we use them. And, it'll be a good introduction to this issue for others who may not really grok it or have come across it.
So would the proposal be for us to depend on the integer64
package and just return those objects?
Even though some packages might not like seeing integer64
it certainly seems like a consideration that ultimately needs to be handled (this might force the issue a bit, which could be a good thing).
I think that's reasonable.
I also think that if we make this choice we might want to have a more global conversation about making bit64
feel more like a first-class citizen in e.g. the tidyverse and friends. That is, if we adopt integer64
here we should consider making that a more global decision and perhaps take an active role in smoothing over any rough edges that might come up in using integer64
vectors (when at all possible).
It's popping up in a few places. My experience (from building nanotime
around it) is pretty positive.
integer64
also has Hadley's blessing in the tidyverse so I think this is something we should consider exploring.
There is one thing to be aware of for reticulate
. Under the hood, integer64
objects are actually REALSXP
s, so we'd need to implement special handling for that class of objects so that our existing low-level coercion code doesn't do the wrong thing. (I think this decision was motivated by the fact that double
s are typically 64-bit, so stuffing a 64-bit integer into the same memory space is a sneaky but effective solution.)
Also worth noting that there are some cases where integer64
gives different results than expected; e.g.
> mean(as.integer64(1:2))
integer64
[1] 1
That said, I think it's overall worth taking this on assuming we don't step on any particularly egregious land mines in the switch over.
Consider
> mean(as.integer64(1:3))
integer64
[1] 2
>
though -- it "simply" insists that its type sticks. Might be an issue here, but as I recall, "documented" over there.
Right; bit64
even warns quite plainly on load:
WARNING semantics differ from integer
so while most things behave the same as one would expect for R integers, this isn't true for all operations -- so we definitely need to keep in mind that if we substitute integer64
for R's integer type that we may well run into trouble.
The CRAN page (https://cran.r-project.org/web/packages/bit64/index.html) even states:
WARNING: do not use them as replacement for 32bit integers, integer64 are not supported for subscripting by R-core and they have different semantics when combined with double, e.g. integer64 + double => integer64.
The lack of subset means that e.g. this doesn't work as expected:
suppressPackageStartupMessages(library(bit64))
df <- data.frame(x = 1, y = 2, z = 3)
df[as.integer64(2)]
#> data frame with 0 columns and 1 row
and I think this will be problematic :-/
I think we could get something a little bit better here with ALTREP. By default, we would allow the underlying representation to remain as 64-bit integers, but convert to R's own 32-bit integers as needed for e.g. indexing and so on. Other mathematical operations could use regular old S3 dispatch (as integer64
does) to perform the right kind of manipulations on the underlying 64-bit data.
For what it's worth, my issue is also related to representing database keys, which are normally stored as long / bigint. One thing I have been using as a workaround is to represent the key as a string, then use reticulate to pull it in, and then finally cast that to an integer64
So I have been trying to figure out a workaround for this one, but it seems like the python.builtin.int
converter is "cached" within dict, somehow? Is that accurate? Is there some incantation to make my python.builtin.int
converter global (i.e. apply even for children)?
library(reticulate)
py_to_r.python.builtin.int <- function(x) {
message("hello!")
x <- as.character(x)
x <- bit64::as.integer64(x)
}
reticulate::py_run_string('obj = 2435823760;')
py$obj
#> hello!
#> integer64
#> [1] 2435823760
reticulate::py_run_string('obj2 = {"key": 2435823760};')
py$obj2
#> $key
#> [1] -1859143536
# not a different class?
main <- import_main(convert = FALSE)
class(main$obj2)
#> [1] "python.builtin.dict" "python.builtin.object"
class(main$obj2$key)
#> [1] "python.builtin.int" "python.builtin.object"
# but now it fires?
main$obj2
#> hello!
#> {'key': 2435823760}
main$obj2$key
#> 2435823760
# just not on the py$ object
py$obj2$key
#> [1] -1859143536
Created on 2019-01-27 by the reprex package (v0.2.1)
This is because the default method py_to_r.default
calls py_ref_to_r
which does its own sort of conversion behind the scenes separate from the front-end S3 methods.
We could implement a py_to_r.python.builtin.dict
method but I think there's some worry it would be slow on very large dictionaries.
Interesting. Thanks Kevin. Is there any conceivable way for a user to work-around or modify this type of behavior? I.e. short of just rewriting all of these methods? I ask b/c I am working with an OOM in this case and have no qualms about working around the issue, but it feels like I'm sort of stuck atm. I tried poking at a dict method, but I felt very much out of my depth and it wasn't clear if that would even help :smile:
Maybe it would be just importing the main manually with convert = FALSE
?
I don't think there's any way to override this right now (short of registering your own python.builtin.dict
S3 method, which I don't think is a great idea)
Any further thinking on this one and what the best long-term approach is? I would love to have a nice way to resolve this, if possible 😄
I fear that because of the lack of binary representation on the R side you always need a special case / trick. R after all only has
int
at 32bit which we cannot use for lack of rangedouble
which has commensurate range (and more) but not precisionWe have the same problem with nanotime
which is "merely" between C(++) and R. There, we use double
as the "payload carrier" (as we need 64 bits to get the int64
across) but then once in R hand the 64 bits over to bit64
to interpret them correctly for us (as nanoseconds since epoch, getting us the full 19 (iirc) digits precision) we cannot get from double
.
We talked about an int64 backend for vctrs. It should work much better than bit64, for instance with mean()
. Split-combine operations such as vec_rbind()
, pmap()
etc will also do the right thing.
How might one go about implementing the soln that @eddelbuettel uses in nanotime? Lack of 64bit conversion all but prevents me from using reticulate daily
Took a swing at it @ #590.
I hope it is helpful for people who need a short term workaround while waiting #590 merged.
In general, we can convert a python object to JSON, then transfer it to R. It is not an idea solution for super long integer that require special handling to parse JSON string as discussed in https://stackoverflow.com/questions/65009389/parse-json-with-big-integer-in-r.
x_py <- reticulate::py_run_string('obj = [2435823760, 123, ["abc", 1234567890987654321]];', convert = FALSE)
x_json <- json[["dumps"]](x_py$obj)
jsonlite::fromJSON(x_json, simplifyVector = FALSE)
[[1]]
[1] 2435823760
[[2]]
[1] 123
[[3]]
[[3]][[1]]
[1] "abc"
[[3]][[2]]
[1] 1.234568e+18
I tried to trace
py_to_r
for integer conversion, but I have struggled to do so. I'm not sure whether this is something that could be implemented inreticulate
, or if I should just implement some type of S3 workaround in my code.In any case, I thought it would be desirable for reticulate to convert python integers that do not fit into 32bit integers into a 64-bit analog in R (i.e. the
bit64
package). Not sure if we are waiting for R to adopt a standard here - the Python conversion is super smooth for the objects I am working with minus these IDs that get destroyed by the 32bit issue.If I were to implement a workaround, overriding with a string conversion would work fine for IDs too. It looked like
int
dispatch may go throughpy_to_r.default
? Would a workaround implement a custompy_to_r.int
function? I may be missing something about how this should work.