rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Converting non-32-bit integers from python to R to use bit64 #323

Open colearendt opened 5 years ago

colearendt commented 5 years ago

I tried to trace py_to_r for integer conversion, but I have struggled to do so. I'm not sure whether this is something that could be implemented in reticulate, or if I should just implement some type of S3 workaround in my code.

In any case, I thought it would be desirable for reticulate to convert Python integers that do not fit into a 32-bit integer into a 64-bit analog in R (i.e. via the bit64 package). I'm not sure whether we are waiting for R to adopt a standard here - the Python conversion is super smooth for the objects I am working with, except for these IDs, which get destroyed by the 32-bit truncation.

library(reticulate)                           
reticulate::py_run_string('obj = 2435823760;')
py$obj                                        
#> [1] -1859143536

# admittedly, this does not seem to be firing
# ignorance is likely the problem
py_to_r.int <- function(x){
    message('firing')
    NextMethod(generic="py_to_r", object = x)
}
py$obj                                        
#> [1] -1859143536

If I were to implement a workaround, overriding with a string conversion would work fine for IDs too. It looks like int dispatch may go through py_to_r.default? Would a workaround involve a custom py_to_r.int method? I may be missing something about how this should work.

colearendt commented 5 years ago

Related to #265? Although this is a bit different, because I am talking about native integer types, which differ between Python and R and thus do not have a one-to-one mapping (i.e. multiple Python integers map to the same R integer).

kevinushey commented 5 years ago

py_to_r() uses a fully-qualified Python class name for dispatch, e.g.

library(reticulate)
py_run_string("x = 123")
main <- import_main(convert = FALSE)
class(main$x)
#> [1] "python.builtin.int"    "python.builtin.object"

py_to_r.python.builtin.int <- function(x) {
  message("hello!")
  reticulate:::py_to_r.default(x)
}

py$x
#> hello!
#> [1] 123
Created on 2018-07-25 by the reprex package (v0.2.0).

That said, I'm not sure what the best solution here is. I think that whatever is done, it should be stable and consistent (i.e. whether you get an R integer vector or a bit64 vector should not depend on the values of the data you're transforming).

Maybe py_to_r() could accept a list of converters, which we would dispatch to in preference to the built-in methods for conversion? E.g.

# user-defined conversion functions
converters <- list(
  python.builtin.int = function(x) bit64::as.integer64(as.character(x))  # makes a bit64 vector
)

# would call user-defined converter in preference to built-in one
py_to_r(py$x, converters = converters)

This way, at least the user could request a specific conversion type when the default doesn't suffice. Will have to think about this a bit more.

jjallaire commented 5 years ago

It looks like there is actually some inconsistency in how we handle integer type conversion. For scalars, Python integers are always converted to R integers:

https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L529-L553

So for PyLong scalars (which are probably 32-bit on 32-bit platforms and 64-bit on 64-bit platforms?) we will always coerce to 32-bit R integers, which has the potential for data loss.

However, for NumPy arrays we are careful to convert wider integer types to double:

https://github.com/rstudio/reticulate/blob/master/src/python.cpp#L191-L238

So if we could know that the data was a 64-bit integer, we could explicitly force it to an R double (as we already do for NumPy).
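
For illustration, a minimal reprex of that asymmetry (assumes NumPy is installed; variable names are just for the example):

library(reticulate)
np <- import("numpy")

# a 64-bit NumPy array takes the "wide integer -> double" path, so the value survives
np$array(c(2435823760, 1), dtype = "int64")
#> [1] 2435823760          1

# ...while a bare Python int scalar goes through the 32-bit scalar coercion
py_run_string("x = 2435823760")
py$x
#> [1] -1859143536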

hrbrmstr commented 5 years ago

Was just on my way here to submit a similar issue. Adding in another example:

library(reticulate)

boto3 <- import("boto3")
session <- boto3$session$Session(profile_name = "x")
client <- session$client("athena")
res <- client$get_query_execution(QueryExecutionId = "x")

res$QueryExecution$Statistics$DataScannedInBytes
## 1814623667

vs

import boto3

session = boto3.session.Session(profile_name = "x")
client = session.client("athena")
res = client.get_query_execution(QueryExecutionId = "x")

res["QueryExecution"]["Statistics"]["DataScannedInBytes"]
## 104893838771

Caught it when I was trying to write a cost estimator. It's a plain int in the returned dict.

kevinushey commented 5 years ago

FWIW, using a double is also dangerous, since a double only represents integers up to 2^53 without loss of precision, and that's definitely caused issues in the past (e.g. database keys overflowing their storage type).
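
That boundary is easy to demonstrate in plain R:

# past 2^53, doubles can no longer distinguish adjacent integers
2^53 == 2^53 + 1
#> [1] TRUE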

From what I understand, Python 2 has two types for representing integers: a fixed-width int and an arbitrary-precision long (Python 3 unifies these into a single arbitrary-precision int).

E.g. on my 64-bit machine:

>>> type(2 ** 62)
<type 'int'>
>>> type(2 ** 63)
<type 'long'>

I'm not sure what the right solution here is :-/ In theory, converting Python ints to R integer64 vectors would work, but I'm not sure how well these function as a stand-in replacement for native integers (and I'm guessing there are lots of caveats in terms of what you can / can't do with integer64 vectors).

hrbrmstr commented 5 years ago

Aye.

A "for instance" is that ggplot2 hates integer64's (I get bit by that alot with odbcdbplyr query results from Athena.

But now I'm really reticent about doing a great deal with reticulate outside of NumPy "datasci" contexts without writing bridge modules for anything that returns numeric values that could trigger this behaviour (which really means I'd forgo any use of R and just use Python, to reduce complexity).

At the very least, this current "errant" behaviour (IMO) needs some top-level emphasis outside of a GH issue, since I think quite a few folks assume it "just works" (as I had).

Honestly, the integer64 conversion should be fine. At least it's a proper/honest/accurate value then, and any of us who work with "big numbers" already have our workarounds for when/how we use them. And it'll be a good introduction to the issue for others who may not really grok it or haven't come across it.

jjallaire commented 5 years ago

So would the proposal be for us to depend on the bit64 package and just return integer64 objects?

Even though some packages might not like seeing integer64, it certainly seems like a consideration that ultimately needs to be handled (this might force the issue a bit, which could be a good thing).

kevinushey commented 5 years ago

I think that's reasonable.

I also think that if we make this choice we might want to have a more global conversation about making bit64 feel more like a first-class citizen in e.g. the tidyverse and friends. That is, if we adopt integer64 here we should consider making that a more global decision and perhaps take an active role in smoothing over any rough edges that might come up in using integer64 vectors (when at all possible).

eddelbuettel commented 5 years ago

It's popping up in a few places. My experience (from building nanotime around it) is pretty positive.

kevinushey commented 5 years ago

integer64 also has Hadley's blessing in the tidyverse so I think this is something we should consider exploring.

There is one thing to be aware of for reticulate. Under the hood, integer64 objects are actually REALSXPs, so we'd need to implement special handling for that class of objects so that our existing low-level coercion code doesn't do the wrong thing. (I think this decision was motivated by the fact that doubles are typically 64-bit, so stuffing a 64-bit integer into the same memory space is a sneaky but effective solution.)
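
For example (the value here is arbitrary; it just needs more than 32 bits):

library(bit64)
x <- as.integer64("1234567890123456789")
class(x)
#> [1] "integer64"
typeof(x)   # the 64-bit payload lives in a double (REALSXP) slot
#> [1] "double"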

Also worth noting that there are some cases where integer64 gives different results than expected; e.g.

> mean(as.integer64(1:2))
integer64
[1] 1

That said, I think it's overall worth taking this on, assuming we don't step on any particularly egregious land mines in the switchover.

eddelbuettel commented 5 years ago

Consider

> mean(as.integer64(1:3))
integer64
[1] 2
> 

though -- it "simply" insists that its type sticks. That might be an issue here but, as I recall, it's "documented" over there.

kevinushey commented 5 years ago

Right; bit64 even warns quite plainly on load:

WARNING semantics differ from integer

so while most things behave as one would expect for R integers, this isn't true for all operations -- so we definitely need to keep in mind that if we substitute integer64 for R's integer type, we may well run into trouble.

kevinushey commented 5 years ago

The CRAN page (https://cran.r-project.org/web/packages/bit64/index.html) even states:

WARNING: do not use them as replacement for 32bit integers, integer64 are not supported for subscripting by R-core and they have different semantics when combined with double, e.g. integer64 + double => integer64.
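
A quick illustration of that combine rule (my understanding of bit64's coercion: the double operand is coerced to integer64 first, so the fractional part is dropped):

library(bit64)
as.integer64(1) + 0.5
#> integer64
#> [1] 1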

kevinushey commented 5 years ago

The lack of subscripting support means that e.g. this doesn't work as expected:

suppressPackageStartupMessages(library(bit64))
df <- data.frame(x = 1, y = 2, z = 3)
df[as.integer64(2)]
#> data frame with 0 columns and 1 row

and I think this will be problematic :-/

kevinushey commented 5 years ago

I think we could get something a little bit better here with ALTREP. By default, we would allow the underlying representation to remain 64-bit integers, but convert to R's own 32-bit integers as needed for indexing and the like. Other mathematical operations could use regular old S3 dispatch (as integer64 does) to perform the right kind of manipulations on the underlying 64-bit data.

jeffwong-nflx commented 5 years ago

For what it's worth, my issue is also related to representing database keys, which are normally stored as long / bigint. One workaround I have been using is to represent the key as a string, pull it into R via reticulate, and then cast it to integer64.
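
A minimal sketch of that round trip (the Python variable name is illustrative):

library(reticulate)
py_run_string("key_str = str(104893838771)")  # stringify on the Python side
bit64::as.integer64(py$key_str)               # exact on the R side
#> integer64
#> [1] 104893838771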

colearendt commented 5 years ago

So I have been trying to figure out a workaround for this one, but it seems like the python.builtin.int converter is "cached" within dict somehow? Is that accurate? Is there some incantation to make my python.builtin.int converter global (i.e. have it apply even to children)?

library(reticulate)

py_to_r.python.builtin.int <- function(x) {
  message("hello!")
  bit64::as.integer64(as.character(x))
}

reticulate::py_run_string('obj = 2435823760;')
py$obj
#> hello!
#> integer64
#> [1] 2435823760

reticulate::py_run_string('obj2 = {"key": 2435823760};')
py$obj2
#> $key
#> [1] -1859143536

# not a different class?
main <- import_main(convert = FALSE)
class(main$obj2)
#> [1] "python.builtin.dict"   "python.builtin.object"
class(main$obj2$key)
#> [1] "python.builtin.int"    "python.builtin.object"

# but now it fires?
main$obj2
#> hello!
#> {'key': 2435823760}
main$obj2$key
#> 2435823760

# just not on the py$ object
py$obj2$key
#> [1] -1859143536

Created on 2019-01-27 by the reprex package (v0.2.1)

kevinushey commented 5 years ago

This is because the default method py_to_r.default calls py_ref_to_r, which does its own conversion behind the scenes, separate from the front-end S3 methods.

We could implement a py_to_r.python.builtin.dict method, but I think there's some worry it would be slow on very large dictionaries.
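
For reference, a rough user-level sketch of what such a method might look like (this is not reticulate's implementation; it assumes string keys, and it recurses through py_to_r() so that scalar converters like the int method above get a chance to fire):

py_to_r.python.builtin.dict <- function(x) {
  # walk the keys and convert each value individually
  keys <- reticulate::iterate(x$keys(), reticulate::py_to_r)
  vals <- lapply(keys, function(k) reticulate::py_to_r(reticulate::py_get_item(x, k)))
  stats::setNames(vals, unlist(keys))
}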

colearendt commented 5 years ago

Interesting. Thanks, Kevin. Is there any conceivable way for a user to work around or modify this type of behavior, short of just rewriting all of these methods? I ask because I am working with an OOM in this case and have no qualms about working around the issue, but it feels like I'm sort of stuck at the moment. I tried poking at a dict method, but I felt very much out of my depth, and it wasn't clear whether that would even help :smile:

Maybe the answer is just importing main manually with convert = FALSE?

kevinushey commented 5 years ago

I don't think there's any way to override this right now (short of registering your own python.builtin.dict S3 method, which I don't think is a great idea)

colearendt commented 5 years ago

Any further thinking on this one and what the best long-term approach is? I would love to have a nice way to resolve this, if possible 😄

eddelbuettel commented 5 years ago

I fear that, because of the lack of a native 64-bit integer representation on the R side, you will always need a special case / trick. R, after all, only has 32-bit integers and 64-bit doubles.

We have the same problem with nanotime, which is "merely" between C(++) and R. There we use double as the "payload carrier" (we need 64 bits to get the int64 across), but once in R we hand the 64 bits over to bit64 to interpret them correctly for us (as nanoseconds since the epoch), giving us the full 19 (IIRC) digits of precision we cannot get from a double.
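
The precision gap is easy to see from R (printing widened for the example):

options(digits = 22)
9223372036854775807                        # the largest int64, typed as a double literal
#> [1] 9223372036854775808
bit64::as.integer64("9223372036854775807")
#> integer64
#> [1] 9223372036854775807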

lionel- commented 5 years ago

We have talked about an int64 backend for vctrs. It should work much better than bit64, for instance with mean(). Split-combine operations such as vec_rbind(), pmap(), etc. will also do the right thing.

kdkavanagh commented 4 years ago

How might one go about implementing the solution that @eddelbuettel uses in nanotime? The lack of 64-bit conversion all but prevents me from using reticulate daily.

kdkavanagh commented 4 years ago

Took a swing at it in #590.

fb-elong commented 2 years ago

I hope this is helpful for people who need a short-term workaround while waiting for #590 to be merged.

In general, we can convert a Python object to JSON and then transfer it to R. It is not an ideal solution for very long integers, which require special handling when parsing the JSON string, as discussed in https://stackoverflow.com/questions/65009389/parse-json-with-big-integer-in-r.

Input

x_py <- reticulate::py_run_string('obj = [2435823760, 123, ["abc", 1234567890987654321]];', convert = FALSE)

json <- reticulate::import("json")  # convert = TRUE, so dumps() returns an R string
x_json <- json$dumps(x_py$obj)

jsonlite::fromJSON(x_json, simplifyVector = FALSE)

Output

[[1]]
[1] 2435823760

[[2]]
[1] 123

[[3]]
[[3]][[1]]
[1] "abc"

[[3]][[2]]
[1] 1.234568e+18
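
As the linked Stack Overflow thread suggests, jsonlite can instead keep big integers as character, which can then be handed to bit64 (a sketch; exactly which values get stringified depends on jsonlite's bigint threshold):

x <- jsonlite::fromJSON(x_json, simplifyVector = FALSE, bigint_as_char = TRUE)
bit64::as.integer64(x[[3]][[2]])
#> integer64
#> [1] 1234567890987654321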