rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

Incorrect conversion of some pandas dataframes resulting in memory addresses? #324

Open yjml opened 6 years ago

yjml commented 6 years ago

I'm having some issues with some pandas dataframes failing to make it over to the R side:

 Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class "c("decimal.Decimal", "python.builtin.object")" to a data.frame 

The data originates in Spark, is written as parquet to S3, and is read back with pyarrow.parquet and s3fs, along the lines of the following (the import() version fails as well):

library(reticulate)
runstr = sprintf('
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem(key="%s", secret="%s")
pqds = pq.ParquetDataset("%s", filesystem=fs)
p = pqds.read().to_pandas()', awskey, 
                              awssecret, 
                              s3path)
py_run_string(runstr, convert = FALSE)
rdf = py_to_r(py$p)

Attached (reticulate_pandadf.zip) are three pickled subsets of the pandas dataframes: two with problems, and one from the same pipeline that converts successfully.

library(reticulate)
pandas = import("pandas")
pandas$read_pickle("bad_pandadf.pickle")
    Error in as.data.frame.default(x[[i]], optional = TRUE) : 
      cannot coerce class "c("decimal.Decimal", "python.builtin.object")" to a data.frame

py_run_string('
import pandas
p1 = pandas.read_pickle("bad_pandadf.pickle")
p2 = pandas.read_pickle("bad_pandadf2.pickle")
p3 = pandas.read_pickle("good_pandadf.pickle")', convert = FALSE)

py$p1
    Error in as.data.frame.default(x[[i]], optional = TRUE) : 
      cannot coerce class "c("decimal.Decimal", "python.builtin.object")" to a data.frame
py$p2
    Error in as.data.frame.default(x[[i]], optional = TRUE) : 
      cannot coerce class "c("decimal.Decimal", "python.builtin.object")" to a data.frame
py$p3
    [ Results ] 

In particular, the problem appears to be the column status in p1 and daysupp in p2. Deselecting these columns allows the dataframes to make it over to R:

py_run_string('p1a =  p1[[col for col in p1.columns if col not in ["status"]]]', convert = FALSE)
py_run_string('p2a =  p2[[col for col in p2.columns if col not in ["daysupp"]]]', convert = FALSE)
py$p1a
    [ Results ] 
py$p2a
    [ Results ] 
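
Rather than finding the offending columns by trial and error, they can be located on the Python side. The following is a minimal sketch, not part of reticulate: it assumes the dataframe has first been turned into a plain dict of column name to list of values (for example with `p1.to_dict("list")`), and the helper name `decimal_columns` and the sample data are hypothetical.

```python
from decimal import Decimal

def decimal_columns(columns):
    """Return the names of columns holding at least one Decimal value.

    `columns` is assumed to map column name -> list of values,
    e.g. the result of DataFrame.to_dict("list").
    """
    return [name for name, values in columns.items()
            if any(isinstance(v, Decimal) for v in values)]

# Hypothetical data shaped like the failing dataframe
example = {"id": [1, 2], "status": [Decimal("30"), None], "name": ["a", "b"]}
found = decimal_columns(example)
print(found)
```

Columns reported here are the ones that would need casting (or dropping) before the dataframe is handed to R.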

yjml commented 6 years ago

Turns out my server's RStudio packages were out of date; I was on 1.6.

With 1.9 the dataframes do make it over to R without an error, but the problematic columns are still wrong: their values show up as e.g. <environment: 0xc1957c0> instead of the expected contents, e.g. 30.

jwhendy commented 2 years ago

Is there any input on this? It's quite old, yet happening to me. I'm essentially brand new to reticulate, so I can't tell if this is expected or an actual issue... any input/clarification from someone who knows?

I tried two ways and ran into issues on both.

> head(py$df)
                A                                 B    C    D                                 E
1             abc <environment: 0x000001de915241c8>  123  123 <environment: 0x000001de8693ede0>
2             def <environment: 0x000001de9150ccd8>  123  123 <environment: 0x000001de86920278>
3              gh <environment: 0x000001de8e081ec8>  123  123                               NaN
4             ijk <environment: 0x000001de9220b050>  123  123                               NaN
5             lmn <environment: 0x000001de921dbe98>  123  123                               NaN

In looking at some of the list entries, I did note the use of Decimal().

[[1]]$Order
Decimal('32')

Thanks for any advice on how to figure this out and/or to confirm if this is intended. I have no issues running the python code that builds this dataframe directly, so it seems it's something about the handoff from python to R.
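
One possible workaround, sketched here under assumptions, is to strip the Decimal objects on the Python side before the dataframe is built. The sketch assumes the source records are plain lists and dicts (as in the list entries shown above) and that converting every Decimal to a Python float is acceptable; `plain_numbers` and the sample `items` are hypothetical names, not part of boto3 or reticulate.

```python
from decimal import Decimal

def plain_numbers(value):
    """Recursively replace Decimal values with Python floats."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, dict):
        return {k: plain_numbers(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [plain_numbers(v) for v in value]
    return value  # strings, ints, None, etc. pass through unchanged

# Hypothetical records resembling the list entries above
items = [{"A": "abc", "Order": Decimal("32")},
         {"A": "def", "Order": Decimal("1.5")}]
clean = plain_numbers(items)
```

A dataframe built from `clean` would then hold ordinary floats, which should cross over to R as numeric columns.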

t-kalinowski commented 2 years ago

Thanks for bringing this up again; this is on the backlog.

This makes me wonder what decimal.Decimal objects should convert to in R. R doubles don't seem quite right, but I'm not sure what a better alternative would be.

jwhendy commented 2 years ago

@t-kalinowski thanks for taking a look! Honestly Decimal was completely new to me. I don't feel I have the chops to say much about this decision... but it did dawn on me that for my use-case, this isn't exactly "programmed" into dynamodb as a Decimal? It seems like an artifact of python, no?

I.e. in looking at AWS directly, my Order variable above is stored as a Number. So, again, I'm probably the wrong person to answer this as AWS, boto3, and reticulate are all pretty new to me... but if the worry is that R can't handle some of the fancy specifications of Decimal, it's not clear in what cases those fancy specifications could make it down given the original data.

Put another way, in what case does as.numeric(Decimal('32')) (or whatever the call would be) go wrong when the values are starting from an AWS container (again, in my use case)?
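
One way to probe that question is to check, per value, whether a Decimal survives the trip through a 64-bit float, which is what an R double is. This is a rough sketch: `float_is_lossless` is a hypothetical helper, and it treats a conversion as lossless when the float's shortest printed form reads back as the same Decimal.

```python
from decimal import Decimal

def float_is_lossless(d):
    """True if d -> float -> shortest repr -> Decimal gives back d."""
    return Decimal(repr(float(d))) == d

small = float_is_lossless(Decimal("32"))
tenth = float_is_lossless(Decimal("0.1"))
big_int = float_is_lossless(Decimal("9007199254740993"))  # 2**53 + 1
long_frac = float_is_lossless(Decimal("12345678901234567890.123456789"))
print(small, tenth, big_int, long_frac)
```

Values like 32 or 0.1 come through intact, while integers above 2**53 or numbers with more than roughly 15 to 17 significant digits do not. As far as I know, DynamoDB's Number type allows up to 38 digits of precision, so such values could in principle occur even if they never do in a given table.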

delabj commented 1 month ago

I found this thread while running into the same issue on AWS. I don't think I've seen a resolution or workaround mentioned. Has this been resolved or discussed elsewhere?

t-kalinowski commented 1 month ago

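If you have an R data.frame where one of the columns is an unconverted list of Decimal objects, you can convert it to an R numeric (double) vector like this:

library(reticulate)

# Convert a list column of Python Decimal objects to an R numeric vector.
Decimals_to_numeric <- function(x) {
  py_float <- import_builtins()$float  # Python's built-in float()
  purrr::map_dbl(x, py_float)          # one double per Decimal
}

df$col_of_Decimals <- Decimals_to_numeric(df$col_of_Decimals)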