rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0
1.68k stars 328 forks

Pandas DataFrame and R data.frame translation #2

Closed seandavi closed 7 years ago

seandavi commented 7 years ago

Wonderful work so far! Just dreaming of the future...

jjallaire commented 7 years ago

If Pandas data frames can be decomposed into NumPy arrays then it should be possible to do this without too much more invention.

eddelbuettel commented 7 years ago

Well ... when I once needed to get NumPy into R, the only way to do it was to wrap an external (C) library as part of RcppCNPy. Maybe there are better ways now; I'd be eager to learn about them either way.

terrytangyuan commented 7 years ago

@jjallaire pandas.DataFrame.as_matrix is the trick

jjallaire commented 7 years ago

As has been pointed out, right now this is possible by decomposing the R and/or Pandas data frame into vectors / matrices. I think this is all we will do for the foreseeable future, as fully handling data frames will involve dealing with character vectors, dates, list columns, etc. and end up being too large a project.
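The decomposition described above can be sketched on the Python side as follows. This is a minimal illustration, not reticulate's implementation: the helper names `decompose` and `recompose` are hypothetical, and it uses `DataFrame.to_numpy()`-style access because `as_matrix` was later removed from pandas.

```python
import numpy as np
import pandas as pd


def decompose(df):
    """Split a DataFrame into a dict of column name -> 1-D NumPy array."""
    return {name: df[name].to_numpy() for name in df.columns}


def recompose(columns):
    """Rebuild a DataFrame from a dict of equal-length 1-D arrays."""
    return pd.DataFrame(columns)


# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10, 20, 30]})
columns = decompose(df)       # plain arrays, one per column
rebuilt = recompose(columns)  # same columns, same order
```

Working column by column keeps each array's own dtype, whereas a single matrix would force all columns to a common type.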

terrytangyuan commented 7 years ago

Makes sense. The handling for those would be very hard to maintain.

saurfang commented 7 years ago

Feather / Apache Arrow, a data frame serialization framework that supports both R and Python, could be useful here. https://github.com/wesm/feather

I took a stab at it. Any suggestions are welcome. I have not extensively tested it yet. https://github.com/saurfang/reticulate.df

jjallaire commented 7 years ago

Yes, feather definitely has all of the bits required to do this sorted out. Our plan is to add data frame support to reticulate using the same techniques as feather (sharing code if possible), but not to require a full serialize/deserialize to disk to do the conversion.

saurfang commented 7 years ago

That's very exciting. Specifically, are you talking about converting a pandas DataFrame to Apache Arrow format (or something similar) in a memory buffer, and reading that into R via Rcpp (to avoid disk serialization/compression and memory copy)? Or would this be a more ambitious implementation of a data.frame backend that lives in external memory entirely?

Is there a rough timeline for when you might be working on and releasing this?

jjallaire commented 7 years ago

Hopefully by the end of this year.

shearerpmm commented 6 years ago

Did this end up working? I'm interested in whether R-Python communication can handle large data frames without expensive serialization/deserialization.

jjallaire commented 6 years ago

Yes, this is now available: https://rstudio.github.io/reticulate/articles/calling_python.html#data-frames

shearerpmm commented 6 years ago

Neat! Am I to understand that the discussion around arrays (no copies needed) also applies to dataframes?

jjallaire commented 6 years ago

No, Pandas data frames created from NumPy arrays automatically make copies of the arrays. So there is "no copy" going from R vector to NumPy array, but there is ultimately a copy made by Pandas.
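The copy made by Pandas can be observed from Python alone. The sketch below assumes the frame is built from a dict of arrays (how reticulate builds it internally is not stated in this thread); with dict input, pandas copies the array data by default.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0])
df = pd.DataFrame({"x": arr})  # dict input: pandas copies the array data

arr[0] = 99.0  # mutate the original array after building the frame

# The frame holds its own buffer, so it does not see the mutation.
shares = np.shares_memory(arr, df["x"].to_numpy())
first_value = df["x"].iloc[0]
```

Here `shares` is `False` and `first_value` is still `1.0`, which shows that the frame owns a separate in-memory copy.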

shearerpmm commented 6 years ago

But that Pandas copy is memory-to-memory, so at least the disk is never involved?

jjallaire commented 6 years ago

Correct, these are all very fast memory to memory copies of contiguous vectors.

ParissaM commented 11 months ago

Hello,

In my case I was reading the data from an S3 bucket and had the same issue. What helped me was adding the following snippet to the Python function:

for c in df.columns:
    df[c] = np.array(df[c].values)

So the Python function would look like this:

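import awswrangler as wr
import numpy as np

def get_data_from_db(db_name, query):
    # Run the Athena query and load the result into a pandas DataFrame.
    df = wr.athena.read_sql_query(
        sql=query,
        database=db_name,
        ctas_approach=False,
    )
    # Rebuild every column from a plain NumPy array.
    for c in df.columns:
        df[c] = np.array(df[c].values)
    return df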

Hope this helps, Regards
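The comment above does not say what the underlying issue was. One plausible reading, and it is only an assumption, is that the query result contained columns backed by pandas extension arrays (for example the nullable `Int64` dtype) rather than plain NumPy arrays. Under that assumption, the sketch below shows what the loop changes, using a hypothetical one-column frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a nullable-integer (extension array) column.
df = pd.DataFrame({"n": pd.array([1, 2, 3], dtype="Int64")})
before = type(df["n"].values).__name__  # an extension array class

# Same loop as in the comment: rebuild each column from a NumPy array.
for c in df.columns:
    df[c] = np.array(df[c].values)

after = type(df["n"].values).__name__  # now a plain NumPy ndarray
```

The resulting NumPy dtype can differ between pandas versions and depends on whether the column holds missing values, so the sketch checks only the container type.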