If Pandas data frames can be decomposed into NumPy arrays then it should be possible to do this without too much more invention.
Well ... when I once needed to get NumPy into R, the only way to do it was to wrap an external (C) library as part of RcppCNPy. Maybe there are better ways now; I'd be eager to learn about them either way.
@jjallaire pandas.DataFrame.as_matrix is the trick
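For anyone landing here, a minimal sketch of that decomposition (column names are made up; note that as_matrix was later deprecated and removed in favor of .values / .to_numpy() in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Decompose the data frame into a plain NumPy matrix plus its labels.
# df.as_matrix() did the same thing in older pandas; recent versions
# use df.values or df.to_numpy() instead.
mat = df.values
cols = list(df.columns)
idx = df.index.values
```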
As has been pointed out, right now this is possible by decomposing the R and/or Pandas data frame into vectors / matrices. I think this is all we will do for the foreseeable future, as fully handling data frames would involve dealing with character vectors, dates, list columns, etc., and would end up being too large of a project.
Makes sense. The handling for those would be very hard to maintain.
Feather / Apache Arrow, being a data frame serialization framework that supports both R and Python, could be useful here. https://github.com/wesm/feather
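For example, the Python side of a feather round trip looks like this (a minimal sketch; assumes the feather/pyarrow dependency is installed, and the filename is made up):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Serialize to a feather file; R can read the same file back with
# feather::read_feather("df.feather") (or arrow::read_feather in the
# newer arrow package).
df.to_feather("df.feather")

# And the reverse direction: read a feather file written from R.
df2 = pd.read_feather("df.feather")
```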
I took a stab at it. Any suggestions are welcome. I have not extensively tested it yet. https://github.com/saurfang/reticulate.df
Yes, feather has definitely sorted out all of the bits required to do this. Our plan is to add data frame support to reticulate using the same techniques as feather (sharing code if possible), but without requiring a full serialize/deserialize to disk to do the conversion.
That's very exciting. Specifically, are you talking about converting a pandas DataFrame to the Apache Arrow format (or something similar) in a memory buffer and reading that into R via Rcpp (to avoid disk serialization/compression and a memory copy)? Or would this be a more ambitious implementation of a data.frame backend that lives entirely in external memory?
Any rough timeline for when you might be working on and releasing this?
Hopefully by the end of this year.
Did this end up working? I'm interested in whether the R-Python communication can handle large data frames without expensive ser/de.
Yes, this is now available: https://rstudio.github.io/reticulate/articles/calling_python.html#data-frames
Neat! Am I to understand that the discussion around arrays (no copies needed) also applies to dataframes?
No, Pandas data frames created from NumPy arrays automatically make copies of the arrays. So there is "no copy" going from an R vector to a NumPy array, but there is ultimately a copy made by Pandas.
But that Pandas copy is memory-to-memory, so at least the disk is never involved?
Correct, these are all very fast memory-to-memory copies of contiguous vectors.
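A quick way to see that Pandas copy for yourself (a sketch; the exact behavior can vary across pandas versions and construction paths):

```python
import numpy as np
import pandas as pd

col = np.arange(10, dtype=np.float64)

# Building a DataFrame from a 1-D array typically copies the data into
# pandas' internal block storage, so the two buffers are distinct.
df = pd.DataFrame({"x": col})
print(np.shares_memory(col, df["x"].values))  # usually False: pandas copied

# By contrast, a plain NumPy view shares the buffer with no copy at all.
view = col[:5]
print(np.shares_memory(col, view))  # True
```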
Hello,
In my case I was reading the data from an S3 bucket and had the same issue. What helped for me was adding the following snippet inside the Python function (presumably because it coerces pandas extension dtypes back into plain NumPy-backed columns, which convert cleanly):
```python
for c in df.columns:
    df[c] = np.array(df[c].values)
```
So the Python function would look like this:
```python
import awswrangler as wr
import numpy as np

def get_data_from_db(db_name, query):
    # Query Athena directly (no CTAS) into a pandas DataFrame.
    df = wr.athena.read_sql_query(
        sql=query,
        database=db_name,
        ctas_approach=False
    )
    # Force every column into a plain NumPy-backed array so the
    # DataFrame converts cleanly back to R.
    for c in df.columns:
        df[c] = np.array(df[c].values)
    return df
```
Hope this helps,
Regards
Wonderful work so far! Just dreaming of the future...