
ENH: IO support for R data files with `pandas.read_rdata` and `DataFrame.to_rdata` #40287

Open ParfaitG opened 3 years ago

ParfaitG commented 3 years ago

Currently, pandas IO tools for binary files largely support the commercial statistical packages (SAS, Stata, SPSS). Interestingly, R binary types (.rds, .rda) are not included. Since many data science teams work across the open source stacks, pandas IO support of R data files may be worthwhile to pursue.

I know there is some history of pandas with rpy2. However, there may be a way to integrate an IO module for R data files without an optional dependency (i.e., pyreadr) by using a lightweight C library: librdata. Also, R's saveRDS uses compression types (gzip, bzip2, and xz) already handled by pandas IO.
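
As a rough illustration of that last point, the compression of an .rds/.rda file can be sniffed from its leading magic bytes with the standard library alone. A minimal sketch (the function name and mapping are hypothetical, not pandas API):

import bz2
import gzip
import lzma

# magic bytes -> stdlib decompressor
_MAGIC = {
    b"\x1f\x8b": gzip.decompress,       # gzip
    b"BZh": bz2.decompress,             # bzip2
    b"\xfd7zXZ\x00": lzma.decompress,   # xz
}

def decompress_rdata(path):
    with open(path, "rb") as f:
        raw = f.read()
    for magic, decompress in _MAGIC.items():
        if raw.startswith(magic):
            return decompress(raw)
    return raw  # already uncompressed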

Thanks to the authors of pyreadr and librdata (not unlike the sas7bdat authors for read_sas or the PyDTA authors for read_stata), I was able to implement a demo for an uncompressed rds type.

R

set.seed(332020)
alpha <- c(LETTERS, letters, c(0:9))
data_tools <- c("sas", "stata", "spss", "python", "r", "julia")

random_df <- data.frame(
  group = factor(sample(data_tools, 500, replace=TRUE)),
  int = sample(1:15, 500, replace=TRUE),
  num = rnorm(500),
  char = replicate(500, paste(sample(alpha, 3, replace=TRUE), collapse="")),
  bool = sample(c(TRUE, FALSE), 500, replace=TRUE),
  date = sample(seq(as.Date("2001-01-01"), Sys.Date(), by="days"), 500, replace=TRUE),
  ts = as.POSIXct(sample(1577836800:as.integer(Sys.time()), 500, replace=TRUE), origin="1970-01-01")
)

head(random_df, 5)
#    group int        num char  bool       date                  ts
# 1      r  15  1.2012638  3kp  TRUE 2006-12-07 2020-07-04 21:06:23
# 2   spss   5  0.9627570  rf6  TRUE 2010-10-28 2020-03-22 02:04:45
# 3  julia   9 -0.7922929  t9Q FALSE 2003-01-02 2021-02-11 06:33:02
# 4  julia  14 -0.4305794  zWw  TRUE 2004-11-24 2020-01-25 11:24:20
# 5 python   5 -1.0262956  LqK  TRUE 2020-08-20 2021-01-15 23:22:05

tail(random_df, 5)
#      group int        num char  bool       date                  ts
# 496 python   2  1.1038483  xGh  TRUE 2014-01-14 2020-11-30 11:30:00
# 497    sas   1 -0.5588906  BbC  TRUE 2011-03-15 2020-11-03 06:30:08
# 498    sas   1  0.3989181  FHi  TRUE 2003-03-02 2020-01-16 08:23:56
# 499 python   1 -0.7840641  e3k FALSE 2001-10-23 2020-05-27 06:53:21
# 500   spss   7  0.2351526  Klv  TRUE 2002-05-31 2020-05-28 11:42:56

### RDS
saveRDS(random_df, "/path/to/r_df.rds", compress=FALSE)

### RDA
save(random_df, file="/path/to/r_df.rda", compress=FALSE)

Python (using a Cython-built module)

Parser

from rparser import Parser, Writer           # slight adjustments to pyreadr's pyx w/o 3rd party imports
import pandas as pd
from datetime import datetime as dt, timedelta

class BaseRParser(Parser):
    """
    Parses the RData or Rds file using the parser defined
    in librdata.pyx which in turn uses the C API of librdata.
    """

    def __init__(self):
        self.counter = 0
        self.is_factor = False
        self.col_names = {}
        self.row_names = {}
        self.col_types = {}
        self.col_data = {}
        self.text_vals = {}
        self.value_labels = {}

    def handle_table(self, name):
        """
        Every object in the file is called a table; this method is invoked once per object.
        :param name: str: the name of the table
        """
        pass

    def handle_column(self, name, data_type, data, count):
        """
        Invoked once for each column in the table.
        :param name: str: column name, may be None
        :param data_type: object of type DataType(Enum) (defined in librdata.pyx)
        :param data: a dictionary containing the data in R vector, may be empty
        :param count: int: number of elements in the array
        """

        self.row_count = count

        if self.is_factor:
            data = {k:self.value_labels[v] for k,v in data.items()}
            self.is_factor = False

        if data_type == "bool":
            data = {k:True if v==1 else False for k,v in data.items()}

        if data_type == "date":
            data = {k:dt(1970,1,1,0,0) + timedelta(v) for k,v in data.items()}        

        if data_type == "datetime":
            data = {k:dt.fromtimestamp(v) for k,v in data.items()}        

        self.col_data[self.counter] = data
        self.col_types[self.counter] = data_type
        self.counter += 1

    def handle_column_name(self, name, index):
        """
        Sometimes name is None in handle_column but it is recovered with this method.
        :param name: str: name of the column
        :param index: int: index of the column
        """

        self.col_names[index] = name

        if index == (self.counter - 1):
            self.compile_dataframe()

    def handle_dim(self, name, data_type, data, count):
        """
        Invoked once to retrieve the number of dimensions.
        :param name: str: column name, may be None
        :param data_type: object of type DataType(Enum) (defined in librdata.pyx)
        :param data: a numpy array representing the number of dimensions
        :param count: int: number of elements in the array
        """
        pass

    def handle_dim_name(self, name, index):
        """
        Get one dimension name, one at a time, for matrices, arrays, tables.
        :param name: str: name of the dimension
        :param index: int: index of the dimension
        """
        pass

    def handle_row_name(self, name, index):
        """
        Handles R dataframe's rownames
        :param name: str: name of the row
        :param index: int: index of the row
        """

        self.row_names[index] = name

    def handle_text_value(self, name, index):
        """
        For character vectors this will be called once per row and will 
        retrieve the string value for that row.
        :param name: str: string value for the row
        :param index: int: index of the row
        :return:
        """

        self.text_vals[index] = name
        if index == (self.row_count - 1):
            self.col_data[self.counter - 1] = self.text_vals
            self.text_vals = {} 

    def handle_value_label(self, name, index):
        """
        Factors are represented as integer vectors.
        For factors, this method is called before reading the integer data 
        in the Factor column with handle_column and will give all the 
        string values corresponding to the integer values.
        :param name: str: string value
        :param index: int: integer value
        :return:
        """

        self.is_factor = True
        self.value_labels[index + 1] = name   # R factor codes are 1-based

    def compile_dataframe(self):
        df_data = {name: vals for vals, name in zip(self.col_data.values(), self.col_names.values())}

        self.r_dataframe = pd.DataFrame(df_data)

parser = BaseRParser()
parser.read_rds("/path/to/r_df.rds")
parser.read_rds("/path/to/r_df.rda")

py_df = parser.r_dataframe
print(py_df)
#       group  int       num char   bool       date                  ts
# 0         r   15  1.201264  3kp   True 2006-12-07 2020-07-04 21:06:23
# 1      spss    5  0.962757  rf6   True 2010-10-28 2020-03-22 02:04:45
# 2     julia    9 -0.792293  t9Q  False 2003-01-02 2021-02-11 06:33:02
# 3     julia   14 -0.430579  zWw   True 2004-11-24 2020-01-25 11:24:20
# 4    python    5 -1.026296  LqK   True 2020-08-20 2021-01-15 23:22:05
# ..      ...  ...       ...  ...    ...        ...                 ...
# 495  python    2  1.103848  xGh   True 2014-01-14 2020-11-30 11:30:00
# 496     sas    1 -0.558891  BbC   True 2011-03-15 2020-11-03 06:30:08
# 497     sas    1  0.398918  FHi   True 2003-03-02 2020-01-16 08:23:56
# 498  python    1 -0.784064  e3k  False 2001-10-23 2020-05-27 06:53:21
# 499    spss    7  0.235153  Klv   True 2002-05-31 2020-05-28 11:42:56

# [500 rows x 7 columns]

Writer

class BaseRWriter(Writer):
    """
    Writes pandas data to the RData or Rds file using the writer defined 
    in librdata.pyx which in turn uses the C API of librdata.
    """

    def write_rds(self, path=None, file_format="rds", frame=None, frame_name=None):
        """
        Write an RData or Rds file.
        path: str: path to the file
        file_format: str: rdata or rds
        frame: pandas DataFrame to write
        frame_name: name of the object to write; irrelevant for rds format
        """

        py_to_r_types = {
            "int64": "INTEGER",
            "float64": "REAL",
            "bool": "LOGICAL",
            "object": "CHARACTER",
            "datetime64[ns]": "TIMESTAMP"
        }

        self.open(path, file_format)
        self.set_row_count(frame.shape[0])
        self.set_table_name(frame_name)

        for col_name, col_type in zip(frame.columns.tolist(), frame.dtypes.tolist()):
            cur_type = py_to_r_types[str(col_type)]

            self.add_column(str(col_name), str(cur_type))

        # convert datetime columns to epoch seconds (demo hardcodes one timezone)
        for col in frame.select_dtypes(include=['datetime']).columns:
            frame[col] = frame[col].dt.tz_localize("America/Chicago").astype(int).div(10**9)

        for col_indx,((col,ser),dtype) in enumerate(zip(frame.to_dict().items(), frame.dtypes.tolist())):
            cur_type = py_to_r_types[str(dtype)]

            for row_indx, (indx, val) in enumerate(ser.items()):
                self.insert_value(row_indx, col_indx, val, cur_type)

        self.close()

writer = BaseRWriter()
writer.write_rds(
    path="/path/to/py_df.rds", 
    file_format="rds",
    frame=py_df, 
    frame_name="pandas_df"
)

R

py_df <- readRDS("/path/to/py_df.rds")

str(py_df)
# 'data.frame': 500 obs. of  7 variables:
#  $ group: chr  "r" "spss" "julia" "julia" ...
#  $ int  : int  15 5 9 14 5 9 7 7 5 10 ...
#  $ num  : num  1.201 0.963 -0.792 -0.431 -1.026 ...
#  $ char : chr  "3kp" "rf6" "t9Q" "zWw" ...
#  $ bool : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
#  $ date : 'POSIXct' num  2006-12-07 2010-10-28 2003-01-02 2004-11-24 2020-08-20 ...
#  $ ts   : 'POSIXct' num  2020-07-04 21:06:23 2020-03-22 02:04:45 2021-02-11 06:33:02 2020-01-25 11:24:20 2021-01-15 23:22:05 ...
#  - attr(*, "datalabel")= chr "pandas_df"
#  - attr(*, "var.labels")= chr [1:7] "" "" "" "" ...

head(py_df, 5)
#    group int        num char  bool       date                  ts
# 1      r  15  1.2012638  3kp  TRUE 2006-12-07 2020-07-04 21:06:23
# 2   spss   5  0.9627570  rf6  TRUE 2010-10-28 2020-03-22 02:04:45
# 3  julia   9 -0.7922929  t9Q FALSE 2003-01-02 2021-02-11 06:33:02
# 4  julia  14 -0.4305794  zWw  TRUE 2004-11-24 2020-01-25 11:24:20
# 5 python   5 -1.0262956  LqK  TRUE 2020-08-20 2021-01-15 23:22:05

tail(py_df, 5)
#      group int        num char  bool       date                  ts
# 496 python   2  1.1038483  xGh  TRUE 2014-01-14 2020-11-30 11:30:00
# 497    sas   1 -0.5588906  BbC  TRUE 2011-03-15 2020-11-03 06:30:08
# 498    sas   1  0.3989181  FHi  TRUE 2003-03-02 2020-01-16 08:23:56
# 499 python   1 -0.7840641  e3k FALSE 2001-10-23 2020-05-27 06:53:21
# 500   spss   7  0.2351526  Klv  TRUE 2002-05-31 2020-05-28 11:42:56

bashtage commented 3 years ago

Are R data files commonly used for data exchange? One of the arguments for SAS and Stata is that it is unfortunately common to see organizations publishing datasets in these formats. I can't say I've ever seen RDS used in this way.

dsaxton commented 3 years ago

@ParfaitG Interesting proposal. We would have to be a bit careful with naming here, as read_rds could very easily lead to confusion with AWS RDS.

ParfaitG commented 3 years ago

Good point @bashtage! Perhaps since R is a programming language or environment and not traditional software with proprietary types, the .rds format is not traditionally used in data exchange. Raw data and code would be enough to reproduce end-use data.

However, anecdotally among internal teams, many advanced useRs use this format to save cleaned data, plotting data, modeling results, etc. due to its efficiency as a binary, compressed serialization type without the need to parse text files and detect types. Also, the .rda format is the dominant data storage format in R packages. And this need is routinely raised on StackOverflow, usually answered with rpy2 (requires R installed) or pyreadr solutions. A convenience pandas handler can help the open source ecosystem.

@dsaxton, I didn't think of that name collision. We can call it read_rdata, especially since the aforementioned C library can read and write .rds (a single R object) and .rda (multiple R objects) types.
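
To make that distinction concrete, a possible signature could return a DataFrame for .rds and a dict of DataFrames for .rda. Purely illustrative, not an implemented pandas API:

from typing import Dict, Union

import pandas as pd

def read_rdata(path: str) -> Union[pd.DataFrame, Dict[str, pd.DataFrame]]:
    """Hypothetical API: an .rds file holds one R object, so return a single
    DataFrame; an .rda file can hold several named objects, so return a dict
    mapping each object name to a DataFrame."""
    ...

# df = read_rdata("r_df.rds")      # -> DataFrame
# frames = read_rdata("r_df.rda")  # -> {"random_df": DataFrame, ...}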

ParfaitG commented 3 years ago

To underscore how regularly this need has been raised on StackOverflow throughout the years:

bashtage commented 3 years ago

I think supporting R data files is reasonable. The big question would be how should support be added. Would it be better to take a soft dep on pyreadr like pandas does for most IO (e.g., openpyxl for Excel)? This way it will work as expected if the library is available. It saves the cost of maintaining a vendored code snippet and keeping it synced upstream. The downside is that new releases of a soft dep can break CI.

ParfaitG commented 3 years ago

Thanks, @bashtage, my thought is to bypass pyreadr altogether, which is simply a Python wrapper around the C library, librdata. We only need to borrow and moderately adjust their Cython code (.pxd and .pyx), crediting the authors.

Specifically, the pandas plan, requiring no external dependencies, would include (a build sketch follows below):

  1. Store librdata's C src files (12 files at 102 kb) either in pandas._lib or pandas.io.rdata, crediting the author. (Rarely will this library need updates as rds and rda are sacrosanct R core types.)
  2. Cythonize with .pxd and .pyx for the r_parser module (i.e., pandas.io.sas cythonizes its _sas module).
  3. Integrate r_parser in a new pandas.io.rdata module as demonstrated above (with updated results).

Now, would the pandas team be open to a new IO C extension? r_parser.so builds at the same size as _sas.so (~1 mb).
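
For steps 1-2 above, a minimal build sketch, assuming the librdata C sources were vendored under a hypothetical pandas/_libs/src/librdata path (module and file names are illustrative, mirroring how pandas builds its Cython extensions):

import glob

from Cython.Build import cythonize
from setuptools import Extension

# compile the Cython wrapper together with the vendored librdata C sources
ext = Extension(
    "pandas.io.rdata.r_parser",
    sources=["pandas/io/rdata/r_parser.pyx"]
    + glob.glob("pandas/_libs/src/librdata/*.c"),
    include_dirs=["pandas/_libs/src/librdata"],
)
extensions = cythonize([ext])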

ofajardo commented 3 years ago

pyreadr developer here. I would personally suggest using pyreadr as a soft dep. It is not correct that the rds and rda formats do not change; they do with major and minor versions of R, and these changes are undocumented (see for example here, here, here). And we are still improving, as we cannot read all existing features, since again everything is undocumented. That means if you do your own code base you will have to maintain it (maintenance would be completely on your side since I don't have capacity to maintain two code bases).

I also develop pyreadstat, which pandas uses as a soft dep for read_spss, and that approach seems to be working really well.
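
For reference, the read_spss-style soft-dep pattern could translate roughly as follows. A sketch only, assuming pandas' internal import_optional_dependency helper and pyreadr's read_r, which returns an ordered dict of object name to DataFrame (the key is None for .rds files); the wrapper name is illustrative:

from pandas.compat._optional import import_optional_dependency

def read_rdata_via_pyreadr(path):
    # raises a helpful ImportError if pyreadr is not installed
    pyreadr = import_optional_dependency("pyreadr")
    result = pyreadr.read_r(path)      # OrderedDict: name -> DataFrame
    if len(result) == 1:               # .rds holds a single object
        return next(iter(result.values()))
    return dict(result)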

Of course pyreadr is an open source project, so you are free to take the code. However, take into account that the license of pyreadr is very restrictive. I am not sure what kind of license pandas has, but you have to ensure that the restrictions for these pieces of code, even if they become detached from pyreadr, stay as strict as they are now. You will also need to distribute the pyreadr license and attached licenses together with the pandas license. I will also ask you to do the first commit with my GitHub handle so that I appear as a contributor to the repo.

ParfaitG commented 3 years ago

Thank you, @ofajardo, for your input! First, do be aware you can take the lead on a PR for this proposed IO module. As an author who relies on pandas, why not become an original contributor? If using the soft dep approach, you can follow the similar setup of pandas.io.spss.

Given your response, here are my thoughts (quoted in the replies below):

With that said, thank you for authoring various data exchange packages in the pandas ecosystem over the years! From the SO posts above, many have been grateful. I am looking into other solutions to build this specific IO support and may have a different approach in mind.

bashtage commented 3 years ago

> Circular Imports: If using pyreadr as a soft dep, we will be building a pandas module that imports a package that imports pandas. Many of the imported modules used in other pandas IO modules (beautifulsoup, lxml, openpyxl, etc.) do not import pandas or serve as wrappers to pandas. The contributor of read_spss may not have known about the underlying C library and proposed a packaged solution. I had hoped to streamline this with a direct connection to the lower level librdata.

It is important to acknowledge that there is a non-trivial developer cost to streamlining. There are three options here:

  1. Do nothing and redirect users to pyreadr. This only requires a one-off documentation change.
  2. Soft dependency on pyreadr. Pretty minimal code changes, and robust as long as pyreadr is still maintained.
  3. Vendor the C library. Requires maintenance, and in most cases will only work when pyreadr also works, unless there is a change that requires altering the underlying library code.

> Dependency Restriction: One of the great aspects of pandas is that it is a full service solution right off the shelf.

More IO formats require a soft dep than don't. An incomplete list of formats that require one:

> Avoid Redundancy: The pandas IO API, mature now for years, includes a suite of IO handling: reading from URLs, FTP, storage options like Amazon S3 buckets, file-like objects, buffers, and various compression types, with likely additional future features. Additionally, the pandas DataFrame API covers timezone handling, data type migration, categorical dtypes, and other nuanced needs with underlying core functions not available to pandas package end users. I had hoped to avoid this overlap of functionality and ensure consistency across other pandas IO modules.

I don't see how this argues against a soft dependency. pandas could take it as a soft dep and still provide a uniform API on top, including building any missing features, or converting between what the dep prefers and what pandas prefers.

> Librdata Functionality: As mentioned earlier, rds and rda files rarely (not never) change.

There have been about a page full of commits in the past year. If these are all necessary then it seems to drift around a bit.

> • PEP Standards: Pandas mostly stays current with Python's PEP standards across its functionality, including type hinting/annotation, avoidance of relative imports, etc. A brief review of pyreadr does not indicate adherence to current PEP standards. I also see some need to update older uses such as OrderedDict (may not be needed for Python 3.6+, recommended for pandas users) and the % operator for string formatting (de-emphasized in Python 3 but not deprecated yet). Possibly, too, the code base can integrate pythonic semantics like list/dict comprehensions, generators, f-strings, and others. With the rigorous quality standards of pandas using mypy, styling, and CI testing, I had hoped to ensure a robust rdata IO module.

I don't think this is much of an argument against the soft dep approach. Each package is allowed to have its own accepted code style. NumPy is pretty far from "full" PEP yet no one suggests not building on NumPy.

bashtage commented 3 years ago

You also seem to acknowledge pyreadr in your Cython above. You cannot use any code from pyreadr since it is GPL. A vendored version would need a clean-sheet implementation that directly wraps the C library without using code from pyreadr.

ofajardo commented 3 years ago

hey @ParfaitG thanks for your thoughtful answer!

I am still aligned with @bashtage in thinking that a soft dep is better in this (and other IO) case(s), and in general that modular is better than monolithic. It seems that there are enough successful examples of this approach in pandas, as @bashtage has pointed out, to demonstrate the approach works very well... But that's just my humble opinion, and I am not a pandas dev, so it's up to you guys to decide!

In case you guys would like to go for a soft dep, you have my full collaboration to make changes in pyreadr to align and better integrate with pandas, including cleaning the code to make it more PEP conforming; either doing it myself or accepting PRs from others.

As @bashtage suggests, notice that the license of pyreadr is AGPL, so it probably clashes with pandas and indeed you cannot take it. But doing a better wrapper for librdata from scratch (or some other approach as you mentioned) should be no issue for you in case you guys decide to go for an internal module.

Just a couple of other comments:

> pyreadr limits the use of librdata to its needs, assumptions, and abilities (i.e., timezone, rownames). I had hoped to utilize the full functionality with direct access to librdata.

I actually am using the full librdata API, trying to be as comprehensive as possible. librdata currently has a lot of limitations you won't be able to overcome unless you directly contribute to the librdata C code. If you include librdata as a hard dep you will start getting issues around R lists not being read, S4 objects not being read, etc. (just check the pyreadr and librdata issues to see what I mean). I currently don't have capacity to work on those issues, but if you do, and you fix those things in librdata plus your internal module, that would be a step forward! And in case you decide for a soft dep and have ideas on how to improve pyreadr and would like to contribute, you would be very welcome!

However, if you truly want to be in full control of the process and overcome the current limitations imposed by librdata, you should consider writing the converter truly from scratch without relying on librdata.

> I also see some need to update older uses such as OrderedDict

I decided to support older versions of Python as much as possible, but I understand your disagreement with that. What I see in reality is that we do have old production servers with old CentOS which are still running Python 3.5 and 3.4, hence the motivation to keep backward compatibility at the expense of PEP styling.

> Surely, as open source stacks, we can better integrate a data exchange solution between Python and R?

I hope that data exchange between Python, R, and other technologies happens through libraries designed for that purpose, such as feather/arrow. R binary files are undocumented and fully supported only by R, giving them very low interoperability, so in my opinion they are a very poor solution for data exchange.

ParfaitG commented 3 years ago

@ofajardo, FYI - re documentation, see the CRAN doc R Internals (updated 2021-03-05) at section 1.8 Serialisation Formats. Also, see serialize.c (the underlying C code for readRDS and saveRDS) with a large docstring discussing the evolution of versions, along with its parent caller, serialize.R. Similarly, saveload.c for save/load with its parent caller, load.R. Also read the old NEWS dating back to pre-R 1.0 for changes/bug fixes to various methods including load/save and readRDS/saveRDS.

ParfaitG commented 3 years ago

Thank you @bashtage for your comments. Not to belabor this discussion (and I appreciate your time), how about a 4th option: a direct command line call to R to access the readRDS/saveRDS and load/save methods? While this requires users to have R installed (usually included in Anaconda distributions), we would not need any additional libraries or dependencies and could stay current with any R changes to the .rda and .rds file types. Otherwise, I am inclined toward your first option of an IO tools doc change (either a new R section or under the Other file formats section) to include features of the pandas external add-on tool, pyreadr, and invite the author to contribute.

For option 4, consider this demo for .rds types, which converts data to .csv, respecting column names, row names/indexes, and data types, using temp files and a temp directory (like SAS's Work library and Stata's preserve/restore and tempfile). Tested on Windows and Linux. Vanilla, base R is all that is needed (no third-party packages on that end), but users need to have the R bin folder in the Path environment variable to call Rscript (this can be a small note with instructions in the docs). The pandas IO clipboard module also makes subprocess CLI calls across different OSes. Some CI tests would need R installed in those environments if not already.

Read Rdata

R

mtcars$now <- Sys.time()          # ADD TIME FOR DEMONSTRATION
saveRDS(mtcars, "mtcars.rds")

Python

from datetime import datetime
import os 
from subprocess import Popen, PIPE
from tempfile import TemporaryDirectory

import pandas as pd

def read_rdata(rds_file):
    cmd = "Rscript"

    r_to_py_types = {'logical': 'bool', 'integer': 'int64', 'numeric': 'float64',
                     'character': 'str', 'factor': 'str', 'Date': 'date', 'POSIXct': 'date'}

    with TemporaryDirectory() as tmpdir:   
        py_csv = os.path.join(tmpdir, "pydata.csv")

        # BUILD TEMP SCRIPT TO READ RDS AND OUTPUT TO CONSOLE
        r_code = os.path.join(tmpdir, "r_batch.R")

        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)

                       df_r <- readRDS(args[length(args)-1])                       
                       write.csv(df_r, file=args[length(args)])

                       cat(paste(colnames(df_r), collapse=","),"|",
                           paste(sapply(df_r, function(x) class(x)[1]), collapse=","), 
                           sep="")
                    """)

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR
        cmds = [cmd, r_code, rds_file, py_csv]

        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))

        r_hdrs = [h.split(",") for h in output.decode("UTF-8").split("|")]
        py_types = {n:r_to_py_types[d] for n,d in zip(*r_hdrs)}

        dt_cols = [col for col, d in py_types.items() if d == "date"]
        py_types = {k:v for k,v in py_types.items() if v != "date"}

        # IMPORT PANDAS DATA FRAME
        df = pd.read_csv(py_csv, index_col=0, dtype=py_types, parse_dates=dt_cols)

    return df

py_df = read_rdata("mtcars.rds")

print(py_df.dtypes)
# mpg            float64
# cyl            float64
# disp           float64
# hp             float64
# drat           float64
# wt             float64
# qsec           float64
# vs             float64
# am             float64
# gear           float64
# carb           float64
# now     datetime64[ns]
# dtype: object

print(py_df.head())
#                       mpg  cyl   disp     hp  drat     wt   qsec   vs   am  gear  carb                 now
# Mazda RX4            21.0  6.0  160.0  110.0  3.90  2.620  16.46  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Mazda RX4 Wag        21.0  6.0  160.0  110.0  3.90  2.875  17.02  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Datsun 710           22.8  4.0  108.0   93.0  3.85  2.320  18.61  1.0  1.0   4.0   1.0 2021-03-22 10:58:34
# Hornet 4 Drive       21.4  6.0  258.0  110.0  3.08  3.215  19.44  1.0  0.0   3.0   1.0 2021-03-22 10:58:34
# Hornet Sportabout    18.7  8.0  360.0  175.0  3.15  3.440  17.02  0.0  0.0   3.0   2.0 2021-03-22 10:58:34

print(py_df.tail())
# Lotus Europa         30.4  4.0   95.1  113.0  3.77  1.513  16.90  1.0  1.0   5.0   2.0 2021-03-22 10:58:34
# Ford Pantera L       15.8  8.0  351.0  264.0  4.22  3.170  14.50  0.0  1.0   5.0   4.0 2021-03-22 10:58:34
# Ferrari Dino         19.7  6.0  145.0  175.0  3.62  2.770  15.50  0.0  1.0   5.0   6.0 2021-03-22 10:58:34
# Maserati Bora        15.0  8.0  301.0  335.0  3.54  3.570  14.60  0.0  1.0   5.0   8.0 2021-03-22 10:58:34
# Volvo 142E           21.4  4.0  121.0  109.0  4.11  2.780  18.60  1.0  1.0   4.0   2.0 2021-03-22 10:58:34

Write Rdata

Python

def write_rdata(frame, rds_file):
    cmd = "Rscript" 

    py_to_r_types = {'int32': 'integer', 'int64': 'integer', 'float64': 'numeric',
                     'object': 'character', 'bool': 'logical', 'datetime64[ns]': 'POSIXct'}

    r_types = ",".join(frame.reset_index().dtypes.replace(py_to_r_types))              

    with TemporaryDirectory() as tmpdir:       
        py_csv = os.path.join(tmpdir, "py_df.csv")
        frame.to_csv(py_csv)

        # BUILD TEMP SCRIPT TO INPUT CSV AND SAVE RDS
        r_code = os.path.join(tmpdir, "r_batch.R")

        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
                       py_csv <- args[length(args)-2]                       
                       r_types <- strsplit(args[length(args)-1], ",")[[1]]                       

                       df_r <- read.csv(py_csv, colClasses=r_types)
                       df_r <- `row.names<-`(df_r[-1], df_r[[1]])
                       saveRDS(df_r, args[length(args)])
                    """)

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR  
        cmds = [cmd, r_code, py_csv, r_types, rds_file]

        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))

        return None

py_df = (pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv")
           .assign(now=datetime.now()))   # ADD TIME FOR DEMONSTRATION

write_rdata(py_df, "mpg.rds")

R

r_df <- readRDS("mpg.rds")

str(r_df)
# 'data.frame':   398 obs. of  10 variables:
#  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
#  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
#  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
#  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
#  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#  $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
#  $ origin      : chr  "usa" "usa" "usa" "usa" ...
#  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#  $ now         : POSIXct, format: "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" ...

head(r_df)
#   mpg cylinders displacement horsepower weight acceleration model_year origin                      name                 now
# 0  18         8          307        130   3504         12.0         70    usa chevrolet chevelle malibu 2021-03-22 10:58:40
# 1  15         8          350        165   3693         11.5         70    usa         buick skylark 320 2021-03-22 10:58:40
# 2  18         8          318        150   3436         11.0         70    usa        plymouth satellite 2021-03-22 10:58:40
# 3  16         8          304        150   3433         12.0         70    usa             amc rebel sst 2021-03-22 10:58:40
# 4  17         8          302        140   3449         10.5         70    usa               ford torino 2021-03-22 10:58:40
# 5  15         8          429        198   4341         10.0         70    usa          ford galaxie 500 2021-03-22 10:58:40

tail(r_df)
#     mpg cylinders displacement horsepower weight acceleration model_year origin             name                 now
# 392  27         4          151         90   2950         17.3         82    usa chevrolet camaro 2021-03-22 10:58:40
# 393  27         4          140         86   2790         15.6         82    usa  ford mustang gl 2021-03-22 10:58:40
# 394  44         4           97         52   2130         24.6         82 europe        vw pickup 2021-03-22 10:58:40
# 395  32         4          135         84   2295         11.6         82    usa    dodge rampage 2021-03-22 10:58:40
# 396  28         4          120         79   2625         18.6         82    usa      ford ranger 2021-03-22 10:58:40
# 397  31         4          119         82   2720         19.4         82    usa       chevy s-10 2021-03-22 10:58:40

ParfaitG commented 3 years ago

I will get started on a PR that will read/write R data files in two engine versions: pyreadr (as soft dep) and direct R command line calls. (R installs easily in the pandas-dev conda environment.) And I will add an R section to the IO tools docs. For the R command line engine, I will integrate parquet and feather modes using pyarrow and R's counterpart, the arrow package (the above demo runs csv mode). With a working solution, we can then tangibly see and decide how to proceed. Hopefully, we can make it for the 1.3 milestone!
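
For the parquet/feather modes, the exchange would be a straightforward round trip. A minimal sketch, assuming pyarrow is installed on the Python side and the arrow package on the R side (file paths are illustrative, and the R lines would run via Rscript as in the csv demo above):

import pandas as pd

py_df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Python side: pandas writes both formats with pyarrow under the hood
py_df.to_parquet("exchange.parquet")
py_df.to_feather("exchange.feather")

# R side (run via Rscript):
#   df_r <- arrow::read_parquet("exchange.parquet")
#   df_r <- arrow::read_feather("exchange.feather")
#   arrow::write_parquet(df_r, "roundtrip.parquet")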