Open ParfaitG opened 3 years ago
Are R data files commonly used for data exchange? One of the arguments for SAS and Stata is that it is unfortunately common to see organizations publishing datasets in those formats. I can't say I've ever seen RDS used in this way.
@ParfaitG Interesting proposal. We would have to be a bit careful with naming here, as read_rds could very easily lead to confusion with AWS RDS.
Good point, @bashtage! Perhaps because R is a programming language/environment and not traditional software with proprietary types, the .rds format is not traditionally used in data exchange. Raw data and code would be enough to reproduce end-use data.
However, anecdotally among internal teams, many advanced useRs use this format to save cleaned data, plotting data, modeling results, etc., due to its efficiency as a binary and compressed serialization type without the need to parse text files and detect types. Also, the .rda format is the dominant data storage format in R packages. And this need routinely comes up on StackOverflow, usually with rpy2 (requires R installed) or pyreadr solutions. A convenience pandas handler could help the open source ecosystem.
@dsaxton, I didn't think of that name collision. We can call it read_rdata, especially since the aforementioned C library can read and write both .rds (a single R object) and .rda (multiple R objects) types.
To underscore how regularly this need has been asked on StackOverflow throughout the years:
I think supporting R data files is reasonable. The big question would be how support should be added. Would it be better to take a soft dep on pyreadr, like pandas does for most IO (e.g., openpyxl for Excel)? This way it will work as expected if this library is available. It saves the cost of maintaining a vendored code snippet and keeping it synced upstream. The downside is that new releases of a soft dep can break CI.
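For illustration, the wiring for a soft dep is small; a sketch using pandas' internal helper (the same pattern backing read_spss and read_excel), where _get_pyreadr is a hypothetical name:

```python
# Guard the optional package; raises an informative ImportError with
# install hints if pyreadr is missing.
from pandas.compat._optional import import_optional_dependency

def _get_pyreadr():
    return import_optional_dependency("pyreadr")
```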
Thanks, @bashtage, my thought is to bypass pyreadr altogether, which is simply a Python wrapper to the C library, librdata. We would only need to borrow and moderately adjust their Cython code (.pxd and .pyx), crediting the authors.
Specifically, for no external dependencies, the pandas plan would include:

- librdata's C src files (12 files at 102 kb), either in pandas._lib or pandas.io.rdata, crediting the author. (Rarely will this library need an update, as rds and rda are sacrosanct R core types.)
- .pxd and .pyx files for the r_parser module (i.e., pandas.io.sas cythonizes its _sas module the same way).
- r_parser used in a new pandas.io.rdata module as demonstrated above (with updated results).

Now, would the pandas team be open to a new io C extension? r_parser.so builds at the same size as _sas.so (~1 mb).
pyreadr developer here. I personally would suggest using pyreadr as a soft dep. It is not correct that the rds and rda formats do not change; they do with major and minor versions of R, and these changes are undocumented (see for example here, here, here). And we are still improving, as we cannot read all existing features, since again everything is undocumented. That means if you do your own code base, you will have to maintain it (maintenance would be completely on your side, since I don't have capacity to maintain two code bases).
I also develop pyreadstat; pandas uses it as a soft dep for read_spss, and that approach seems to be working really well.
Of course pyreadr is an open source project, so you are free to take the code. However, take into account that the license of pyreadr is very restrictive. I am not sure what kind of license pandas has, but you have to ensure that the restrictions for these pieces of code, even if they become detached from pyreadr, stay as strict as they are now. You will also need to distribute the pyreadr license and attached licenses together with the pandas license. I will also ask you to do the first commit with my GitHub handle so that I appear as a contributor to the repo.
Thank you, @ofajardo, for your input! First, do be aware you can take the lead on a PR for this proposed IO module. As an author who relies on pandas, why not become an original contributor? If using the soft dep approach, you can follow the similar setup of pandas.io.spss.
Given your response, here are my thoughts:
Circular Imports: If using pyreadr as a soft dep, we will be building a pandas module that imports a package that itself imports pandas. Many of the imported modules used in other pandas IO modules (beautifulsoup, lxml, openpyxl, etc.) do not import pandas or serve as wrappers to pandas. The contributor of read_spss may not have known about the underlying C library and proposed a packaged solution. I had hoped to streamline this with a direct connection to the lower level librdata.
Dependency Restriction: One of the great aspects of pandas is that it is a full service solution right off the shelf. Currently, the SAS and Stata IO modules do not require an additional pip-installed package to work. How can we allow that same convenience for the rdata IO module? Surely, as open source stacks, we can better integrate a data exchange solution between Python and R? Also, the sas7bdat and PyDTA (statsmodels) authors worked with pandas to build a viable solution from the original scripts without dependencies. See the topline credits in the sas (which also carries an author license) and stata scripts. I had hoped to work with you on a similar arrangement to integrate your Cython module.
Avoid Redundancy: The pandas IO API, mature now for years, includes a full suite of IO handling: reading from URLs, FTP, storage options like Amazon S3 buckets, file-like objects, buffers, and various compression types, with likely additional future features. Additionally, the pandas DataFrame API handles timezones, data type migration, categorical dtypes, and other nuanced needs with underlying core functions not available to end users of the pandas package. I had hoped to avoid this overlap of functionality and ensure consistency across other pandas IO modules.
Librdata Functionality: As mentioned earlier, rds and rda files rarely (not never) change. But any future issues pandas users encounter, we can report directly to the librdata authors, much like pyreadr does, or make the necessary changes to the Python, Cython, and C files. Also, as a proficient R user (gold badge holder on StackOverflow in R, Python, and pandas), I understand the object model of R well and so can adjust or recommend workarounds. As a soft dep, pyreadr limits the use of librdata to its needs, assumptions, and abilities (i.e., timezone, rownames). I had hoped to utilize the full functionality with direct access to librdata.
PEP Standards: Pandas mostly stays current on Python's PEP standards across its functionality, including type hinting/annotation, avoidance of relative imports, etc. A brief review of pyreadr does not indicate adherence to current PEP standards. I also see some need to update older usages such as OrderedDict (may not be needed for the Python 3.6+ recommended for pandas users) and the % operator for string formatting (de-emphasized in Python 3 but not yet deprecated). Possibly, too, the code base could integrate pythonic idioms like list/dict comprehensions, generators, f-strings, and others. With the rigorous quality standards of pandas using mypy, styling, and CI testing, I had hoped to ensure a robust rdata IO module.
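For illustration, the kind of modernization meant here (a plain dict preserves insertion order from Python 3.7, and f-strings supersede the % operator):

```python
from collections import OrderedDict

# Older style, kept for compatibility with Python < 3.6:
meta = OrderedDict([("name", "mtcars"), ("rows", 32)])
msg = "read %s with %d rows" % (meta["name"], meta["rows"])

# Modern equivalent on the Python versions pandas supports:
meta = {"name": "mtcars", "rows": 32}
msg = f"read {meta['name']} with {meta['rows']} rows"
```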
With that said, thank you for authoring various data exchange packages in the pandas ecosystem over the years! From the SO posts above, many have been grateful. I am looking into other solutions to build this specific IO support, and I may have a different approach in mind.
Circular Imports: If using pyreadr as a soft dep, we will be building a pandas module that imports a package that itself imports pandas. Many of the imported modules used in other pandas IO modules (beautifulsoup, lxml, openpyxl, etc.) do not import pandas or serve as wrappers to pandas. The contributor of read_spss may not have known about the underlying C library and proposed a packaged solution. I had hoped to streamline this with a direct connection to the lower level librdata.
It is important to acknowledge that there is a non-trivial developer cost to streamlining. There are three options here:

1. Leave the implementation outside pandas and document pyreadr in the IO tools docs as an ecosystem package.
2. Take pyreadr as a soft dep, as is done for most other IO formats.
3. Vendor a clean-sheet wrapper of librdata inside pandas and maintain it there.
Dependency Restriction: One of the great aspects of pandas is that it is a full service solution right off the shelf.
More IO formats do than don't. An incomplete list of formats that require a soft dep:

- Excel (openpyxl, xlrd, xlsxwriter)
- HTML (lxml, BeautifulSoup4, html5lib)
- Parquet (pyarrow or fastparquet) and Feather (pyarrow)
- HDF5 (PyTables)
- SQL (SQLAlchemy)
- SPSS (pyreadstat)
- BigQuery (pandas-gbq)
Avoid Redundancy: The pandas IO API, mature now for years, includes a full suite of IO handling: reading from URLs, FTP, storage options like Amazon S3 buckets, file-like objects, buffers, and various compression types, with likely additional future features. Additionally, the pandas DataFrame API handles timezones, data type migration, categorical dtypes, and other nuanced needs with underlying core functions not available to end users of the pandas package. I had hoped to avoid this overlap of functionality and ensure consistency across other pandas IO modules.
I don't see how this argues against a soft dependency. pandas could take it as a soft dep and still provide a uniform API on top, including building any missing features, or converting between what the dep prefers and what pandas prefers.
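As a minimal sketch of such a layer (assuming pyreadr is installed; read_rdata is a hypothetical name, not an existing pandas function):

```python
import pyreadr

def read_rdata(path):
    result = pyreadr.read_r(path)  # maps R object names -> DataFrames
    if list(result) == [None]:     # an .rds file holds one unnamed object:
        return result[None]        #   return a single DataFrame
    return dict(result)            # .rda: dict keyed by R object names
```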
Librdata Functionality: As mentioned earlier, rds and rda files rarely (not never) change.
There have been about a page full of commits in the past year. If these are all necessary then it seems to drift around a bit.
- PEP Standards: Pandas mostly stays current on Python's PEP standards across its functionality, including type hinting/annotation, avoidance of relative imports, etc. A brief review of pyreadr does not indicate adherence to current PEP standards. I also see some need to update older usages such as OrderedDict (may not be needed for the Python 3.6+ recommended for pandas users) and the % operator for string formatting (de-emphasized in Python 3 but not yet deprecated). Possibly, too, the code base could integrate pythonic idioms like list/dict comprehensions, generators, f-strings, and others. With the rigorous quality standards of pandas using mypy, styling, and CI testing, I had hoped to ensure a robust rdata IO module.
I don't think this is much of an argument against the soft dep approach. Each package is allowed to have its own accepted code style. NumPy is pretty far from "full" PEP, yet no one suggests not building on NumPy.
You also seem to acknowledge pyreadr in your Cython above. You cannot use any code from pyreadr since it is GPL. A vendored version would need to be a clean-sheet implementation that directly wraps the C library without using code from pyreadr.
hey @ParfaitG thanks for your thoughtful answer!
I am still aligned with @bashtage's thinking that a soft dep is better in this (and other IO) case(s), and in general that modular is better than monolithic. It seems that there are enough successful examples of this approach in pandas, as @bashtage has pointed out, to demonstrate the approach works very well... But that's just my humble opinion, and I am not a pandas dev, so it's up to you guys to decide!
In case you guys would like to go for a soft dep, you have my full collaboration to make changes in pyreadr to align and better integrate with pandas, including cleaning up the code to make it more PEP conforming, either doing it myself or accepting PRs from others.
As @bashtage suggests, notice that the license of pyreadr is AGPL, so it probably clashes with pandas and indeed you cannot take it. But doing a better wrapper for librdata from scratch (or some other approach, as you mentioned) should be no issue for you, in case you guys decide to go for an internal module.
Just a couple of other comments:
pyreadr limits the use of librdata to its needs, assumptions, and abilities (i.e., timezone, rownames). I had hoped to utilize the full functionality with direct access to librdata.
I actually am using the full librdata API, trying to be as comprehensive as possible. Librdata currently has a lot of limitations you won't be able to overcome unless you directly contribute to the librdata C code. If you include librdata as a hard dep, you will start getting issues around R lists not read, S4 objects not read, etc. (just check the pyreadr and librdata issues to see what I mean). I currently don't have capacity to work on those issues, but if you do, and you fix those things in librdata plus your internal module, that would be a step forward! And in case you decide on a soft dep and have ideas on how to improve pyreadr and would like to contribute, you would be very welcome!
However, if you truly want to be in full control of the process and overcome the current limitations imposed by librdata, you should consider writing the converter truly from scratch without relying on librdata.
I also see some need to update older uses such as OrderedDict
I decided to support older versions of Python as much as possible, but I understand your disagreement with that. What I see in reality is that we do have old production servers with old CentOS which are still running Python 3.5 and 3.4; hence the motivation to keep backward compatibility at the expense of PEP styling.
Surely, as open source stacks, we can better integrate a data exchange solution between Python and R?
I hope that data exchange between Python, R, and other technologies happens through libraries designed for that purpose, such as feather/arrow. R binary files are undocumented and fully supported only by R, having very low interoperability, so in my opinion they are a very poor solution for data exchange.
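For comparison, a minimal Feather round-trip of the kind recommended here (pyarrow is needed on the Python side, the arrow package on the R side; the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df.to_feather("exchange.feather")           # requires pyarrow
back = pd.read_feather("exchange.feather")  # round-trips types intact
# In R:  df <- arrow::read_feather("exchange.feather")
```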
@ofajardo, FYI re documentation: see the CRAN doc R Internals (updated 2021-03-05) at section 1.8 Serialisation Formats. Also, see serialize.c (the underlying C code to readRDS and saveRDS), with a large docstring discussing the evolution of versions, and its parent caller, serialize.R. Similarly, see saveload.c for save/load, with its parent caller, load.R. Also read the old NEWS dating back to pre R 1.0 for changes/bug fixes to various methods including load/save and readRDS/saveRDS.
Thank you @bashtage, for your comments. Not to belabor this discussion (and I appreciate your time), how about a 4th option: a direct command line call to R to access the readRDS/saveRDS and load/save methods? While this requires users to have R installed (usually included in Anaconda distributions), we would not need any additional libraries or dependencies and could stay current with any R changes to the .rda and .rds file types. Otherwise, I am inclined toward your first option of an IO tools doc change (either a new R section or under the Other file formats section) to include features of the pandas external add-on tool, pyreadr, and invite the author to contribute.
For option 4, consider the demo below for .rds types, which converts data to .csv, respecting column names, row names/indexes, and data types, using temp files and a temp directory (like SAS's Work library and Stata's preserve/restore and tempfile). Tested on Windows and Linux. Vanilla, base R is all that is needed (no third-party packages on that end), but users need the R bin folder in the Path environment variable to call Rscript (this can be a small note with instructions in the docs). The pandas IO clipboard module also makes subprocess CLI calls across different OSes. Some CI tests would need an installed R interpreter in those environments, if not already present.
R

```r
mtcars$now <- Sys.time()  # ADD TIME FOR DEMONSTRATION
saveRDS(mtcars, "mtcars.rds")
```
Python

```python
from datetime import datetime
import os
from subprocess import Popen, PIPE
from tempfile import TemporaryDirectory

import pandas as pd


def read_rdata(rds_file):
    cmd = "Rscript"
    r_to_py_types = {'logical': 'bool', 'integer': 'int64', 'numeric': 'float64',
                     'character': 'str', 'factor': 'str', 'Date': 'date', 'POSIXct': 'date'}

    with TemporaryDirectory() as tmpdir:
        py_csv = os.path.join(tmpdir, "pydata.csv")

        # BUILD TEMP SCRIPT TO READ RDS AND OUTPUT TO CONSOLE
        r_code = os.path.join(tmpdir, "r_batch.R")
        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
df_r <- readRDS(args[length(args)-1])
write.csv(df_r, file=args[length(args)])
cat(paste(colnames(df_r), collapse=","),"|",
    paste(sapply(df_r, function(x) class(x)[1]), collapse=","),
    sep="")
""")

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR
        cmds = [cmd, r_code, rds_file, py_csv]
        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))

        r_hdrs = [h.split(",") for h in output.decode("UTF-8").split("|")]
        py_types = {n: r_to_py_types[d] for n, d in zip(*r_hdrs)}
        dt_cols = [col for col, d in py_types.items() if d == "date"]
        py_types = {k: v for k, v in py_types.items() if v != "date"}

        # IMPORT PANDAS DATA FRAME
        df = pd.read_csv(py_csv, index_col=0, dtype=py_types, parse_dates=dt_cols)
        return df


py_df = read_rdata("mtcars.rds")

print(py_df.dtypes)
# mpg            float64
# cyl            float64
# disp           float64
# hp             float64
# drat           float64
# wt             float64
# qsec           float64
# vs             float64
# am             float64
# gear           float64
# carb           float64
# now     datetime64[ns]
# dtype: object

print(py_df.head())
#                     mpg  cyl   disp     hp  drat     wt   qsec   vs   am  gear  carb                 now
# Mazda RX4          21.0  6.0  160.0  110.0  3.90  2.620  16.46  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Mazda RX4 Wag      21.0  6.0  160.0  110.0  3.90  2.875  17.02  0.0  1.0   4.0   4.0 2021-03-22 10:58:34
# Datsun 710         22.8  4.0  108.0   93.0  3.85  2.320  18.61  1.0  1.0   4.0   1.0 2021-03-22 10:58:34
# Hornet 4 Drive     21.4  6.0  258.0  110.0  3.08  3.215  19.44  1.0  0.0   3.0   1.0 2021-03-22 10:58:34
# Hornet Sportabout  18.7  8.0  360.0  175.0  3.15  3.440  17.02  0.0  0.0   3.0   2.0 2021-03-22 10:58:34

print(py_df.tail())
#                     mpg  cyl   disp     hp  drat     wt   qsec   vs   am  gear  carb                 now
# Lotus Europa       30.4  4.0   95.1  113.0  3.77  1.513  16.90  1.0  1.0   5.0   2.0 2021-03-22 10:58:34
# Ford Pantera L     15.8  8.0  351.0  264.0  4.22  3.170  14.50  0.0  1.0   5.0   4.0 2021-03-22 10:58:34
# Ferrari Dino       19.7  6.0  145.0  175.0  3.62  2.770  15.50  0.0  1.0   5.0   6.0 2021-03-22 10:58:34
# Maserati Bora      15.0  8.0  301.0  335.0  3.54  3.570  14.60  0.0  1.0   5.0   8.0 2021-03-22 10:58:34
# Volvo 142E         21.4  4.0  121.0  109.0  4.11  2.780  18.60  1.0  1.0   4.0   2.0 2021-03-22 10:58:34
```
Python

```python
def write_rdata(frame, rds_file):
    cmd = "Rscript"
    py_to_r_types = {'int32': 'integer', 'int64': 'integer', 'float64': 'numeric',
                     'object': 'character', 'bool': 'logical', 'datetime64[ns]': 'POSIXct'}
    r_types = ",".join(frame.reset_index().dtypes.astype(str).replace(py_to_r_types))

    with TemporaryDirectory() as tmpdir:
        py_csv = os.path.join(tmpdir, "py_df.csv")
        frame.to_csv(py_csv)

        # BUILD TEMP SCRIPT TO INPUT CSV AND SAVE RDS
        r_code = os.path.join(tmpdir, "r_batch.R")
        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
py_csv <- args[length(args)-2]
r_types <- strsplit(args[length(args)-1], ",")[[1]]
df_r <- read.csv(py_csv, colClasses=r_types)
df_r <- `row.names<-`(df_r[-1], df_r[[1]])
saveRDS(df_r, args[length(args)])
""")

        # SET UP COMMAND LINE ARGS, RUN COMMAND, RECEIVE OUTPUT/ERROR
        cmds = [cmd, r_code, py_csv, r_types, rds_file]
        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))
    return None


py_df = (pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv")
           .assign(now=datetime.now()))  # ADD TIME FOR DEMONSTRATION
write_rdata(py_df, "mpg.rds")
```
R

```r
r_df <- readRDS("mpg.rds")

str(r_df)
# 'data.frame': 398 obs. of 10 variables:
#  $ mpg         : num 18 15 18 16 17 15 14 14 14 15 ...
#  $ cylinders   : int 8 8 8 8 8 8 8 8 8 8 ...
#  $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
#  $ horsepower  : num 130 165 150 150 140 198 220 215 225 190 ...
#  $ weight      : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
#  $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#  $ model_year  : int 70 70 70 70 70 70 70 70 70 70 ...
#  $ origin      : chr "usa" "usa" "usa" "usa" ...
#  $ name        : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
#  $ now         : POSIXct, format: "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" "2021-03-22 10:58:40" ...

head(r_df)
#   mpg cylinders displacement horsepower weight acceleration model_year origin                      name                 now
# 0  18         8          307        130   3504         12.0         70    usa chevrolet chevelle malibu 2021-03-22 10:58:40
# 1  15         8          350        165   3693         11.5         70    usa         buick skylark 320 2021-03-22 10:58:40
# 2  18         8          318        150   3436         11.0         70    usa        plymouth satellite 2021-03-22 10:58:40
# 3  16         8          304        150   3433         12.0         70    usa             amc rebel sst 2021-03-22 10:58:40
# 4  17         8          302        140   3449         10.5         70    usa               ford torino 2021-03-22 10:58:40
# 5  15         8          429        198   4341         10.0         70    usa          ford galaxie 500 2021-03-22 10:58:40

tail(r_df)
#     mpg cylinders displacement horsepower weight acceleration model_year origin             name                 now
# 392  27         4          151         90   2950         17.3         82    usa chevrolet camaro 2021-03-22 10:58:40
# 393  27         4          140         86   2790         15.6         82    usa  ford mustang gl 2021-03-22 10:58:40
# 394  44         4           97         52   2130         24.6         82 europe        vw pickup 2021-03-22 10:58:40
# 395  32         4          135         84   2295         11.6         82    usa    dodge rampage 2021-03-22 10:58:40
# 396  28         4          120         79   2625         18.6         82    usa      ford ranger 2021-03-22 10:58:40
# 397  31         4          119         82   2720         19.4         82    usa       chevy s-10 2021-03-22 10:58:40
```
I will get started on a PR that will read/write R data files in two engine versions: pyreadr (as soft dep) and the R interpreter. (Easy r install in the pandas-dev conda environment.) And I will add an R section to the IO tools docs. For the R engine, I will integrate parquet and feather modes using pyarrow and R's counterpart, the arrow package (the above demo runs csv mode). With a working solution, we can then tangibly see and decide how to proceed. Hopefully, we can make it for the 1.3 milestone!
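As a rough sketch of what the feather mode might look like (assuming pyarrow on the Python side and the arrow package on the R side; write_rdata_feather is an illustrative name mirroring the csv-mode demo above, and index/rownames handling is simplified):

```python
import os
from subprocess import Popen, PIPE
from tempfile import TemporaryDirectory

def write_rdata_feather(frame, rds_file):
    with TemporaryDirectory() as tmpdir:
        # pandas writes a Feather file (to_feather needs a default index,
        # so rownames are not preserved in this sketch)
        feather_file = os.path.join(tmpdir, "py_df.feather")
        frame.reset_index(drop=True).to_feather(feather_file)

        # BUILD TEMP SCRIPT TO READ FEATHER AND SAVE RDS
        r_code = os.path.join(tmpdir, "r_batch.R")
        with open(r_code, "w") as f:
            f.write("""args <- commandArgs(trailingOnly=TRUE)
df_r <- as.data.frame(arrow::read_feather(args[length(args)-1]))
saveRDS(df_r, args[length(args)])
""")

        cmds = ["Rscript", r_code, feather_file, rds_file]
        a = Popen(cmds, stdin=PIPE, stdout=PIPE, stderr=PIPE)
        output, error = a.communicate()
        if error:
            print(error.decode("UTF-8"))
```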
Currently, pandas IO tools for binary files largely support the commercial statistical packages (SAS, Stata, SPSS). Interestingly, R binary types (.rds, .rda) are not included. Since many data science teams work between the open source stacks, some pandas IO support of R data files may be worthwhile to pursue.

I know there is some history of pandas with rpy2. However, there may be a way to integrate an IO module for R data files without an optional dependency (i.e., pyreadr) by using a lightweight C library: librdata. Also, R's saveRDS uses compression types (gzip, bzip2, and xz) already handled by pandas io. Thanks to the authors of pyreadr and librdata (not unlike the sas7bdat authors for read_sas or the PyDTA authors for read_stata), I was able to implement a demo on an uncompressed rds type.

(Demo code blocks omitted here: R setup, a Python parser and writer using a Cython-built module, and an R round-trip check.)