Open gaow opened 6 years ago
@gaow For the initial version, I'm going to propose a stripped-down, bare-bones version. Hence the name "barebones data object (BDO)".
Barebones data object:
NA in R) are not allowed.
All the data types you proposed above can be represented in this format, although it will take some extra steps to convert to the desired representation; e.g., to convert from a list of vectors to a data frame in R.
Note that I avoided integers, since most integers can be represented as doubles, and there are inconsistencies in the way that integers are implemented in R and Python which will cause trouble.
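To make the "most integers can be represented as doubles" point concrete, here is a minimal Python sketch. The helper name `int_survives_as_double` is hypothetical (not part of any proposal in this thread); it only checks whether an integer round-trips through a 64-bit float unchanged. Every integer with magnitude up to 2**53 does; beyond that, some do not.

```python
# Integers with magnitude up to 2**53 are exactly representable as IEEE 754 doubles.
MAX_EXACT_INT = 2 ** 53


def int_survives_as_double(n: int) -> bool:
    """Return True if n converts to a double and back without changing."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # n is too large to be converted to a double at all.
        return False


print(int_survives_as_double(42))                 # a small integer
print(int_survives_as_double(MAX_EXACT_INT))      # the boundary value 2**53
print(int_survives_as_double(MAX_EXACT_INT + 1))  # first integer that is lost
```

So storing integer-valued data as doubles is lossless for typical counts and indices, and only breaks down for very large values.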
We will use h5py in Python and the hdf5r package in R to read/write BDOs to HDF5 files.
See here for reference on basic data types in R. See here for reference on the hdf5r package.
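On the Python side, a write/read round trip with h5py might look roughly like the sketch below. The function names `write_bdo` and `read_bdo` and the file layout (one double-precision dataset per top-level name) are assumptions made for illustration; the thread does not fix an on-disk layout.

```python
import os
import tempfile

import h5py
import numpy as np


def write_bdo(path, obj):
    """Write a dict of name -> numeric vector/array, one float64 dataset per name."""
    with h5py.File(path, "w") as f:
        for name, value in obj.items():
            # Assumption: everything is stored as double precision (no integer type).
            f.create_dataset(name, data=np.asarray(value, dtype=np.float64))


def read_bdo(path):
    """Read every top-level dataset back into a dict of NumPy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][()] for name in f}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.h5")
    write_bdo(path, {"x": [1, 2, 3], "m": [[1.0, 2.0], [3.0, 4.0]]})
    restored = read_bdo(path)

print(restored["x"].dtype, restored["m"].shape)
```

Note that the integer vector `[1, 2, 3]` comes back as doubles, which is the intended behavior under the no-integers assumption.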
Great thanks @pcarbo for the outline. I mostly agree with what you have suggested. Here are a few issues, though:
matrix and data.frame? Or only for now?
Since R will potentially have more restrictions than Python, it may be a good idea to have R-based I/O functions and results first; then I'll try to make them Python compatible.
Looking forward!
Why are we leaving out matrix and data.frame?
A matrix is a 2-d array.
A data frame is just a list of vectors (with some extra attributes like rownames).
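As a rough illustration of "a list of vectors plus row names", the following plain-Python sketch turns named, equal-length columns into per-row records. The helper `to_records` is hypothetical, and the default row names "1", "2", ... are only meant to mimic R's defaults.

```python
def to_records(columns, rownames=None):
    """Convert a dict of equal-length columns into a dict of rows keyed by row name."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    if rownames is None:
        # Assumption: default to R-style row names "1", "2", ...
        rownames = [str(i + 1) for i in range(n_rows)]
    if len(rownames) != n_rows:
        raise ValueError("need exactly one row name per row")
    return {
        rowname: {name: col[i] for name, col in columns.items()}
        for i, rowname in enumerate(rownames)
    }


records = to_records({"a": [1.0, 2.0], "b": ["x", "y"]}, rownames=["r1", "r2"])
print(records["r2"])
```

The point is that nothing beyond the column vectors and the row-name attribute needs to be stored to rebuild the table.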
In R, is it important to distinguish between int and double?
It is important, but not essential; integers are represented differently in R and Python, so it seemed like a major headache to deal with this data type. See for example here and here for some complexities.
Related to this issue is support for multiple explicit file outputs per module. If we can get that to work, we'll be able to load files directly, although users will have to provide the means to load data for different languages.
@pcarbo and I have decided to give HDF5 a stab as a replacement for the current default RDS storage format. We start from R and Python. The basic data types we'd like to support are:
np for numpy, pd for pandas.
Here is a test on Python's end:
Here is the outcome in HDF5:
test.h5.zip
I used this API from UChicago:
https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py
But it would not be difficult, I presume, to customize.
A particularly difficult case is NULL/NA/NaN in R. In Python there are only None and NaN, no NULL. #25
@pcarbo am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how we make the cross-language format consistent.
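For reference, this is how the two Python-side "missing" values can be told apart using only the standard library. The helper `classify_missing` is a hypothetical name used for illustration, and the sketch does not propose any particular mapping to R's NULL/NA/NaN.

```python
import math


def classify_missing(value):
    """Label a value as 'None', 'NaN', or an ordinary 'value'."""
    if value is None:
        return "None"
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    return "value"


nan = float("nan")
print(classify_missing(None), classify_missing(nan), classify_missing(0.0))
# NaN never compares equal to itself, so equality checks cannot detect it.
print(nan == nan)
```

Because None is an object rather than a floating-point value, it cannot sit inside a numeric array the way NaN can, which is part of what makes a consistent cross-language encoding awkward.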