stephenslab / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

New format on DSC data #86

Open gaow opened 6 years ago

gaow commented 6 years ago

@pcarbo and I have decided to try HDF5 as a replacement for the current default RDS storage format. We will start with R and Python. The basic data types we'd like to support are:

| HDF5 | R | Python |
|------|---|--------|
| ? | character | str |
| ? | integer | int, np.int, np.uint |
| ? | double | float, np.float* |
| ? | vector | list, np.array |
| ? | matrix | np.matrix |
| ? | array | np.array, list of lists |
| ? | data.frame | pd.DataFrame |
| ? | NaN | np.nan |
| ? | NA | None |

Here np stands for numpy and pd for pandas. The HDF5 column is still to be filled in. Here is a test on the Python side:

import numpy as np
import pandas as pd
data = {'character': 'pcarbo',
        'integer1': 1, 'integer2': np.uint8(1), 
        'double1': 1.0, 'double2': np.float16(1.0), 
        'vector1': [1,2,'gaow'], 'vector2': [1,2,3], 'vector3': np.array([1,2,3]),
        'matrix': np.matrix([[1,2],[3,4]]),
        'array1': np.array([[1,2],[3,4]]), 'array2': [[1,2],[3,4]],
        'dataframe': pd.DataFrame({'A': [1,2], 'B': [3,4]}, index=['row1', 'row2'])
       }
data['recursive'] = data  # self-reference: the dict now contains itself

Here is the outcome in HDF5:

test.h5.zip

I used this API from UChicago:

https://github.com/uchicago-cs/deepdish/blob/master/deepdish/io/hdf5io.py

But it would not be difficult, I presume, to customize.

A particularly difficult case is NULL/NA/NaN in R. In Python there are only None and NaN, with no separate NULL. #25
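To make the Python side of this concrete, here is a minimal sketch using only the standard library. The helper name `classify_missing` is made up for illustration; it only shows that Python can tell None and NaN apart, so a third R state (NULL or NA) has no natural counterpart without some extra convention.

```python
import math

def classify_missing(value):
    """Return 'None', 'NaN', or 'value' for a Python scalar (hypothetical helper)."""
    if value is None:
        return "None"
    if isinstance(value, float) and math.isnan(value):
        return "NaN"
    return "value"

nan = float("nan")
print(classify_missing(None), classify_missing(nan), classify_missing(0.0))
# NaN never compares equal to itself, so it must be detected with math.isnan.
print(nan == nan)
```

How these two Python states should map onto R's three is exactly the open question here; the sketch does not answer it.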

@pcarbo am I missing any? It would be interesting if you could check that HDF5 file in R. Hopefully that gives us some useful insights into how to make the format consistent across languages.

pcarbo commented 6 years ago

@gaow For the initial version, I'm going to propose a stripped-down, bare-bones version. Hence the name "barebones data object (BDO)".

Barebones data object:

  1. Only one object is stored in an hdf5 file. The object is a list.
  2. All elements of the list are stored in separate nodes ("groups").
  3. Each list element may be one of: (a) array containing double-precision floating point numbers ("doubles"), (b) array containing character strings, or (c) a list.
  4. Lists within lists are stored hierarchically as subnodes in the hdf5 file.
  5. Each list element may have zero, one or more named attributes. Each of these attributes is an array storing characters or doubles (lists are not allowed).
  6. Missing values (NA in R) are not allowed.
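The rules above can be sketched as a small validator. This is only an illustration under stated assumptions, not the actual BDO implementation: an R list is modelled as a Python dict with string keys, an array as a flat Python list, a missing value as None, and attributes (rule 5) are left out. The function name `bdo_nodes` is hypothetical.

```python
def bdo_nodes(obj, path=""):
    """Check a nested dict against the BDO rules and return (node path, type) pairs.

    Assumptions for this sketch: dict = R list, flat Python list = array,
    None = missing value. Attributes (rule 5) are not modelled.
    """
    if not isinstance(obj, dict):
        # Rule 1: the stored object (and every sublist) is a list.
        raise TypeError(f"{path or '/'}: expected a dict")
    nodes = []
    for name, value in obj.items():
        node = f"{path}/{name}"  # Rule 2: one node per list element.
        if isinstance(value, dict):
            # Rule 4: lists within lists become subnodes.
            nodes.extend(bdo_nodes(value, node))
        elif isinstance(value, list):
            if any(v is None for v in value):
                # Rule 6: missing values are not allowed.
                raise ValueError(f"{node}: missing values are not allowed")
            # Rule 3: arrays of doubles or arrays of strings only.
            if all(isinstance(v, float) for v in value):
                nodes.append((node, "double"))
            elif all(isinstance(v, str) for v in value):
                nodes.append((node, "character"))
            else:
                raise TypeError(f"{node}: expected all doubles or all strings")
        else:
            raise TypeError(f"{node}: expected an array or a nested dict")
    return nodes

example = {"x": [1.0, 2.0], "labels": ["a", "b"], "nested": {"y": [3.0]}}
print(bdo_nodes(example))
```

Integers are rejected on purpose in this sketch, so a caller would convert them to floats before storing.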

All the data types you proposed above can be represented in this format, although it will take some extra steps to convert to the desired representation; e.g., to convert from a list of vectors to a data frame in R.

Note that I avoided integers, since most integers can be represented as doubles, and there are inconsistencies in the way that integers are implemented in R and Python which will cause trouble.
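As a rough check of the "most integers" claim, the sketch below tests whether an integer survives a round trip through a 64-bit float. The helper name `fits_in_double` is made up for illustration. It relies on two facts: doubles represent every integer exactly up to 2^53, and R's integer type is 32-bit while Python's int is unbounded.

```python
def fits_in_double(n):
    """True if the integer n survives a round trip through a 64-bit float."""
    try:
        return int(float(n)) == n
    except OverflowError:
        # Python ints are unbounded; very large ones cannot be converted at all.
        return False

print(fits_in_double(2**31 - 1))  # largest R integer
print(fits_in_double(2**53))      # last point before gaps appear
print(fits_in_double(2**53 + 1))  # rounds to a neighbouring double
```

So every R integer is safe as a double, but a large Python int may silently lose precision or fail to convert.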

We will use h5py in Python and the hdf5r package in R to read/write BDOs to hdf5 files.

See here for reference on basic data types in R. See here for reference on the hdf5r package.

gaow commented 6 years ago

Great thanks @pcarbo for the outline. I mostly agree with what you have suggested. Here are a few issues, though:

  1. Why are we leaving out matrix and data.frame? Or is that only for now?
  2. In R, is it important to distinguish between int and double?

Since R will potentially have more restrictions than Python, it may be a good idea to have the R-based I/O functions and results first; then I'll try to make it Python compatible.

Looking forward!

pcarbo commented 6 years ago

> Why are we leaving out matrix and data.frame?

A matrix is a 2-d array.

A data frame is just a list of vectors (with some extra attributes like rownames).
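Here is a minimal sketch of that idea in plain Python, using the small A/B table from the earlier test as input. The function names and the `columns`/`attributes` layout are assumptions made for illustration, not a proposed file layout: a table is split into column vectors of doubles plus a row-name attribute, and then rebuilt.

```python
def frame_to_columns(rows, row_names):
    """Split row records into column vectors plus a row-name attribute (hypothetical layout)."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Store every number as a double, as the BDO proposal suggests.
            columns.setdefault(key, []).append(float(value))
    return {"columns": columns, "attributes": {"rownames": list(row_names)}}

def columns_to_frame(stored):
    """Rebuild a {row name: {column: value}} mapping from the stored vectors."""
    names = stored["attributes"]["rownames"]
    cols = stored["columns"]
    return {name: {key: vec[i] for key, vec in cols.items()}
            for i, name in enumerate(names)}

rows = [{"A": 1, "B": 3}, {"A": 2, "B": 4}]
stored = frame_to_columns(rows, ["row1", "row2"])
print(stored["columns"])
print(columns_to_frame(stored))
```

The round trip recovers the table, with the integers coming back as doubles.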

> In R, is it important to distinguish between int and double?

It is important, but not essential; integers are represented differently in R and Python, so it seemed like a major headache to deal with this data type. See for example here and here for some complexities.

gaow commented 6 years ago

Related to this issue is support for multiple explicit file outputs per module. If we can get that to work, we'll be able to load files directly, although users will have to provide the means to load the data in each language.