grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

bug and performance problems with "wide" data frames #40

Open benj919 opened 5 years ago

benj919 commented 5 years ago

I'm trying to store a data frame in an HDF5 file. Unfortunately, that fails for data frames with more than a certain number of columns. I've narrowed it down: a data frame with 1 row and 1093 columns fails, while one with 1092 columns works. Setting a small chunk size does not change the outcome; as far as I understand it, the chunk size applies to the number of rows, and there is only one row in this example.
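For reference, the shape of the test data frames built in the script below can be checked with base R alone. This is only a sketch of the construction: `t()` applied to a plain vector returns a one-row matrix, so `data.frame()` produces a single row with one column per value.

```r
# Same construction as in the sample script; no rhdf5 needed here
df_ok   <- data.frame(t(c(1:1092) * 0.5))
df_fail <- data.frame(t(c(1:1093) * 0.5))

# dim() reports rows first, then columns
print(dim(df_ok))
print(dim(df_fail))

# Columns get default names X1, X2, ...
print(head(names(df_fail), 3))
```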

Additionally, after successfully writing the smaller data frame to a file, reading it back in takes more than a full minute.

Here is a small sample script that illustrates both issues:

library(rhdf5)
# cleanup and create test file
h5closeAll()
if (file.exists("test.h5")) { file.remove("test.h5") }
h5createFile("test.h5")
h5createGroup("test.h5", "test")

# create test data frames: one row each, with 1092 and 1093 columns
df_ok <- data.frame(t(c(1:1092)*0.5))
df_fail <- data.frame(t(c(1:1093)*0.5))

# write 
h5write(df_ok, file="test.h5", name="test/df_ok")
h5write(df_fail, file="test.h5", name="test/df_fail")

# read
st <- proc.time()
df_ok_read <- h5read(file="test.h5", name="test/df_ok")
proc.time() - st

The relevant outputs follow. For the write failure:

Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...) : 
  HDF5. Dataset. Unable to initialize object.

And for the read-back timing:

   user  system elapsed 
 85.963   0.190  86.213 
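For comparison, a possible workaround is to store the same values as a plain numeric matrix instead of a data frame. The following is a minimal sketch, not a confirmed fix: it assumes all columns are numeric, and the temporary file and the dataset names `m_wide` and `m_wide_colnames` are hypothetical. It has not been checked against rhdf5 2.28.0 specifically.

```r
library(rhdf5)

# Hypothetical scratch file, kept separate from test.h5
f <- tempfile(fileext = ".h5")
h5createFile(f)

# Same wide data frame that fails to write above
df_wide <- data.frame(t(c(1:1093) * 0.5))

# Assumption: every column is numeric, so as.matrix() gives a numeric matrix
m <- as.matrix(df_wide)
h5write(m, file = f, name = "m_wide")

# Column names are stored separately because the matrix dataset does not carry them
h5write(colnames(df_wide), file = f, name = "m_wide_colnames")

# Read back, timing the read with system.time()
timing <- system.time(m_read <- h5read(file = f, name = "m_wide"))
print(timing)

# Rebuild a data frame from the matrix and the stored names
df_back <- as.data.frame(m_read)
names(df_back) <- as.character(h5read(file = f, name = "m_wide_colnames"))

h5closeAll()
```

This loses the per-column types a data frame can hold, so it only applies when the columns are homogeneous.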

Tested on a laptop running 64-bit Ubuntu 18.04 with 16 GB RAM, using R 3.6.0 and rhdf5 2.28.0.

> version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          3                           
minor          6.0                         
year           2019                        
month          04                          
day            26                          
svn rev        76424                       
language       R                           
version.string R version 3.6.0 (2019-04-26)
nickname       Planting of a Tree