grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Failure to write dataframe with large number of columns. #72

Closed: mmccarthy404 closed this issue 3 years ago

mmccarthy404 commented 3 years ago

I'm experiencing the following error when trying to write a dataframe with approximate dimensions of 150 by 2000:

 Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...) : 
  HDF5. Dataset. Unable to initialize object. 

I created this small example to highlight the problem that I believe is due to the large number of columns:

write_hdf5 <- function() {
  # Create temp file.
  tmp <- tempfile(fileext=".h5")

  # Create large dataframe.
  dim <- 1500
  data <- data.frame(matrix(sample(c(0,1), replace=TRUE, size=dim*dim), nrow = dim))

  # Write HDF5.
  rhdf5::h5createFile(file = tmp)
  rhdf5::h5createGroup(tmp, "grp_1")
  rhdf5::h5write(head(data), tmp, "grp_1/data", DataFrameAsCompound = TRUE)

  # Delete temp file.
  unlink(tmp)
}

write_hdf5()

For me, this fails to write any dataframe with more than 192 columns.

Any assistance is appreciated!

grimbough commented 3 years ago

I think this is due to a limitation of HDF5 itself: the object header (in this case the object is a compound dataset representing the data.frame) has a maximum size of 64 KB, and once we exceed 1092 columns the compound member descriptions no longer fit within that limit. Here's another example of someone hitting the same problem: https://forum.hdfgroup.org/t/storing-a-large-number-of-floating-point-numbers-that-evolves-over-time/4026
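
If you want to see where the threshold falls on your own system, a quick probe along these lines should reproduce it (probe_columns is a hypothetical helper written for illustration, not part of rhdf5):

probe_columns <- function(n) {
  # Write a tiny data.frame with n columns as a compound dataset and
  # report whether the write succeeded.
  tmp <- tempfile(fileext = ".h5")
  rhdf5::h5createFile(tmp)
  df <- data.frame(matrix(0, nrow = 2, ncol = n))
  res <- try(rhdf5::h5write(df, tmp, "data", DataFrameAsCompound = TRUE),
             silent = TRUE)
  # A failed write can leave handles open, so tidy up before deleting.
  rhdf5::h5closeAll()
  unlink(tmp)
  !inherits(res, "try-error")
}

probe_columns(100)   # TRUE: well under the object header limit
probe_columns(2000)  # FALSE: compound member metadata overflows the 64 KB header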

I don't think there's a direct fix for this; instead, you might need to write the data in a slightly different format.

The simplest alternative is to use:

rhdf5::h5write(head(data), tmp, "grp_1/data", DataFrameAsCompound = FALSE)

This will create a separate dataset for each column in the data.frame, named using the column names.

If you need to read the data back into R, h5read(..., "grp_1/data") will return all the columns as a list, which you can then turn into a data.frame.
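
Sketching that round trip, reusing tmp and data from the example above:

# Write one dataset per column instead of a single compound dataset.
rhdf5::h5write(head(data), tmp, "grp_1/data", DataFrameAsCompound = FALSE)

# Reading the group back returns a named list with one element per column.
res <- rhdf5::h5read(tmp, "grp_1/data")
df <- as.data.frame(res)

# The list elements may come back in HDF5's name order rather than the
# original column order, so reorder if that matters.
df <- df[, colnames(data)]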

Alternatively, if your data.frame is all of one type (in your example it starts life as a matrix of numeric values) you could stick with the matrix format, and rhdf5 will write a 2D dataset rather than a compound one. The dimension limits for a single datatype are huge, so 150 by 2000 is no problem, but this only works if your data.frame can be sensibly coerced to a matrix.
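
A minimal sketch of the matrix route, again reusing tmp and data from above (the dataset names grp_1/data_mat and grp_1/data_colnames are made up for illustration):

# Coerce to a matrix and write a plain 2D dataset; there's no compound
# type, so the object header limit on member descriptions doesn't apply.
mat <- as.matrix(data)
rhdf5::h5write(mat, tmp, "grp_1/data_mat")

# Column names aren't stored with the 2D dataset, so keep them separately
# if you need them on the way back in.
rhdf5::h5write(colnames(data), tmp, "grp_1/data_colnames")

mat2 <- rhdf5::h5read(tmp, "grp_1/data_mat")
colnames(mat2) <- rhdf5::h5read(tmp, "grp_1/data_colnames")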

It's hard to know which will be better without knowing a bit more about the actual data you're trying to write, but I'm happy to iterate on advice.