grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Limited number of records #30

Closed RamilNurtdinov closed 5 years ago

RamilNurtdinov commented 5 years ago

Dear Colleagues

I am working with the diffHic package (https://bioconductor.org/packages/release/bioc/html/diffHic.html). diffHic in turn uses rhdf5 to store its data; as far as I can tell, it stores six integers in each row.

The amount of sequencing data (reads) is now growing so much that I have to deal with tens of billions of rows, which, among other things, causes integer overflow in R.

Back to rhdf5. I am trying to merge two "h5" files. Both are approximately 24G in size. I expect that each contains more than 2 billion rows (up to 5 billion, 5*10^9), and I get this error:

Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...) :
  HDF5. Dataset. Unable to initialize object.
Calls: mergePairs ... h5write.default -> h5writeDataset -> h5writeDataset.data.frame
Execution halted

I believe this is caused by the many billions of rows. It would be nice if your package were able to operate on that many rows.
The continued growth of biological sequencing data demands it.

RamilNurtdinov commented 5 years ago

I have an update. The issue is related to the number of rows, but is slightly different from what I first thought. I observed the error on two computers with different versions of Linux and R.

This is a simple example:

library(rhdf5)

counts_1 <- (1:500000000)
counts_2 <- (1:500000000) + 1
counts_3 <- (1:500000000) + 2
counts_4 <- (1:500000000) + 3
counts_5 <- (1:500000000) + 4
counts_6 <- (1:500000000) + 5
combined_data <- data.frame(c_1=counts_1, co_2=counts_2, c_3=counts_3,
                            c_4=counts_4, c_5=counts_5, c_6=counts_6)

dim(combined_data)

[1] 500000000 6

h5createFile("myhdf5file.h5")
h5createGroup("myhdf5file.h5", "foo")
h5write(combined_data, "myhdf5file.h5", "foo/A")

Error in h5writeDataset.data.frame(obj, loc$H5Identifier, name, ...) :
  HDF5. Dataset. Unable to initialize object.

grimbough commented 5 years ago

Thanks for the reproducible example. I think this and the issue reported in #32 are related. However, the fix introduced there will not work for data.frames, as the code for writing those is completely separate. Still, it gives me somewhere to look for a solution.

grimbough commented 5 years ago

I think this should now be fixed in rhdf5 version 2.27.7. When writing data.frames, rhdf5 tries to create a dataset with a chunk size equal to the number of rows in the data.frame. However, there's a hard 4GB limit on chunk size, so if the product of the number of rows, the number of columns, and the size of the stored data types exceeds 4GB, the write would crash. I've now modified the code to check for this and cap the chunk size at 4GB when required.
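
For reference, here's a rough back-of-the-envelope version of that arithmetic applied to the example below (a sketch only; it assumes each of the six columns is stored as a 4-byte integer):

rows  <- 500000000
cols  <- 6
bytes <- 4                           ## assumed size of one stored integer

## size of a single chunk spanning every row, in GB
rows * cols * bytes / 2^30           ## ~11.2 GB, well over the 4GB limit

## largest row count that keeps one chunk under 4GB
floor(4 * 2^30 / (cols * bytes))     ## ~179 million rows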

Here's what I get with your example now (it takes quite a while to run!):

library(rhdf5)

counts_1 <- (1:500000000)
counts_2 <- (1:500000000) + 1L
counts_3 <- (1:500000000) + 2L
counts_4 <- (1:500000000) + 3L
counts_5 <- (1:500000000) + 4L
counts_6 <- (1:500000000) + 5L
combined_data <- data.frame(c_1=counts_1, c_2=counts_2, c_3=counts_3, 
                            c_4=counts_4, c_5=counts_5, c_6=counts_6)

## write out data
h5f <- tempfile(fileext = ".h5")
h5createFile(h5f)
h5createGroup(h5f,"foo")
h5write(combined_data, h5f,"foo/A")

## check content
h5ls(h5f)
  group name       otype   dclass       dim
0     /  foo   H5I_GROUP                   
1  /foo    A H5I_DATASET COMPOUND 500000000

However, chunk size can really impact the performance of reading back from disk, so if your use case involves accessing many rows at random, you might want to experiment with setting this yourself rather than relying on the maximum chunk size.

You can do this by providing the chunk argument to h5write(), e.g.

h5write(combined_data, h5f, "foo/A", chunk = 10000)
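
As a rough illustration of why chunking matters for random access, here's a sketch that reads back a scattered set of rows (this assumes h5read()'s index argument can be used to subset this compound dataset in the same way as an ordinary array):

## pick 1000 random row numbers and read only those rows back
set.seed(42)
rows_wanted <- sort(sample(500000000, 1000))
subset_rows <- h5read(h5f, "foo/A", index = list(rows_wanted))
head(subset_rows)

With a small chunk size such as chunk = 10000, each requested row only forces a 10,000-row chunk to be read from disk, whereas with the maximum chunk size every request touches a chunk covering tens of millions of rows.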