grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Support of long vectors #8

Closed lambdamoses closed 6 years ago

lambdamoses commented 6 years ago

I was trying to use h5read to load a single-cell RNA-seq dataset of 1.3 million cells, stored as a sparse matrix. Because there are so many cells, there are 2.5 billion non-zero values, which makes the value vector a long vector in R. Everything else, like the gene names and barcodes, loaded perfectly fine, but for the long vectors h5read threw the error "the dims contain negative values". This shouldn't be caused by lack of memory; I ran it on an AWS EC2 instance with 160 GB of memory, more than enough for this dataset, and still got the error. I did manage to load the long vectors by reading them piecemeal and concatenating the pieces, but it would be nice to be able to load them in one call. Does rhdf5 support long vectors? If not, I think it would be good to add support, since large datasets like this are becoming more common.
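My guess (not confirmed) is that this is a 32-bit integer overflow: the dataset length, 2,542,672,695, doesn't fit in a signed 32-bit integer, so if the dims pass through a 32-bit int anywhere in the C code, a wrap-around would produce exactly the kind of negative dimension the error complains about. A quick illustration in R:

```r
# .Machine$integer.max is 2^31 - 1; lengths beyond it need R's 64-bit
# "long vector" support and cannot be stored in a signed 32-bit integer.
n <- 2542672695                # length of /mm10/data and /mm10/indices
n > .Machine$integer.max       # TRUE: too big for a 32-bit int

# What a C-style signed 32-bit wrap-around of that length would look like:
wrapped <- n - 2^32
wrapped                        # -1752294601, i.e. a negative dim
```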

grimbough commented 6 years ago

Perhaps you can provide some example code so I can see what approach you're using to read the data? You can certainly run into issues when reading very large vectors, but it's easier to comment once I can see your exact strategy. I assume the full data file you're working with is too large to share, but even a reduced version would be great.

A single-cell dataset with 1.3 million cells sounds a lot like you're using the 10X Genomics mouse brain data. Perhaps I can point you to https://github.com/Bioconductor/TENxBrainData/ and www.msmith.de, where there's a bit of discussion about representing & working with that dataset in Bioconductor.

lambdamoses commented 6 years ago

Here's the code:

library(rhdf5)
# Show the structure of the file
fn <- "../Data/1M_neurons_filtered_gene_bc_matrices_h5.h5"
h5ls(fn)
  group       name       otype  dclass        dim
0     /       mm10   H5I_GROUP
1 /mm10   barcodes H5I_DATASET  STRING    1300774
2 /mm10       data H5I_DATASET INTEGER 2542672695
3 /mm10 gene_names H5I_DATASET  STRING      27998
4 /mm10      genes H5I_DATASET  STRING      27998
5 /mm10    indices H5I_DATASET INTEGER 2542672695
6 /mm10     indptr H5I_DATASET INTEGER    1300775
7 /mm10      shape H5I_DATASET INTEGER          2
# Load data
data <- h5read(fn, "mm10", bit64conversion = "double")

Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  : 
  the dims contain negative values
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  : 
  the dims contain negative values

The data and indices entries came back NULL, so I think the errors came from those two datasets; everything else loaded fine. On an Amazon EC2 instance I used the function below to load the two long vectors piecemeal. It worked, but it was slow: loading the entire thing the way suggested on www.msmith.de did work too, but it took over half an hour (I forget exactly how long). R seemed to use only one core while reading, so I probably should have used mclapply in the piecemeal loading function. In any case, since the Matrix package doesn't support long vectors yet, I can't turn the full dataset into a sparse matrix with sparseMatrix, so for now I'm working with a subset of the data.

h5read_piecemeal <- function(file, name, size_total, size_piece, ...) {
  # Number of chunks needed to cover the whole 1-D dataset
  n_pieces <- ceiling(size_total / size_piece)
  # Index range for each chunk; the last chunk may be shorter
  inds <- lapply(seq_len(n_pieces), function(x) {
    ((x - 1) * size_piece + 1):min(x * size_piece, size_total)
  })
  # h5read's `index` argument takes a list with one element per dimension,
  # so inds[i] (a one-element list) is the right shape for 1-D data
  data_pieces <- lapply(seq_len(n_pieces),
                        function(i) h5read(file, name, index = inds[i], ...))
  Reduce(c, data_pieces)
}
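For what it's worth, the piecemeal reader can be parallelised with parallel::mclapply as I mentioned; this is an untested sketch (the function names are mine, and I'm assuming concurrent read-only access to the same HDF5 file from forked workers is safe):

```r
library(parallel)

# Split 1:size_total into consecutive chunks of at most size_piece elements,
# each wrapped in a one-element list as h5read's `index` argument expects
# for a 1-D dataset.
chunk_indices <- function(size_total, size_piece) {
  lapply(seq_len(ceiling(size_total / size_piece)), function(x) {
    list(((x - 1) * size_piece + 1):min(x * size_piece, size_total))
  })
}

# Hypothetical parallel variant of h5read_piecemeal: each forked worker
# reads one chunk, then the pieces are concatenated in their original order.
h5read_parallel <- function(file, name, size_total, size_piece,
                            cores = detectCores(), ...) {
  inds <- chunk_indices(size_total, size_piece)
  pieces <- mclapply(inds,
                     function(idx) rhdf5::h5read(file, name, index = idx, ...),
                     mc.cores = cores)
  do.call(c, pieces)
}
```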

I have a reduced version, which loads perfectly fine and pretty quickly with a plain h5read call, no piecemeal loading needed, so it won't reproduce the problem; I think the problem really is the size of those two vectors. Here's the link to the reduced dataset: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_neuron20k.h5
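For anyone following along: the file stores the matrix in CSC form (data/indices/indptr/shape, with zero-based indices), so once the components are read they can be assembled with Matrix::sparseMatrix. A sketch for the reduced 20k-cell file linked above (the filename is whatever you saved the download as; this only works while the vectors stay under Matrix's non-long-vector limits):

```r
library(Matrix)

fn <- "1M_neurons_neuron20k.h5"        # the reduced dataset linked above
if (file.exists(fn)) {
  data    <- rhdf5::h5read(fn, "mm10/data")     # non-zero counts
  indices <- rhdf5::h5read(fn, "mm10/indices")  # zero-based row (gene) indices
  indptr  <- rhdf5::h5read(fn, "mm10/indptr")   # zero-based column pointers
  shp     <- rhdf5::h5read(fn, "mm10/shape")    # c(n_genes, n_cells)

  mat <- sparseMatrix(i = indices + 1,   # sparseMatrix wants one-based i
                      p = indptr,        # but `p` stays zero-based
                      x = as.numeric(data),
                      dims = shp)
}
```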