grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

Support of long vectors #8

Closed lambdamoses closed 6 years ago

lambdamoses commented 6 years ago

I was trying to use h5read to load a single-cell RNA-seq dataset of 1.3 million cells, stored as a sparse matrix. Because there are so many cells, there are 2.5 billion non-zero values, which makes the value vector a long vector in R. Everything else, like the gene names and barcodes, loaded perfectly fine, but for the long vectors h5read threw the error "the dims contain negative values". This shouldn't be caused by lack of memory; I ran it on an AWS EC2 instance with 160 GB of memory, more than enough for this dataset, and still got the error. I did manage to load the long vectors by reading them piecemeal and concatenating the pieces, but it would be nice to be able to load them in one call. Does rhdf5 support long vectors? If not, I think it would be good to add support, since large datasets like this are becoming more common.
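My guess (not confirmed) is that this is a 32-bit integer overflow: the dataset length, 2,542,672,695, doesn't fit in a signed 32-bit integer, so if the dims pass through a 32-bit int anywhere in the C code, a wrap-around would produce exactly the kind of negative dimension the error complains about. A quick illustration in R:

```r
# .Machine$integer.max is 2^31 - 1; lengths beyond it need R's 64-bit
# "long vector" support and cannot be stored in a signed 32-bit integer.
n <- 2542672695                # length of /mm10/data and /mm10/indices
n > .Machine$integer.max       # TRUE: too big for a 32-bit int

# What a C-style signed 32-bit wrap-around of that length would look like:
wrapped <- n - 2^32
wrapped                        # -1752294601, i.e. a negative dim
```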

grimbough commented 6 years ago

Perhaps you can provide some example code so I can see what approach you're using to read the data? You can certainly run into issues when reading very large vectors, but it's easier to comment once I can see your exact strategy. I assume the full data file you're working with is too large to share, but even a reduced version would be great.

A single-cell dataset with 1.3 million cells sounds a lot like you're using the 10X Genomics mouse brain data. Perhaps I can point you to https://github.com/Bioconductor/TENxBrainData/ and www.msmith.de, where there's a bit of discussion about representing & working with that dataset in Bioconductor.

lambdamoses commented 6 years ago

Here's the code:

library(rhdf5)
# Show the structure of the file
fn <- "../Data/1M_neurons_filtered_gene_bc_matrices_h5.h5"
h5ls(fn)
  group       name       otype  dclass        dim
0     /       mm10   H5I_GROUP
1 /mm10   barcodes H5I_DATASET  STRING    1300774
2 /mm10       data H5I_DATASET INTEGER 2542672695
3 /mm10 gene_names H5I_DATASET  STRING      27998
4 /mm10      genes H5I_DATASET  STRING      27998
5 /mm10    indices H5I_DATASET INTEGER 2542672695
6 /mm10     indptr H5I_DATASET INTEGER    1300775
7 /mm10      shape H5I_DATASET INTEGER          2
# Load data
data <- h5read(fn, "mm10", bit64conversion = "double")

Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  : 
  the dims contain negative values
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  : 
  the dims contain negative values

The data and indices entries came back NULL, so I think the errors came from those two datasets; everything else loaded fine. On an Amazon EC2 instance I used the function below to load the two long vectors piecemeal. It worked, but it was slow: loading the entire thing the way suggested on www.msmith.de did work too, but it took over half an hour (I forget exactly how long). R seemed to use only one core while reading, so I probably should have used mclapply in the piecemeal loading function. In any case, since the Matrix package doesn't support long vectors yet, I can't turn the full dataset into a sparse matrix with sparseMatrix, so for now I'm working with a subset of the data.

h5read_piecemeal <- function(file, name, size_total, size_piece, ...) {
  # Number of chunks needed to cover the whole 1-D dataset
  n_pieces <- ceiling(size_total / size_piece)
  # Index range for each chunk; the last chunk may be shorter
  inds <- lapply(seq_len(n_pieces), function(x) {
    ((x - 1) * size_piece + 1):min(x * size_piece, size_total)
  })
  # h5read's `index` argument takes a list with one element per dimension,
  # so inds[i] (a one-element list) is the right shape for 1-D data
  data_pieces <- lapply(seq_len(n_pieces),
                        function(i) h5read(file, name, index = inds[i], ...))
  Reduce(c, data_pieces)
}
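For what it's worth, the piecemeal reader can be parallelised with parallel::mclapply as I mentioned; this is an untested sketch (the function names are mine, and I'm assuming concurrent read-only access to the same HDF5 file from forked workers is safe):

```r
library(parallel)

# Split 1:size_total into consecutive chunks of at most size_piece elements,
# each wrapped in a one-element list as h5read's `index` argument expects
# for a 1-D dataset.
chunk_indices <- function(size_total, size_piece) {
  lapply(seq_len(ceiling(size_total / size_piece)), function(x) {
    list(((x - 1) * size_piece + 1):min(x * size_piece, size_total))
  })
}

# Hypothetical parallel variant of h5read_piecemeal: each forked worker
# reads one chunk, then the pieces are concatenated in their original order.
h5read_parallel <- function(file, name, size_total, size_piece,
                            cores = detectCores(), ...) {
  inds <- chunk_indices(size_total, size_piece)
  pieces <- mclapply(inds,
                     function(idx) rhdf5::h5read(file, name, index = idx, ...),
                     mc.cores = cores)
  do.call(c, pieces)
}
```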

I have a reduced version, which loads perfectly fine and pretty quickly with a plain h5read call, no piecemeal loading needed, so it won't reproduce the problem; I think the problem really is the size of those two vectors. Here's the link to the reduced dataset: http://cf.10xgenomics.com/samples/cell-exp/1.3.0/1M_neurons/1M_neurons_neuron20k.h5
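For anyone following along: the file stores the matrix in CSC form (data/indices/indptr/shape, with zero-based indices), so once the components are read they can be assembled with Matrix::sparseMatrix. A sketch for the reduced 20k-cell file linked above (the filename is whatever you saved the download as; this only works while the vectors stay under Matrix's non-long-vector limits):

```r
library(Matrix)

fn <- "1M_neurons_neuron20k.h5"        # the reduced dataset linked above
if (file.exists(fn)) {
  data    <- rhdf5::h5read(fn, "mm10/data")     # non-zero counts
  indices <- rhdf5::h5read(fn, "mm10/indices")  # zero-based row (gene) indices
  indptr  <- rhdf5::h5read(fn, "mm10/indptr")   # zero-based column pointers
  shp     <- rhdf5::h5read(fn, "mm10/shape")    # c(n_genes, n_cells)

  mat <- sparseMatrix(i = indices + 1,   # sparseMatrix wants one-based i
                      p = indptr,        # but `p` stays zero-based
                      x = as.numeric(data),
                      dims = shp)
}
```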