grimbough / rhdf5

Package providing an interface between HDF5 and R
http://bioconductor.org/packages/rhdf5

h5writeDataset to allow for sparse matrix as input #33

Open allisonvuong opened 5 years ago

allisonvuong commented 5 years ago

Hi,

Many Bioconductor packages store single-cell RNA-seq data in memory as sparse matrices. Currently, rhdf5::h5writeDataset does not seem to support a sparse matrix as input. For smaller matrices, I can simply coerce my dgCMatrix into a dense matrix and pass this to h5write, but for extremely large matrices I cannot, because I run out of memory. Instead, I am bringing a subset of the sparse matrix into memory, coercing it into a dense matrix, and then calling h5writeDataset on the corresponding hyperslab.

Is it possible to support sparse matrices as input?

Best, Allison
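
For readers following along, the chunked workaround described in this comment can be sketched roughly as below. This is only an illustration under assumptions: the toy matrix, the dataset name "counts", the block width and the temporary file are all made up, and the original poster's actual code is not shown in the thread.

```r
library(rhdf5)
library(Matrix)

# Toy sparse matrix standing in for a large single-cell count matrix (assumption).
set.seed(1)
x <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)

path <- tempfile(fileext = ".h5")   # hypothetical output file
h5createFile(path)

# Pre-allocate the full dense dataset on disk, chunked by blocks of columns.
block_size <- 10L                   # hypothetical block width
h5createDataset(path, "counts", dims = dim(x),
                storage.mode = "double",
                chunk = c(nrow(x), block_size))

# Densify and write one block of columns (one hyperslab) at a time,
# so only block_size dense columns are held in memory at once.
for (s in seq(1L, ncol(x), by = block_size)) {
    cols <- s:min(s + block_size - 1L, ncol(x))
    block <- as.matrix(x[, cols, drop = FALSE])
    h5write(block, file = path, name = "counts",
            index = list(NULL, cols))   # NULL selects all rows
}
h5closeAll()
```

The result on disk is still a dense dataset, so this only avoids the memory peak, not the larger file size.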

grimbough commented 5 years ago

Thanks for the suggestion. I'll look into it; it certainly seems like it would be useful. I haven't looked, but does https://github.com/Bioconductor/HDF5Array provide any support for writing sparse matrices?

allisonvuong commented 5 years ago

Yes, I think so. HDF5Array::writeHDF5Array() seems to call BLOCK_write_to_sink in DelayedArray which looks to be performing iterative conversion. See: here
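
A minimal sketch of that route, assuming the sparse matrix is first wrapped in a DelayedArray so that it is written block by block. The toy matrix, file path and dataset name "counts" are illustrative assumptions, not taken from the linked code.

```r
library(Matrix)
library(DelayedArray)
library(HDF5Array)

# Toy sparse matrix (assumption); real data would be much larger.
set.seed(1)
x <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)

path <- tempfile(fileext = ".h5")   # hypothetical output file

# Wrap the dgCMatrix as a DelayedArray and realize it on disk.
# writeHDF5Array() returns an HDF5Array object pointing at the new dataset.
h5 <- writeHDF5Array(DelayedArray(x), filepath = path, name = "counts")
dim(h5)
```

Note that the dataset written this way is stored as a conventional dense HDF5 dataset, so the sparsity only helps on the in-memory side.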

mathewchamberlain commented 5 years ago

Hi Allison,

This is already implemented in DropletUtils:

#' @importFrom rhdf5 h5createFile h5createGroup h5write
#' @importFrom methods as
#' @importClassesFrom Matrix dgCMatrix
.write_hdf5 <- function(path, genome, x, barcodes, gene.id, gene.symbol, gene.type, version="3") {
    h5createFile(path)

    if (version=="3") {
        group <- "matrix"
    } else {
        group <- genome
    }
    h5createGroup(path, group)

    h5write(barcodes, file=path, name=paste0(group, "/barcodes"))

    # Saving feature information.
    if (version=="3") {
        h5createGroup(path, file.path(group, "features"))
        h5write(gene.id, file=path, name=paste0(group, "/features/id"))
        h5write(gene.symbol, file=path, name=paste0(group, "/features/name"))

        h5write(rep(gene.type, length.out=length(gene.id)),
            file=path, name=paste0(group, "/features/feature_type"))

        h5write(rep(genome, length.out=length(gene.id)),
            file=path, name=paste0(group, "/features/genome"))

    } else {
        h5write(gene.id, file=path, name=paste0(group, "/genes"))
        h5write(gene.symbol, file=path, name=paste0(group, "/gene_names"))
    }

    # Saving matrix information.
    x <- as(x, "dgCMatrix")
    h5write(x@x, file=path, name=paste0(group, "/data"))
    h5write(dim(x), file=path, name=paste0(group, "/shape"))
    h5write(x@i, file=path, name=paste0(group, "/indices")) # already zero-indexed.
    h5write(x@p, file=path, name=paste0(group, "/indptr"))

    return(NULL)
}

-Mat
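
To illustrate the layout the function above writes (data, indices, indptr and shape, i.e. the compressed sparse column slots of a dgCMatrix), here is a rough round-trip sketch. It writes only the matrix components into a "matrix" group and rebuilds the sparse matrix with Matrix::sparseMatrix. The toy matrix and file path are assumptions for illustration, and this is not how DropletUtils itself reads the file back.

```r
library(rhdf5)
library(Matrix)

# Toy dgCMatrix (assumption).
set.seed(1)
x <- rsparsematrix(nrow = 100, ncol = 20, density = 0.1)

path <- tempfile(fileext = ".h5")   # hypothetical output file
h5createFile(path)
h5createGroup(path, "matrix")

# Write the three CSC slots plus the dimensions; no dense copy is made.
h5write(x@x, file = path, name = "matrix/data")
h5write(dim(x), file = path, name = "matrix/shape")
h5write(x@i, file = path, name = "matrix/indices")   # zero-based row indices
h5write(x@p, file = path, name = "matrix/indptr")    # column pointers

# Read the components back and rebuild the sparse matrix.
# index1 = FALSE tells sparseMatrix() that the row indices are zero-based.
y <- sparseMatrix(
    i = as.integer(h5read(path, "matrix/indices")),
    p = as.integer(h5read(path, "matrix/indptr")),
    x = as.numeric(h5read(path, "matrix/data")),
    dims = as.integer(h5read(path, "matrix/shape")),
    index1 = FALSE
)
h5closeAll()
```

Because only the non-zero entries and their indices are stored, both memory use and file size scale with the number of non-zeros rather than with the full matrix dimensions.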