theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
145 stars 27 forks source link

Support for Anndata SparseDataset? #37

Closed milanmlft closed 3 years ago

milanmlft commented 3 years ago

I'm having some trouble reading in a dataset for which the X matrix was stored as anndata._core.sparse_dataset.SparseDataset. On the Python side, I can read it as HDF5-backed and everything works fine, but when I try to read it into R with zellkonverter::readH5AD() I get

Warning message:
 In .extract_or_skip_assay(skip_assays = skip_assays, hdf5_backed = hdf5_backed,  :
   'X' matrix does not support transposition and has been skipped

Which then results in an empty sparse matrix. After some digging I noticed that it's due to the anndata._core.sparse_dataset.SparseDataset not being recognized as HDF5-format (though it is, I think?) and .extract_or_skip_assay() trying to transpose it manually.

I was wondering whether this is intentionally not supported and if so, whether it will be at some point?

I have a workaround where I load the data on the Python side and then write it back out with anndata.write_h5ad(file, force_dense = True) but I was wondering whether there is a more elegant solution?

If necessary I can provide a small reproducible example.

Thanks a lot!

LTLA commented 3 years ago

This thing isn't a "HDF5 dataset" in the standard sense. It's an AnnData-custom structure, akin to how 10x stores their sparse matrices in their HDF5 files (but transposed). I suppose we could handle it via HDF5Array::TENxMatrix and transpose the output, assuming we can pull the relevant paths out.

Workaround is just to use use_hdf5=FALSE for the time being.

milanmlft commented 3 years ago

Alright, thanks for the feedback! It would be useful to have that support I think because it seems that AnnData encourages the use of this custom format in their write documentation.

Workaround is just to use use_hdf5=FALSE for the time being.

I have 2 1 problems with this:

  1. The actual data I'm working with is over 1 million cells so even loading it as a sparse matrix will probably take up too much memory to do anything useful with (on my laptop). 2. More importantly, when I do try to load it with use_hdf5=FALSE, I actually get the exact same warning and end up with an empty sparse X matrix. I guess because .extract_or_skip_assay() is still running into the same problem of trying to transpose a non-supported object?
LTLA commented 3 years ago

In-memory sparse reading works fine for me:

library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")

library(zellkonverter)
writeH5AD(out, file="stuff.h5")

out <- readH5AD("stuff.h5") # defaults to use_hdf5=FALSE

library(SummarizedExperiment)
class(assay(out))
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Matrix::nnzero(assay(out))
## [1] 11349080

Also 1.1.3.

milanmlft commented 3 years ago

OK, I see that I made a stupid typo in my example 🤦‍♂️ I used krumsiek twice instead of using krumsiek2 in the second case. It does indeed work fine when loading in-memory. I'll edit my comment to remove that example. Sorry for the confusion!

But I guess my first issue still stands.

LTLA commented 3 years ago

This can now be done, thanks to Bioconductor/HDF5Array#34. To illustrate:

library(scRNAseq)
out <- ZeiselBrainData()
counts(out) <- as(counts(out), "dgCMatrix")

library(zellkonverter)
writeH5AD(out, file="stuff.h5")

out <- H5ADMatrix("stuff.h5", "X")
out
## <20006 x 3005> sparse matrix of class H5ADMatrix and type "double":
##             [,1]    [,2]    [,3] ... [,3004] [,3005]
##     [1,]       0       0       0   .       0       1
##     [2,]       3       1       0   .       0       1
##     [3,]       3       1       6   .       0       0
##     [4,]       0       0       0   .       0       0
##     [5,]       1       1       1   .       0       0
##      ...       .       .       .   .       .       .
## [20002,]      65      97      64   .      29     120
## [20003,]      75     111      95   .      20     143
## [20004,]     158     326     209   .      36     359
## [20005,]      31      88      97   .      12      52
## [20006,]      13      14       9   .       3      13

Support should be as easy as slapping in:

} else if (hdf5_backed && any(grepl("^anndata\\..*\\.SparseDataset", class(mat)))) {
    # It's H5AD's custom format, so let's handle it with the H5ADMatrix object.
    file <- as.character(mat$file$id$name)
    name <- as.character(mat$name)
    mat <- HDF5Array::H5ADMatrix(file, name)
} else {

... inside .extract_or_skip_assay().

lazappi commented 3 years ago

Nice! Added to the list.

milanmlft commented 3 years ago

Support should be as easy as slapping in:

} else if (hdf5_backed && any(grepl("^anndata\\..*\\.SparseDataset", class(mat)))) {
    # It's H5AD's custom format, so let's handle it with the H5ADMatrix object.
    file <- as.character(mat$file$id$name)
    name <- as.character(mat$name)
    mat <- HDF5Array::H5ADMatrix(file, name)
} else {

... inside .extract_or_skip_assay().

Just tested this on a 1.3M dataset and it works perfectly. Thanks a lot!

hpages commented 3 years ago

Also no need to make a special case for grepl("^h5py\\..*\\.Dataset", class(mat))). HDF5Array::H5ADMatrix() should work with any .h5ad file (with a plain or sparse matrix).