Read alternative data with AnnData2SCE

GabrielHoffman commented 2 years ago

Thanks for developing this package, it's been super useful for handling large datasets in R.

I have an H5AD file where the X slot stores normalized counts while raw counts are stored in raw/X. I would like to use readH5AD() to read in raw/X.

However, it looks like X is hard-coded in AnnData2SCE: https://github.com/theislab/zellkonverter/blob/5e928bfa9b205ab1d507fc3893123394a2769f97/R/konverter.R#L106

It seems easy enough to add an argument to make this more flexible.

Is it more complicated than that? If I make the change would you want to incorporate it into the main branch?

From a user perspective, I had thought that the X_name argument would do this, but it names the assay rather than specifying where the data is.
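
To illustrate the distinction, here is a minimal sketch using the krumsiek11.h5ad example file bundled with zellkonverter. The label "normalized" is an arbitrary choice for illustration. Both calls read the same X matrix; only the name of the resulting assay differs.

```r
library(zellkonverter)
library(SummarizedExperiment)

# Example H5AD file shipped with the zellkonverter package
h5ad_file <- system.file("extdata", "krumsiek11.h5ad", package = "zellkonverter")

# Default call: the X matrix becomes the first assay
sce_default <- readH5AD(h5ad_file)

# X_name only relabels that assay ("normalized" is an arbitrary label);
# it does not change which matrix in the file is read
sce_renamed <- readH5AD(h5ad_file, X_name = "normalized")

print(assayNames(sce_default))
print(assayNames(sce_renamed))

# The values are expected to match, since both come from X
same_values <- isTRUE(all.equal(
    as.matrix(assay(sce_default, 1)),
    as.matrix(assay(sce_renamed, "normalized"))
))
print(same_values)
```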

Cheers, Gabriel

GabrielHoffman commented 2 years ago

I implemented a user-defined target argument that seems to work well. I'm doing some more testing, and I can push when it's finished.

LTLA commented 2 years ago

The general idea sounds sensible to me. However, I'm curious whether the raw things are part of the H5AD standard or not, because that has implications for a few things beyond zellkonverter. For example, the file-backed H5ADMatrix classes assume that the matrix is either X or layers/*, and they may refuse to play ball if the matrix is somewhere else.

GabrielHoffman commented 2 years ago

Hi Aaron,

1) Is raw a "standard" attribute in H5AD files?

It looks like the raw field is supported by anndata here, but I don't have much Python experience.
Pegasus uses the raw/X field to store unnormalized raw counts, and uses X to store normalized counts. See example. This is how I ran into this issue: my colleague processed our single-cell RNA-seq data with Pegasus and I'm trying to do some downstream analysis using the raw counts.

Given the wide adoption of Pegasus and its use of raw/X in H5AD, it seems important to support this field for downstream analyses.

2) Support of raw/X field by other tools: Thanks for pointing out that H5ADMatrix only supports X or layers/*. Wider support of raw/X is certainly important.

However, zellkonverter::readH5AD() doesn't depend on this class, so it doesn't prevent an isolated improvement. I implemented a new argument zellkonverter::readH5AD(...,target="X") that is passed to AnnData2SCE().

Here is my fork with the change to support other paths.

Best, Gabriel

lazappi commented 2 years ago

Thanks for the suggestion and the code! Issue #53 is also about supporting the raw slot and I'm hoping to squeeze this into the next release.

lazappi commented 2 years ago

You should be able to do this now with the devel version, you will just need to set raw = TRUE in readH5AD(). This will add an altExp called "raw" to the returned SingleCellExperiment object. See ?altExps for details about how to use alternative experiments.
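
As a rough sketch of that workflow: the helper name read_raw_counts and the file path in the usage note are hypothetical, and the code assumes the file actually contains a raw group. The raw matrix is taken as the first assay of the "raw" altExp rather than by name, since the assay name is not stated in this thread.

```r
library(zellkonverter)
library(SingleCellExperiment)

# Hypothetical helper: read an H5AD file with its raw slot and
# return the raw matrix from the "raw" alternative experiment
read_raw_counts <- function(h5ad_file) {
    # raw = TRUE asks readH5AD() to convert the raw slot as well
    sce <- readH5AD(h5ad_file, raw = TRUE)

    # Fail early if the file had no raw slot to convert
    if (!"raw" %in% altExpNames(sce)) {
        stop("No 'raw' altExp found; does the file contain a raw group?")
    }

    raw_sce <- altExp(sce, "raw")

    # First assay of the altExp; check assayNames(raw_sce) for its name
    assay(raw_sce, 1)
}
```

Usage would then look like raw_counts <- read_raw_counts("my_data.h5ad"), where my_data.h5ad stands in for your own file. The main experiment still holds the normalized X matrix, so both versions of the data remain available from the same object.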

GabrielHoffman commented 2 years ago

@lazappi Thanks for this fix. It works great!

Read alternative data with AnnData2SCE #57