theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
149 stars 27 forks source link

On disk `layers` in H5AD file #63

Closed GabrielHoffman closed 2 years ago

GabrielHoffman commented 2 years ago

I have an H5AD file that stores both normalized data and raw counts produced by pegasus. I can use zellkonverter to read the default normalized counts as a DelayedMatrix, but the raw counts are imported as a dgCMatrix. How can I use a DelayedMatrix instead?

This follows up on our conversion in #57, but applied to the new H5AD format.

As for the details, I have an H5AD file with the structure:

> h5ls example.h5ad
X                        Group
layers                   Group
obs                      Group
obsm                     Group
obsp                     Group
raw                      Group
uns                      Group
var                      Group
varm                     Group
varp                     Group

where X stores normalized data and layers/raw_new stores the raw counts.

I read the data in using:

sce = readH5AD(file, use_hdf5=TRUE, verbose=TRUE, layers=TRUE) 

class(assay(sce, "X"))
# [1] "DelayedMatrix"
# attr(,"package")
# [1] "DelayedArray"

class(assay(sce, "raw_new"))
# [1] "dgCMatrix"
# attr(,"package")
# [1] "Matrix"

# object.size(assay(sce, "X"))
# 113882552 bytes

object.size(assay(sce, "raw_new"))
# 12781499800 bytes

The raw_new field is a 12Gb dgCMatrix.

I have zellkonverter v1.7.0, Using anndata version 0.8.0

Cheers, Gabriel

lazappi commented 2 years ago

Hi @GabrielHoffman

I think I might know what's going on here but I need to look into it more. Do you have an example file you would be happy to share? A small subset of this data would be perfect.

GabrielHoffman commented 2 years ago

After a lot of digging and working with my colleague, I figured out it's an issue with the way the H5AD was generated with pegasus. Instead of using the standard raw field, using a custom field name causes the raw counts to be saved to layers/customField. The custom field was name raw_new so I didn't realize that it was non-standard.

Here is some code to reproduce the H5AD file with this issue.

import pegasus as pg
import pandas as pd

# wget https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip
data = pg.read_input('MantonBM_nonmix_subset.zarr.zip')
pg.identify_robust_genes(data)

# Transform counts, but retain original in backup_matrix
# The default is raw.X which saves to h5ad_file/raw like we expect
# When backup_matrix is set to a custom value, the result is saved in layers/customField
# This is what causes the problem, since h5ad_file/raw is no longer written
# https://pegasus.readthedocs.io/en/stable/api/pegasus.log_norm.html#pegasus.log_norm
pg.log_norm(data, backup_matrix = 'raw_new')

pg.write_output( data, "out.h5ad")

Based on this, it's not a zellkonverter issue. It can be resolved by using standard field names, since only standard fields are read as a DelayedMatrix.

Agree?

Best, Gabriel

lazappi commented 2 years ago

Thanks! I still need to test things but I think that makes sense. We should actually be able to support DelayedMatrix in any layers item but there was an upstream issue which meant we turned it off. That should be fixed now and #50 is about turning it back on but we didn't get around to it before the last release. Now that it looks like at least some people want to use it it should be more of a priority though.

lcolladotor commented 2 years ago

Hi!

This is just a longer 👀 message ;)

@nick-eagles @abspangler13 and I are going to be using some of the same files Gabriel is using, and so, we will have the same issues Gabriel described. Thank you Gabriel et al for spearheading this and thanks Luke for your support! From our side, Nick is the one who has R and Python experience, and thus has been our in house zellkonverter expert =) Please let us know if we can help in any way.

Best, Leo

lazappi commented 2 years ago

The main thing that would be helpful would be an example .h5ad file where this is an issue (I don't use pegasus so can't run the code above). Apart from that I need to look into turning the DelayedArray support for layers other than X back on.

GabrielHoffman commented 2 years ago

We have worked out the issue on our end, by writing to raw, so readH5AD() works aspected.

Thanks!

lcolladotor commented 2 years ago

Thanks Gabriel, Prashant, @nick-eagles et al for figuring this one out!

Thanks again Luke for the support ^^.