scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Additional fields in AnnData h5ad and zarr files #1638

Closed: colganwi closed this issue 3 months ago

colganwi commented 3 months ago

I'm working on an extension of the AnnData object called TreeData, which adds two fields, obst and vart, for storing nx.DiGraph trees for the obs and var axes. The primary use case is single-cell lineage tracing experiments, where a tree relates the cells to each other.

The treedata package is very lightweight, since it inherits most of its functionality from anndata. treedata uses the anndata h5ad and zarr file formats, and I would like to extend the anndata readers and writers in this way:

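import anndata as ad
import treedata as td
import h5py
from anndata._io import write_h5ad

filename = "tdata.h5ad"  # example path

# Note: serialize() stands for a DiGraph -> storable value conversion;
# it is not defined in this snippet.

# Writing: store the AnnData fields, then append the two tree fields
tdata = td.TreeData()
write_h5ad(filename, tdata)
with h5py.File(filename, "a") as f:
    f.create_dataset("obst", data=serialize(tdata.obst))
    f.create_dataset("vart", data=serialize(tdata.vart))

# Reading: load the AnnData fields, then the two tree fields.
# [()] copies each dataset into memory before the file is closed.
adata = ad.read_h5ad(filename)
with h5py.File(filename, "r") as f:
    obst = f["obst"][()]
    vart = f["vart"][()]
tdata = td.TreeData(adata, obst=obst, vart=vart)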

This solution ensures compatibility with anndata and minimizes duplicated code. Unfortunately, it is not possible with the current anndata IO implementation, because additional fields are not allowed in h5ad and zarr files. I would like to update the anndata IO functions to allow additional fields (for example, h5ad.py#L245 would only parse expected fields), but before submitting a PR I want to get the developers' thoughts. This change would have no effect on the anndata API or on the structure of anndata h5ad and zarr files, but it would make it easier to extend the file format with additional fields.
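The serialize call in the snippet above is not defined in the thread. As a rough illustration only, and not necessarily how treedata does it, a directed tree could be flattened to a list of (parent, child) edges and stored as a JSON string. The function names and the JSON layout here are assumptions, and node or edge attributes are ignored:

```python
import json


def serialize(edges):
    """Encode a directed tree, given as (parent, child) pairs, as a JSON string."""
    return json.dumps({"edges": [list(edge) for edge in edges]})


def deserialize(text):
    """Decode a JSON string (str or bytes) back into (parent, child) tuples."""
    return [tuple(edge) for edge in json.loads(text)["edges"]]


# Hypothetical example tree: root -> a, root -> b, a -> c
tree_edges = [("root", "a"), ("root", "b"), ("a", "c")]
encoded = serialize(tree_edges)
decoded = deserialize(encoded)
```

With networkx, list(graph.edges) yields such pairs and nx.DiGraph(decoded) rebuilds the graph. json.loads also accepts bytes, which is what h5py typically returns when a stored string is read back.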

ilan-gold commented 3 months ago

@colganwi Before starting out on this path, I wonder if you'd be interested in trying out one of our APIs for reading a bit more cleanly (I noticed you don't use them):

https://anndata.readthedocs.io/en/stable/generated/anndata.experimental.read_elem.html

and

https://anndata.readthedocs.io/en/stable/generated/anndata.experimental.read_dispatched.html

Using these would also make it easier for us to implement this in anndata, if that still makes sense, since we use both of them internally.

colganwi commented 3 months ago

@ilan-gold thanks for suggesting the IO API. Based on some digging, I think the best solution may be for me to reimplement the anndata h5ad and zarr read/write functions using the API. The duplicated code would be fairly minimal:

# Reading
with h5py.File(filename, "r") as f:
    d = {}
    for k in [
        "X",
        "obs",
        "var",
        "obsm",
        "varm",
        "obsp",
        "varp",
        "layers",
        "uns",
        "raw",
        "obst",
        "vart",
    ]:
        if k in f:
            d[k] = ad.experimental.read_elem(f[k])
tdata = td.TreeData(**d)

Does this solution make sense to you? Given anndata's current field constraints, the files could only be read by treedata, but since a TreeData object can be converted to an AnnData object, I don't think this is a big issue.

Is ad.experimental.read_elem the most stable way to load this API? I would like to future proof this implementation as much as possible.
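The reading loop above can be sketched as a small reusable function that reads only the expected keys and ignores anything else in the store. This is a sketch under assumptions: read_known_fields, ANNDATA_KEYS, and TREE_KEYS are hypothetical names, and a plain dict stands in for the open h5py file so the example runs without anndata installed:

```python
from collections.abc import Callable, Iterable, Mapping
from typing import Any

# Field names taken from the loop above; the constant names are hypothetical.
ANNDATA_KEYS = ("X", "obs", "var", "obsm", "varm", "obsp", "varp", "layers", "uns", "raw")
TREE_KEYS = ("obst", "vart")


def read_known_fields(
    store: Mapping[str, Any],
    keys: Iterable[str],
    read_one: Callable[[Any], Any],
) -> dict[str, Any]:
    """Read each expected key that is present; skip missing and unknown keys."""
    return {key: read_one(store[key]) for key in keys if key in store}


# A plain dict stands in for an open h5py.File or zarr group.
fake_store = {"X": [[1, 2], [3, 4]], "obst": "tree", "extra": "ignored"}
fields = read_known_fields(fake_store, ANNDATA_KEYS + TREE_KEYS, read_one=lambda elem: elem)
```

With a real file, store would be the open h5py.File and read_one would be ad.experimental.read_elem, so that td.TreeData(**fields) mirrors the loop above.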

ilan-gold commented 3 months ago

Is ad.experimental.read_elem the most stable way to load this API?

Yes, we are exporting this as a stable API, with a deprecation period on the experimental path, so you'll have time to switch.
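One way to prepare for such a move is to look the function up at a preferred location first and fall back to the experimental one. The helper below is a hypothetical sketch, and the anndata.io module path mentioned afterwards is an assumption about where the stable export lives rather than something stated in this thread. The demonstration uses only the standard library so it runs without anndata installed:

```python
import importlib
import json


def first_available(candidates):
    """Return the first attribute that can be imported from (module, name) pairs."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module is missing in this environment, try the next one
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"none of the candidates could be imported: {candidates}")


# Demonstration with stdlib names only: the first module does not exist,
# so the lookup falls through to json.loads.
loads = first_available([("no_such_module_xyz", "loads"), ("json", "loads")])
```

For anndata, the call might look like first_available([("anndata.io", "read_elem"), ("anndata.experimental", "read_elem")]), with the first entry adjusted to wherever the stable function ends up.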

Does this solution make sense to you?

It does. We are thinking of exporting the list of axes (obsp, uns, etc.) at some point, so stay tuned!

If that's all, feel free to close or open a new issue for a more specific request!