scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Read anndata from file-like objects (specifically, in "backed" mode) #1041

Open thetorpedodog opened 1 year ago

thetorpedodog commented 1 year ago

h5py happily reads file-like objects, and anndata partially supports doing so. However, in backed mode anndata only accepts os.PathLike objects. It first opens the input with h5py.File in read_h5ad_backed:

https://github.com/scverse/anndata/blob/a5fc41a09b7ef059860d125d653a777418f6d2be/anndata/_io/h5ad.py#L128

But then passes it to AnnDataFileManager:

https://github.com/scverse/anndata/blob/a5fc41a09b7ef059860d125d653a777418f6d2be/anndata/_core/anndata.py#L397-L398

In the filename setter, AnnDataFileManager tries to extract the path from the PathLike. For ordinary file-like objects, which are not PathLike, opening fails here even though h5py could handle them. Then, in the remainder of the open call, it passes the extracted path to h5py.File, meaning that the h5ad file gets reopened from the filesystem.

https://github.com/scverse/anndata/blob/a5fc41a09b7ef059860d125d653a777418f6d2be/anndata/_core/file_backing.py#L57-L72
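To make the failure mode concrete, here is a minimal stand-in that runs without anndata. `ToyFileManager` and `try_set` are hypothetical names, and the class only mimics the general pattern of a setter that coerces its argument with `pathlib.Path`; it is a sketch of the behavior described above, not a copy of the linked code.

```python
import io
from pathlib import Path


class ToyFileManager:
    """Hypothetical stand-in for a manager whose setter coerces to Path."""

    def __init__(self):
        self._filename = None

    @property
    def filename(self):
        return self._filename

    @filename.setter
    def filename(self, filename):
        # Path() accepts str and os.PathLike only; other objects raise TypeError.
        self._filename = None if filename is None else Path(filename)


def try_set(manager, source):
    """Return True if the setter accepted `source`, False on TypeError."""
    try:
        manager.filename = source
    except TypeError:
        return False
    return True


manager = ToyFileManager()
path_ok = try_set(manager, "data.h5ad")      # a plain path is accepted
buffer_ok = try_set(manager, io.BytesIO())   # a file-like object is rejected
```

The `io.BytesIO` object never reaches any open call in this sketch, which mirrors the report: the rejection happens in the setter, before h5py gets a chance to read from the buffer.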

Ideally, only one h5py.File would ever be created, directly from the passed-in PathLike or file-like object. For starters, it would be nice if the same file-like object were passed to both calls.
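One way to picture the "same object for both calls" idea is a small normalization step that keeps file-like objects untouched and only converts real paths. `normalize_source` is a hypothetical helper written for illustration; the duck-typing check on `read` and `seek` is an assumption about what counts as file-like, not anndata's or h5py's rule.

```python
import io
import os
from pathlib import Path


def normalize_source(source):
    """Hypothetical helper: Path for path-like input, else the object itself."""
    if isinstance(source, (str, os.PathLike)):
        return Path(source)
    # Assume anything with read() and seek() is file-like and keep it as-is,
    # so the very same object can be handed to a single open call.
    if hasattr(source, "read") and hasattr(source, "seek"):
        return source
    raise TypeError(
        f"expected a path or file-like object, got {type(source).__name__}"
    )


buf = io.BytesIO(b"")
same = normalize_source(buf)            # identical object, not a copy
as_path = normalize_source("data.h5ad")  # converted to pathlib.Path
```

Storing the result of such a helper, instead of always coercing to a path, would let a manager reuse whatever the caller passed in.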

We currently use an ugly monkey-patch to accomplish this:

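from typing import ContextManager
from unittest import mock

def _hack_patch_anndata() -> ContextManager[object]:
    from anndata._core import file_backing

    # Build a replacement property whose setter stores the value unchanged,
    # so file-like objects are not coerced to a path.
    @file_backing.AnnDataFileManager.filename.setter
    def filename(self, filename) -> None:
        self._filename = filename

    return mock.patch.object(file_backing.AnnDataFileManager, "filename", filename)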

and then do, for example,

import io

import anndata

h5ad_bytes = io.BytesIO(...)

with _hack_patch_anndata():
    the_data = anndata.read_h5ad(h5ad_bytes)

# do stuff with the_data. this can be outside the with-block.
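For readers unfamiliar with the trick, the same mechanism can be shown on a toy class with only the standard library. `ToyManager` and `permissive_filename` are hypothetical names; the sketch demonstrates that `property.setter` builds a new property around the existing getter and that `mock.patch.object` swaps it in only for the duration of the with-block.

```python
import io
from pathlib import Path
from unittest import mock


class ToyManager:
    """Hypothetical class with a Path-coercing setter (not anndata code)."""

    def __init__(self):
        self._filename = None

    @property
    def filename(self):
        return self._filename

    @filename.setter
    def filename(self, value):
        self._filename = Path(value)


# property.setter returns a *new* property that reuses the original getter
# but uses the permissive setter defined here.
@ToyManager.filename.setter
def permissive_filename(self, value):
    self._filename = value


buf = io.BytesIO()
manager = ToyManager()

with mock.patch.object(ToyManager, "filename", permissive_filename):
    manager.filename = buf  # accepted while the patch is active
    stored_during_patch = manager.filename

# The original property is restored once the with-block exits,
# so the same assignment is rejected again.
try:
    manager.filename = buf
    rejected_after_patch = False
except TypeError:
    rejected_after_patch = True
```

Because the patch only affects assignment, an object whose attribute was set inside the block keeps the stored value afterwards, which is why the data can still be used outside the with-block.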

ivirshup commented 1 year ago

Thanks for opening the issue.

Are you specifically looking to read an object in backed mode, or just read it?

Because you can do:

import io

import anndata as ad
import h5py
import numpy as np

a = ad.AnnData(np.random.randn(1000, 500))

bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    ad.experimental.write_elem(f, "/", a)
    b = ad.experimental.read_elem(f["/"])

thetorpedodog commented 1 year ago

We're specifically looking to read the object and interact extensively with the anndata read from the h5ad file. We use backed mode because we neither need nor want the entire thing read into memory at once. And in any case, I would characterize opening the same h5ad file twice as a (minor) bug.

flying-sheep commented 1 year ago

in any case, I would characterize opening the same h5ad file twice as a (minor) bug.

I agree with this one. There’s definitely a problem here: #719, #522