scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Unable to `write` in backed mode when `X` doesn't exist #715

Closed kaizhang closed 1 year ago

kaizhang commented 2 years ago

Version: 0.8.0rc1

Example:

>>> import anndata as ad
>>> a = ad.AnnData(shape=(10, 20))
>>> a
AnnData object with n_obs × n_vars = 10 × 20
>>> a.write("1.h5ad")
>>> b = ad.read("1.h5ad", backed="r+")
>>> b.write("2.h5ad")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/anndata/_core/anndata.py", line 1919, in write_h5ad
    _write_h5ad(
  File "/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/anndata/_io/h5ad.py", line 85, in write_h5ad
    write_elem(f, "X", adata.X, dataset_kwargs=dataset_kwargs)
  File "/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/anndata/_core/anndata.py", line 612, in X
    X = self.file["X"]
  File "/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/anndata/_core/file_backing.py", line 42, in __getitem__
    return self._file[key]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/kaizhang/data/software/miniconda3/lib/python3.8/site-packages/h5py/_hl/group.py", line 288, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'X' doesn't exist)"
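
The traceback shows the `X` property indexing the backing file without first checking that the key exists. The following is a toy model of that failure mode in plain Python, with a dict standing in for the HDF5 file. All names here (`ToyBacked`, `X_or_none`, `write_toy`) are hypothetical and are not anndata's implementation; the guard is only one possible way to handle a missing `X`.

```python
class ToyBacked:
    """Toy stand-in for a backed object; `store` plays the role of the HDF5 file."""

    def __init__(self, store):
        self.store = store

    @property
    def X(self):
        # Same pattern as in the traceback: index the store unconditionally.
        return self.store["X"]

    @property
    def X_or_none(self):
        # Hypothetical guard: report a missing X as None instead of raising.
        return self.store.get("X")


def write_toy(backed, out):
    """Copy X into `out` only when the backing store actually has one."""
    x = backed.X_or_none
    if x is not None:
        out["X"] = x
    return out


b = ToyBacked({"obs": ["cell0", "cell1"]})  # a store with no "X" entry
try:
    b.X
    raised = False
except KeyError:
    raised = True  # unguarded access fails, as in the report

written = write_toy(b, {})  # guarded write skips X without raising
```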

ivirshup commented 2 years ago

Thanks for the report and PR! I can reproduce the bug.

I'm a little conflicted on the right solution here though. When there isn't a value in X, having a backed anndata object doesn't really do anything.

So should you even be able to create a backed object with no X? What would you want to do with it?

kaizhang commented 2 years ago

I did this in my SnapATAC2 package, where, for performance reasons, I create an anndata object without X using foreign code. The "empty" anndata stores base-resolution Tn5 insertions in obsm. (BTW, I want obsm to be stored on disk, as I don't want to load the insertion counts into memory every time. But currently this is not possible even with the new file spec.)

We want the anndata to always be in backed mode, as there is not much performance gain from storing X in memory, at least for ATAC-seq analysis. Storing X in memory unnecessarily uses a lot of resources for large datasets. We routinely analyze >1M cells.

ivirshup commented 2 years ago

Hearing about what you'd like to do is quite helpful, as we're figuring out next steps for our out-of-core support.


> But currently this is not possible even with the new file spec.

If the issue is just wanting a few of the fields loaded at once, I think it's a little possible. I would just grab the fields you want with read_elem for now. E.g.:

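import anndata as ad
from anndata.experimental import read_elem  # import path as of anndata 0.8

# f is an open h5py.File (or h5py.Group) for the .h5ad file
adata = ad.AnnData(
    obs=read_elem(f["obs"]),
    obsm={"X_pca": read_elem(f["obsm/X_pca"])},
    shape=f["X"].shape,  # assumes X is stored as a dense dataset
)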

Still trying to figure out the user facing API for this though.

kaizhang commented 2 years ago

I implemented an experimental Rust port of anndata which operates in an out-of-core fashion. Individual elements can be put into memory by enabling cache. See an example here: https://github.com/kaizhang/anndata-rs.

I use this to analyze multiple anndata files simultaneously and create an aggregated AnnDataSet (similar to AnnDataCollection, but stored entirely on disk, with lazy access to the underlying anndata files). Example: https://kzhang.org/SnapATAC2/tutorials/integration.html

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

ivirshup commented 1 year ago

I'm going to close this as not planned.

I think we're going to go with a different approach to backed data where everything can be backed. This won't be built on the existing backed mode.