scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License

slicing of backed AnnData object throws error #79

Closed bkmartinjr closed 1 year ago

bkmartinjr commented 5 years ago

Test case:

import scanpy.api as sc
import numpy as np
adata = sc.AnnData(X=np.random.binomial(100, .01, (100, 100)))
adata.obs_names = adata.obs_names.astype(str)

# this works fine
adata[0:2,:][:,0:2]

adata.write("tmp.h5ad")
adata_backed = sc.read("tmp.h5ad", backed="r")

# this throws error
adata_backed[0:2,:][:,0:2]

Traceback of final line:

Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
  File "h5py/_objects.pyx", line 200, in h5py._objects.ObjectID.__dealloc__
KeyError: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1297, in __getitem__
    return self._getitem_view(index)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 1301, in _getitem_view
    return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 664, in __init__
    self._init_as_view(X, oidx, vidx)
  File "/cellxgene/venv/lib/python3.6/site-packages/anndata/base.py", line 689, in _init_as_view
    uns_new = deepcopy(self._adata_ref._uns)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 220, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 274, in _reconstruct
    y = func(*args)
  File "stringsource", line 5, in h5py.h5f.__pyx_unpickle_FileID
  File "h5py/_objects.pyx", line 178, in h5py._objects.ObjectID.__cinit__
TypeError: __cinit__() takes exactly 1 positional argument (0 given)

falexwolf commented 5 years ago

Double slicing in backed mode (taking a view of a view in backed mode) is not allowed and we now throw an error: https://github.com/theislab/anndata/commit/2b622f4518d670c3cddd3a861a1a718d13575c15

Why don't you do adata_backed[0:2, 0:2]?

bkmartinjr commented 5 years ago

We are using boolean slicing to allow for complex filtering, and currently slicing both axes in a single step with non-integer, non-slice selectors is not allowed.

In other words, this throws an error:

obs_selector = np.array([True, False, ...])
vars_selector = np.array([False, True, ...])
adata_backed[obs_selector, vars_selector]

And the error message tells us to try double slicing.

falexwolf commented 5 years ago

OK! Right, double slicing in memory mode works fine and is currently the only (not nice) way to get submatrices from boolean vectors. In backed mode, it's quite a bit trickier.

In any case: if you need this, I'll implement the functionality, maybe even tonight. Then double slicing won't be necessary anymore.
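
As a minimal sketch of the in-memory double slicing described above, here is the same pattern on a plain NumPy array standing in for the data matrix. The array and selector values are illustrative assumptions, not anndata API or data from this issue.

```python
import numpy as np

# Stand-in for an in-memory data matrix: 4 observations x 5 variables.
X = np.arange(20).reshape(4, 5)

# Hypothetical boolean selectors, one per axis.
obs_selector = np.array([True, False, True, False])
var_selector = np.array([False, True, True, False, True])

# Passing both masks at once pairs the selected indices elementwise instead of
# forming a submatrix; with 2 selected rows and 3 selected columns NumPy
# cannot broadcast them together and raises IndexError.
try:
    X[obs_selector, var_selector]
    both_at_once_failed = False
except IndexError:
    both_at_once_failed = True

# Double slicing: select rows first, then columns of the intermediate result.
sub = X[obs_selector, :][:, var_selector]
print(sub.shape)  # (2, 3)
```

The intermediate row selection is materialized before the column selection is applied, which is cheap in memory but is exactly the view-of-a-view step that backed mode rejects.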

bkmartinjr commented 5 years ago

At the moment, we can work around it and don't see a need for you to implement this urgently. I ran into the bug because I was benchmarking to determine optimal ways to use anndata in cellxgene. I think the best path would be for us to ship our "MVP", and then have a chat with you about performance. Backed mode will either be useful, or not, based upon that. Seem reasonable?

falexwolf commented 5 years ago

Sounds very reasonable! Let's discuss!

Meanwhile, I think submatrix extraction via slicing should be relatively straightforward to implement by applying np.ix_() to the data matrix, while everything else stays as is. We discussed this ages ago and you worked quite a bit on the indexing at the time, @flying-sheep: any bandwidth for doing this? It essentially only requires making sure that the index normalization produces non-slices and handles pd.Index objects appropriately.
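
The np.ix_() idea can be sketched roughly as follows for an in-memory NumPy array. The helper names `normalize_index` and `submatrix` are hypothetical and not part of anndata; the sketch handles only positional selectors (slices, boolean masks, integer sequences) and leaves label lookup for pd.Index objects out.

```python
import numpy as np

def normalize_index(index, length):
    """Hypothetical helper: turn a slice, boolean mask, or integer sequence
    into a 1-D array of integer positions along an axis of size `length`."""
    if isinstance(index, slice):
        return np.arange(length)[index]
    index = np.asarray(index)
    if index.dtype == bool:
        if index.shape != (length,):
            raise IndexError("boolean mask length does not match axis length")
        return np.flatnonzero(index)
    return index.astype(np.intp).ravel()

def submatrix(X, obs_index, var_index):
    """Extract the rows-by-columns block in one step via np.ix_."""
    oidx = normalize_index(obs_index, X.shape[0])
    vidx = normalize_index(var_index, X.shape[1])
    # np.ix_ builds an open mesh so the two index arrays select a block
    # rather than being paired elementwise.
    return X[np.ix_(oidx, vidx)]

X = np.arange(20).reshape(4, 5)
obs_selector = np.array([True, False, True, False])
block = submatrix(X, obs_selector, slice(1, 4))
print(block.shape)  # (2, 3)
```

Normalizing every selector to integer positions first means a single code path covers slices, masks, and integer lists, at the cost of materializing the index arrays.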

ivirshup commented 5 years ago

Progress was definitely made here, but I'm not sure this issue is totally solved. Double "fancy" indexing over multiple axes isn't supported by h5py datasets. This does work with backed anndata sparse matrices (at least on master).

Side note: It might be possible for zarr dense arrays via get_orthogonal_selection.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Please add a comment if you want to keep the issue open. Thank you for your contributions!

flying-sheep commented 1 year ago

We throw a meaningful error here and if we ever start supporting it, we’ll announce it.

https://github.com/scverse/anndata/blob/bd47cf9f8df8a6ba745324fb1e760913d99d059b/anndata/tests/test_hdf5_backing.py#L246-L250