scverse / anndata

Annotated data.
http://anndata.readthedocs.io
BSD 3-Clause "New" or "Revised" License
575 stars 152 forks source link

scipy.sparse.issparse check is always false in AnnData.__sizeof__() method + csr_matrix() realizes data #1222

Closed Neah-Ko closed 11 months ago

Neah-Ko commented 11 months ago

Please make sure these conditions are met

Report

I am working with genomics datasets and would like to know the size of a dataset before loading it into memory (to avoid inconsiderately nuking my cluster).

After experimenting, I noticed that #471 brings that feature. However it wasn't working as intended on my data.

To reproduce, download this dataset from cellxgene discover -> Dissection: Cerebellum (CB) - Cerebellar Vermis - CBV -> Download -> .h5ad

Observations

__sizeof__ behaviour

>>> import anndata as ad
>>> DS = ad.read_h5ad("local.h5ad")
>>> DS_b = ad.read_h5ad("local.h5ad", backed='r')
>>> DS.__sizeof__(show_stratified=True)
Size of .X     : 1405.42 MB
Size of .obs   : 15.47 MB
Size of .var   : 30.05 MB
Size of .uns   : 0.00 MB
Size of .obsm  : 1.65 MB
1523140787

>>> DS_b.__sizeof__(show_stratified=True)
Size of .X     : 0.00 MB
Size of .obs   : 15.47 MB
Size of .var   : 30.05 MB
Size of .uns   : 0.00 MB
Size of .obsm  : 1.65 MB

scipy.sparse.issparse and scipy.sparse.csr_matrix behaviors

>>> from scipy.sparse import csr_matrix, issparse
>>> issparse(DS_b.X)
False
>>> issparse(DS_b.X._to_backed())
True
>>> type(DS_b.X)
<class 'anndata._core.sparse_dataset.CSRDataset'>
>>> type(DS_b.X._to_backed())
<class 'anndata._core.sparse_dataset.backed_csr_matrix'>

>>> csr_matrix(DS_b.X)
Traceback (most recent call last):
[...]
TypeError: no supported conversion for types: (dtype('O'),)

>>> csr_matrix(DS_b.X._to_backed())
[!! REALIZES the data]
<71874x59357 sparse matrix of type '<class 'numpy.float32'>'
    with 184175367 stored elements in Compressed Sparse Row format>

Current Implementation

The AnnData.__sizeof__() function uses issparse check then casts into csr_matrix (realizing the data) in order to compute the size.

From my above findings, I have shown that (at least on some cases) the issparse path is not explored thus not computing the size of sparse matrices. Also, if it did, then the function would indeed return the correct size of the sparse matrix. However, data would subsequently be realized.

Suggested fixes

For the size we could retrieve the information in a very similar fashion, but avoid realization

>>> csr_mat = DS_b.X._to_backed() # Not realizing data
>>> (csr_mat.data.nbytes + csr_mat.indptr.nbytes + csr_mat.indices.nbytes)/1024**2
1405.4207191467285

The scipy.sparse.issparse check is equivalent to:

isinstance(x, _spbase)

Which never returns true if X is an instance of CSRDataset or CSCDataset. I would suggest that we implement our own issparse that checks against BaseCompressedSparseDataset which is the parent class.

Let me know if you are interested in the fixes and I will extend a pull request.

Best,

Versions

>>> import anndata, session_info; session_info.show(html=False, dependencies=True)
-----
anndata             0.10.3
scipy               1.11.2
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
h5py                3.9.0
natsort             8.4.0
numpy               1.24.4
packaging           23.1
pandas              2.1.0
pytz                2023.3.post1
sitecustomize       NA
six                 1.16.0
-----
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]
Linux-6.1.0-13-amd64-x86_64-with-glibc2.36
-----
Session information updated at 2023-10-31 16:08
flying-sheep commented 11 months ago

Thanks, that indeed seems like a big oversight. I think there’s more places in the code that could benefit from that.

Would you be interested in making a PR?

Neah-Ko commented 11 months ago

@flying-sheep For sure, I've just opened one. First time contributor here, I've tried to respect the guidelines, let me know if that looks fine. How can I pass GPU tests and check milestone ?

Remarks and Questions:

flying-sheep commented 11 months ago

I commented on the PR! Yeah, unit tests would be great, especially since you change _X to X which surely changes behavior in some cases.

type(AnnData.X) ∈ {...}

You mean isinstance(AnnData.X, (...)). Subclasses are a thing, don’t use type(...) is Y checks! For an answer, see the PR.