scipy.sparse.issparse check is always false in AnnData.__sizeof__() method + csr_matrix() realizes data

Neah-Ko commented 11 months ago

Please make sure these conditions are met

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of anndata.
[X] (optional) I have confirmed this bug exists on the master branch of anndata.

Report

I am working with genomics datasets and would like to know the size of a dataset before loading it into memory (to avoid inconsiderately nuking my cluster).

After experimenting, I noticed that #471 brings that feature. However it wasn't working as intended on my data.

To reproduce, download this dataset from cellxgene discover -> Dissection: Cerebellum (CB) - Cerebellar Vermis - CBV -> Download -> .h5ad

Observations

sizeof behaviour

>>> import anndata as ad
>>> DS = ad.read_h5ad("local.h5ad")
>>> DS_b = ad.read_h5ad("local.h5ad", backed='r')
>>> DS.__sizeof__(show_stratified=True)
Size of .X     : 1405.42 MB
Size of .obs   : 15.47 MB
Size of .var   : 30.05 MB
Size of .uns   : 0.00 MB
Size of .obsm  : 1.65 MB
1523140787

>>> DS_b.__sizeof__(show_stratified=True)
Size of .X     : 0.00 MB
Size of .obs   : 15.47 MB
Size of .var   : 30.05 MB
Size of .uns   : 0.00 MB
Size of .obsm  : 1.65 MB

scipy.sparse.issparse and scipy.sparse.csr_matrix behaviors

>>> from scipy.sparse import csr_matrix, issparse
>>> issparse(DS_b.X)
False
>>> issparse(DS_b.X._to_backed())
True
>>> type(DS_b.X)
<class 'anndata._core.sparse_dataset.CSRDataset'>
>>> type(DS_b.X._to_backed())
<class 'anndata._core.sparse_dataset.backed_csr_matrix'>

>>> csr_matrix(DS_b.X)
Traceback (most recent call last):
[...]
TypeError: no supported conversion for types: (dtype('O'),)

>>> csr_matrix(DS_b.X._to_backed())
[!! REALIZES the data]
<71874x59357 sparse matrix of type '<class 'numpy.float32'>'
    with 184175367 stored elements in Compressed Sparse Row format>

Current Implementation

The AnnData.__sizeof__() function uses issparse check then casts into csr_matrix (realizing the data) in order to compute the size.

From my above findings, I have shown that (at least on some cases) the issparse path is not explored thus not computing the size of sparse matrices. Also, if it did, then the function would indeed return the correct size of the sparse matrix. However, data would subsequently be realized.

Suggested fixes

For the size we could retrieve the information in a very similar fashion, but avoid realization

>>> csr_mat = DS_b.X._to_backed() # Not realizing data
>>> (csr_mat.data.nbytes + csr_mat.indptr.nbytes + csr_mat.indices.nbytes)/1024**2
1405.4207191467285

The scipy.sparse.issparse check is equivalent to:

isinstance(x, _spbase)

Which never returns true if X is an instance of CSRDataset or CSCDataset. I would suggest that we implement our own issparse that checks against BaseCompressedSparseDataset which is the parent class.

Let me know if you are interested in the fixes and I will extend a pull request.

Best,

Versions

>>> import anndata, session_info; session_info.show(html=False, dependencies=True)
-----
anndata             0.10.3
scipy               1.11.2
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
h5py                3.9.0
natsort             8.4.0
numpy               1.24.4
packaging           23.1
pandas              2.1.0
pytz                2023.3.post1
sitecustomize       NA
six                 1.16.0
-----
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]
Linux-6.1.0-13-amd64-x86_64-with-glibc2.36
-----
Session information updated at 2023-10-31 16:08

flying-sheep commented 11 months ago

Thanks, that indeed seems like a big oversight. I think there’s more places in the code that could benefit from that.

Would you be interested in making a PR?

Neah-Ko commented 11 months ago

@flying-sheep For sure, I've just opened one. First time contributor here, I've tried to respect the guidelines, let me know if that looks fine. How can I pass GPU tests and check milestone ?

Remarks and Questions:

I had to change the get_size() call from passing AnnData._X to AnnData.X
- AnnData._X can be equal to None for backed datasets, in this case AnnData.X returns the BaseCompressedSparseDataset You may check this on the same dataset as the issue above.
For now I'm checking and applying special logic when type(AnnData.X) $\in$ {h5py._hl.dataset.Dataset, scipy.sparse._csr.csr_matrix, BaseCompressedSparseDataset}
- Those are representative of the datasets I have downloaded
- Could there be other classes to handle from X ?
Should I add some unit tests ?

flying-sheep commented 11 months ago

I commented on the PR! Yeah, unit tests would be great, especially since you change _X to X which surely changes behavior in some cases.

type(AnnData.X) ∈ {...}

You mean isinstance(AnnData.X, (...)). Subclasses are a thing, don’t use type(...) is Y checks! For an answer, see the PR.

scverse / anndata