Closed Neah-Ko closed 11 months ago
Thanks, that indeed seems like a big oversight. I think there are more places in the code that could benefit from that.
Would you be interested in making a PR?
@flying-sheep For sure, I've just opened one. First time contributor here, I've tried to respect the guidelines, let me know if that looks fine. How can I pass GPU tests and check milestone ?
Remarks and Questions:
I had to change the get_size() call from passing AnnData._X to AnnData.X. AnnData._X can be equal to None for backed datasets; in this case AnnData.X returns the BaseCompressedSparseDataset. You may check this on the same dataset as the issue above.
For now I'm checking and applying special logic when type(AnnData.X) $\in$ {h5py._hl.dataset.Dataset, scipy.sparse._csr.csr_matrix, BaseCompressedSparseDataset}
X?
Should I add some unit tests?
I commented on the PR! Yeah, unit tests would be great, especially since you change _X to X, which surely changes behavior in some cases.
type(AnnData.X) ∈ {...}
You mean isinstance(AnnData.X, (...)). Subclasses are a thing, don’t use type(...) is Y checks! For an answer, see the PR.
Report
I am working with genomics datasets and would like to know the size of a dataset before loading it into memory (to avoid inconsiderately nuking my cluster).
After experimenting, I noticed that #471 brings that feature. However it wasn't working as intended on my data.
To reproduce, download this dataset from cellxgene discover -> Dissection: Cerebellum (CB) - Cerebellar Vermis - CBV -> Download -> .h5ad
Observations
__sizeof__ behaviour
scipy.sparse.issparse and scipy.sparse.csr_matrix behaviors
Current Implementation
The AnnData.__sizeof__() function uses an issparse check, then casts into csr_matrix (realizing the data) in order to compute the size.
From my findings above, I have shown that (at least in some cases) the issparse path is not taken, so the size of sparse matrices is not computed. Also, if it were taken, the function would indeed return the correct size of the sparse matrix. However, the data would subsequently be realized.
Suggested fixes
For the size, we could retrieve the information in a very similar fashion, but avoid realization.
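As a rough sketch of the idea: a compressed sparse matrix (CSR or CSC) is stored as three arrays, data, indices and indptr, so its size can be derived from their shapes and element sizes alone. On-disk array handles such as h5py datasets expose shape and dtype without reading the values. ArrayMeta and compressed_sparse_nbytes below are hypothetical names used only for illustration, not part of anndata or h5py.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ArrayMeta:
    """Hypothetical stand-in for an on-disk array handle: metadata only, no values."""

    shape: tuple
    itemsize: int  # bytes per element, e.g. 4 for float32 or int32

    @property
    def nbytes(self) -> int:
        # Number of elements times bytes per element; nothing is loaded.
        return prod(self.shape) * self.itemsize


def compressed_sparse_nbytes(data, indices, indptr) -> int:
    """Bytes taken by the three component arrays, computed from metadata alone."""
    return data.nbytes + indices.nbytes + indptr.nbytes


# Illustrative numbers: a 1000 x 2000 CSR matrix with 50_000 stored values.
data = ArrayMeta(shape=(50_000,), itemsize=4)     # float32 values
indices = ArrayMeta(shape=(50_000,), itemsize=4)  # int32 column indices
indptr = ArrayMeta(shape=(1_001,), itemsize=4)    # int32 row pointers (n_rows + 1)

size = compressed_sparse_nbytes(data, indices, indptr)
```

The same sum applies to CSC, where indices holds row indices and indptr has n_cols + 1 entries.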
The scipy.sparse.issparse check is equivalent to:
Which never returns true if X is an instance of CSRDataset or CSCDataset. I would suggest that we implement our own issparse that checks against BaseCompressedSparseDataset, which is the parent class.
Let me know if you are interested in the fixes and I will extend a pull request.
Best,
Versions