Closed: acmullen-med closed this issue 3 years ago.
Hi @acmullen-med,
There is no size limit in the model. It seems some cells in
ctrl_adata.X = ctrl_adata.X.A
are causing this problem. Have they been normalized together? Is ctrl_adata.X sparse or dense? If it is dense, please pass it as sparse; this looks like a float conversion problem in scipy and nothing related to the model.
It also seems you have integers in there, so maybe cast ctrl_adata.X to float beforehand and try again.
You could also try upgrading scipy.
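A minimal sketch of that suggestion (the integer matrix here is a stand-in for `adata.X`; the names are illustrative, not from the scgen API):

```python
import numpy as np
from scipy import sparse

# Stand-in for adata.X: integer counts, as suspected in the report above.
X = np.random.poisson(1.0, size=(5, 4))

# Ensure the matrix is sparse, then cast to float before model setup.
if not sparse.issparse(X):
    X = sparse.csr_matrix(X)
X = X.astype(np.float32)

print(sparse.issparse(X), X.dtype)  # → True float32
```
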
Hey @M0hammadL, thanks for the response, but I'm still having trouble getting the package to work on larger datasets.
Looking through the package code, I believe ctrl_adata is built from the adata that was passed in during initialization of the model and is sparse, but I can't be sure without altering the package code. The adata.X that goes into the model is a sparse CSR matrix, and adata.X.A is an ndarray of floats. I don't quite understand what you mean by normalizing ctrl_adata.X and ctrl_adata.X.A together; line 155 of _scgen.py seems to set them equal. What should be normalized?
Could you provide some clarity on the difference between ctrl_adata.X and ctrl_adata.X.A?
Even when I try to cast both adata.X and adata.X.A as sparse matrices, I get the same error:
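For reference, a sketch of the distinction with plain scipy (no AnnData; the matrix here is illustrative): `.X` is the sparse matrix itself, while `.X.A` is a dense ndarray produced on the fly, equivalent to `.toarray()`.

```python
import numpy as np
from scipy import sparse

# A small sparse matrix standing in for adata.X (illustration only).
X = sparse.csr_matrix(np.eye(3, dtype=np.float32))

print(type(X).__name__)    # the sparse container itself
print(type(X.A).__name__)  # .A is shorthand for .toarray(): a dense ndarray

# Same values, different containers:
assert np.array_equal(X.A, X.toarray())
```

Because `.A` is computed on access rather than stored, assigning something to `adata.X.A` (as in the session above) does not change how `adata.X` itself is stored, which would explain why the cast appears to have no effect.
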
>>> train = sc.read('/net/trapnell/vol1/home/acmullen/VAEs/data/fishVAEData.h5ad')
>>> adata = sc.AnnData(train)
>>> adata.X = adata.X.tocsr()
>>> adata.X.A = scipy.sparse.csr_matrix(adata.X.A)
>>> train_new = scgen.setup_anndata(adata,copy=True,batch_key="timepoint", labels_key="gene_target")
>>> model = scgen.SCGEN(train_new)
>>> scipy.sparse.issparse(model.adata.X)
True
>>> type(model.adata.X)
<class 'scipy.sparse.csr.csr_matrix'>
>>> type(model.adata.X.A)
<class 'numpy.ndarray'>
>>> model.train(
... max_epochs=100,
... batch_size=32,
... early_stopping=True,
... early_stopping_patience=25
... )
>>> pred, delta = model.predict(
... ctrl_key='18h',
... stim_key='24h',
... celltype_to_predict='tbx16'
... )
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/net/trapnell/vol1/home/acmullen/.local/lib/python3.7/site-packages/scgen/_scgen.py", line 155, in predict
ctrl_adata.X = ctrl_adata.X.A
File "/net/gs/vol3/software/modules-sw-python/3.7.7/scvi-tools/0.10.1/Linux/CentOS7/x86_64/lib/python3.7/site-packages/anndata/_core/anndata.py", line 684, in X
self._adata_ref._X[oidx, vidx] = value
File "/net/gs/vol3/software/modules-sw-python/3.7.7/scvi-tools/0.10.1/Linux/CentOS7/x86_64/lib/python3.7/site-packages/scipy/sparse/_index.py", line 116, in __setitem__
self._set_arrayXarray_sparse(i, j, x)
File "/net/gs/vol3/software/modules-sw-python/3.7.7/scvi-tools/0.10.1/Linux/CentOS7/x86_64/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 808, in _set_arrayXarray_sparse
self._zero_many(*self._swap((row, col)))
File "/net/gs/vol3/software/modules-sw-python/3.7.7/scvi-tools/0.10.1/Linux/CentOS7/x86_64/lib/python3.7/site-packages/scipy/sparse/compressed.py", line 929, in _zero_many
i, j, offsets)
ValueError: could not convert integer scalar
If I try to recreate the ctrl_adata.X assignment using my own code, it does not error:
>>> model.adata.X = model.adata.X.A
>>> model.adata.X
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.]], dtype=float32)
>>>
Additionally, I found the peculiar behavior below when setting up the model. Why would setting up the model recast model.adata.X.A so that it is no longer sparse? Could this be related?
>>> adata = sc.AnnData(train)
>>> adata.X = adata.X.tocsr()
>>> adata.X.A = scipy.sparse.csr_matrix(adata.X.A)
>>>
>>> train_new = scgen.setup_anndata(adata,copy=True,batch_key="timepoint", labels_key="gene_target")
INFO Using batches from adata.obs["timepoint"]
INFO Using labels from adata.obs["gene_target"]
INFO Using data from adata.X
INFO Computing library size prior per batch
INFO Successfully registered anndata object containing 186289 cells, 32031 vars, 2 batches, 3 labels, and 0 proteins.
Also registered 0 extra categorical covariates and 0 extra continuous covariates.
INFO Please do not further modify adata until model is trained.
>>> model = scgen.SCGEN(train_new)
>>>
>>> #Why are these both not sparse???
>>> scipy.sparse.issparse(adata.X.A)
True
>>> scipy.sparse.issparse(model.adata.X.A)
False
When I try to recast model.adata.X.A after initializing and training the model, the original error persists, so maybe that is not the problem.
Let me know if you have any other suggestions or need more information from me.
I have a resolution.
I added the two marked lines (154 and 155) to _scgen.py:
eq = min(ctrl_x.X.shape[0], stim_x.X.shape[0])
cd_ind = np.random.choice(range(ctrl_x.shape[0]), size=eq, replace=False)
stim_ind = np.random.choice(range(stim_x.shape[0]), size=eq, replace=False)
ctrl_adata = ctrl_x[cd_ind, :]
stim_adata = stim_x[stim_ind, :]
ctrl_adata = ctrl_adata.copy()  # added (line 154)
stim_adata = stim_adata.copy()  # added (line 155)
if sparse.issparse(ctrl_adata.X) and sparse.issparse(stim_adata.X):
    ctrl_adata.X = ctrl_adata.X.A
    stim_adata.X = stim_adata.X.A
When subsetting an AnnData object, a view is created instead of a new object, to save memory (see https://anndata.readthedocs.io/en/latest/anndata.AnnData.html). Calling .copy() forces the creation of a new AnnData object instead of a view.
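A sketch of why the two paths differ, using plain scipy (no AnnData; all names here are illustrative). Assigning `.X` on a view routes the write back through the parent's sparse matrix, as in the traceback above (`self._adata_ref._X[oidx, vidx] = value`), whereas after `.copy()` the assignment is a plain rebind with no parent involved:

```python
import numpy as np
from scipy import sparse

# Parent matrix, standing in for the full adata.X (illustration only).
parent = sparse.random(100, 20, density=0.1, format="csr", random_state=0)
rows = np.arange(0, 100, 2)  # rows selected by the view

# View path: "view.X = value" becomes roughly parent[rows, :] = value,
# exercising scipy's sparse __setitem__ (the code path in the traceback).
dense = parent[rows, :].A
parent[rows, :] = sparse.csr_matrix(dense)  # fine here; reported to fail at scale

# Copy path: the subset owns its own matrix, so densifying touches no parent.
child = parent[rows, :].copy()
child = child.A  # just a new ndarray; no write-back into `parent`
```

This is consistent with the fix above: copying before the `ctrl_adata.X = ctrl_adata.X.A` assignment avoids the sparse write-back entirely.
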
I don't know why the downstream bug only emerges once ctrl_adata reaches a large enough size; that could be an issue with scvi-tools or scipy. But altering the package code fixed the issue for me. I would encourage you to add these two lines to the package, or to find an improved solution.
Thanks,
I am trying to replicate the results of the perturbation experiment with some of my own data, but I am running into an error.
I found https://github.com/theislab/anndata/issues/339 and tried it, but got the same result.
I had no issues with training. Strangely, I don't seem to have this issue if I subset the data and predict on only 5% of it, or if I use the sample data provided. Is there a size limit on the number of cells this package will work with?
I am not using conda.
Relevant package versions: scanpy 1.7.2, scipy 1.6.3, numpy 1.20.3, anndata 0.7.6, scgen 2.0.0.
I have tried updating my libraries to the most current versions. Any guidance you could provide would be great.