Open maximilianh opened 5 years ago
The weirdest thing is that if I write this adata object to an h5ad file with adata.write("temp.h5ad"), load it from there and run the same command, it works.
I wonder if this indicates some issue with the .obs object or some version issue...
Even something simple doesn't work anymore, without going through h5ad:
adata = adata[adata.obs['n_genes'] < up_thrsh_genes, :]
Traceback (most recent call last):
File "/cluster/home/max/projects/czi/cellBrowser/src/cbScanpy", line 11, in <module>
cellbrowser.cbScanpyCli()
File "/cluster/home/max/projects/czi/cellBrowser/src/cbPyLib/cellbrowser/cellbrowser.py", line 4655, in cbScanpyCli
adata, params = cbScanpy(matrixFname, metaFname, inCluster, confFname, figDir, logFname)
File "/cluster/home/max/projects/czi/cellBrowser/src/cbPyLib/cellbrowser/cellbrowser.py", line 4353, in cbScanpy
adata = adata[adata.obs['n_genes'] < up_thrsh_genes, :]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1224, in __getitem__
return self._getitem_view(index)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1228, in _getitem_view
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 557, in __init__
self._init_as_view(X, oidx, vidx)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 629, in _init_as_view
self._raw = adata_ref.raw[oidx]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 333, in __getitem__
oidx, vidx = self._normalize_indices(index)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 361, in _normalize_indices
obs = _normalize_index(obs, self._adata.obs_names)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 160, in _normalize_index
positions = positions[index]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 911, in __getitem__
return self._get_with(key)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 946, in _get_with
return self._get_values(key)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 980, in _get_values
return self._values[indexer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 290
I wonder if this has to do with the view discussed in #699. The weird thing is these are very basic operations and I imagine this has come up before for someone else...
Anyhow, I'm closing this, #699 gave me the idea that this is just a very recent problem, it works fine with scanpy 1.4.1, I guess this is already on your radar via #699
This is separate from that. What's happening is that _get_obs_array had a change in behaviour during a bug fix.
What we should do is
Even with scanpy 1.4.1 my very simple (copied from the tutorial) script doesn't work. I'm getting the well-known "TypeError: Categorical is not ordered for operation max you can use .as_ordered() to change the Categorical to an ordered one". So I downgraded anndata, which led to another new error. I guess I'd also have to downgrade pandas now. This makes me wonder if there is some testing with a standard pipeline done before a release.
The max categorical error was one that I thought was addressed by anndata 0.6.18. I assume this is still on 0.6.22rc1? There was previously a switch from defaulting to ordered categoricals to unordered instead.
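For readers who have not hit this error before, here is a minimal pandas-only sketch of it. It does not involve scanpy or anndata at all; the series `labels` is a made-up stand-in for a categorical column in adata.obs. It only shows that pandas refuses `max` on an unordered categorical and that `.cat.as_ordered()` lifts the restriction.

```python
import pandas as pd

# Hypothetical stand-in for a categorical column such as a cluster label.
labels = pd.Series(pd.Categorical(["0", "1", "2", "1"]))

# On an unordered categorical, pandas raises TypeError for max/min.
try:
    labels.max()
    raised = False
except TypeError as err:
    raised = True
    print(err)

# Converting to an ordered categorical makes max/min well defined
# (the order follows the category order, here "0" < "1" < "2").
ordered_max = labels.cat.as_ordered().max()
print(ordered_max)
```

Whether a column ends up ordered or unordered depends on the library versions that created it, which is why the same script can work with one version and fail with another.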
There are quite a few unit tests... but clearly not perfect coverage. Others will be able to say more about the coverage than me.
The original bug you hit was with sc.pl.scatter, which has few tests.
I'd recommend trying out the master branches of AnnData and scanpy until new releases can be made in cases like these.
I just got the same error with a similar situation.
I get umap coordinates from a collaborator, which I store in adata.obs. Before the last update this worked:
sc.pl.scatter(adata, x='UMAP1', y='UMAP2', color='cell_type_class')
Now, this produces an IndexError: Key "UMAP1" is not valid observation/variable name/index. error.
Now I need to run this for the same plot:
sc.pl.scatter(adata, x='UMAP1', y='UMAP2', color='cell_type_class', use_raw=False)
These covariates are all in adata.obs.keys(). It seems that use_raw is taking precedence over x and y being from adata.obs.
Oh, I also get DeprecationWarning and FutureWarning about layer='X' being removed in future and obs_vector being used, while I assume these are just used in sc.pl.scatter in the background. I guess this is in the process of being fixed though.
This should be fixed in v1.4.4. Could you try that out and see if this is fixed?
Hi Isaac, I've updated to v1.4.4 but I'm still getting this problem. I've finally produced a minimal test case:
import scanpy as sc
sc.logging.print_versions()
#adata = sc.datasets.pbmc3k()
adata = sc.read("orig/transpose_rsem_cell_by_gene.tsv.gz")
print(adata)
adata = adata.T
print(adata)
adata.raw = adata
print(adata)
sc.pp.filter_cells(adata, min_genes=200)
print(adata)
adata = adata[adata.obs['n_genes'] < 5000, :]
print(adata)
adata = adata[adata.obs['n_genes'] > 100, :]
print(adata)
output is:
scanpy==1.4.4.post1 anndata==0.6.22.post1 umap==0.3.9 numpy==1.16.4 scipy==1.3.0 pandas==0.24.2 scikit-learn==0.21.2 statsmodels==0.10.0 python-igraph==0.7.1
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
AnnData object with n_obs × n_vars = 60498 × 466
AnnData object with n_obs × n_vars = 466 × 60498
AnnData object with n_obs × n_vars = 466 × 60498
AnnData object with n_obs × n_vars = 466 × 60498
obs: 'n_genes'
View of AnnData object with n_obs × n_vars = 311 × 60498
obs: 'n_genes'
Traceback (most recent call last):
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 977, in _get_values
return self._constructor(self._data.get_slice(indexer),
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1510, in get_slice
return self.__class__(self._block._slice(slobj),
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 268, in _slice
return self.values[slicer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 311
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 14, in <module>
adata = adata[adata.obs['n_genes'] > 100, :]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1230, in __getitem__
return self._getitem_view(index)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1234, in _getitem_view
return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 561, in __init__
self._init_as_view(X, oidx, vidx)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 633, in _init_as_view
self._raw = adata_ref.raw[oidx]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 335, in __getitem__
oidx, vidx = self._normalize_indices(index)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 363, in _normalize_indices
obs = _normalize_index(obs, self._adata.obs_names)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 160, in _normalize_index
positions = positions[index]
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 911, in __getitem__
return self._get_with(key)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 946, in _get_with
return self._get_values(key)
File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 980, in _get_values
return self._values[indexer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 311
The problem does not appear with the pbmc3k data. It does appear with any other expression matrix, as long as it is in text format.
I noted that the adata object when reading from a text file does not have a real .var content, the .var is a dataframe with just an index. But I have no idea if this is related to the problem.
One more thing: the exception does not happen if I comment out the line:
adata.raw = adata
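Reading the traceback, the failing step is `positions[index]` inside `_normalize_index`, called while the view tries to subset `adata_ref.raw`. A plausible reading (an interpretation of the traceback, not a confirmed diagnosis) is that the raw object still spans all 466 observations while the boolean mask was computed on the 311-row view. The sketch below reproduces only that shape mismatch with plain NumPy; the names and the `n_genes` values are made up for illustration, and anndata itself is not used.

```python
import numpy as np

n_raw_obs = 466   # observations still referenced through adata.raw
n_view_obs = 311  # observations left in the view after the first filter

# Integer positions over the full set of observations.
positions = np.arange(n_raw_obs)

# Hypothetical n_genes values for the 311 cells in the view.
n_genes_view = np.full(n_view_obs, 1000)
mask = n_genes_view > 100  # boolean mask of length 311

# Indexing 466 positions with a 311-long boolean mask fails.
try:
    positions[mask]
    raised = False
except IndexError as err:
    raised = True
    print(err)
```

This would also be consistent with the exception disappearing when adata.raw is never set, since then there is nothing of the original length left to index.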
Hmm...I must admit I don't understand why a "view" exists. Views are often tricky to get right, especially in a complex datastructure like anndata. They also slow down processing, especially if users may not be aware that the object they have is a view of something else. I don't see a good use case for views in my pipeline at least. Is there a way to switch off all views in anndata and just return a copy when slicing?
I wonder if it works if you use adata.raw = adata.copy() instead. Maybe the issue is a View in adata.raw?
I've just spent a while trying to replicate, before realizing I've seen this issue before over on AnnData (https://github.com/theislab/anndata/issues/182). I've got some good and bad news about this. It's fixed on master, but that fix is slated to be released in v0.7, which has intentionally breaking changes.
I find views very useful when dealing with large datasets interactively. They're also important for file backed data, since copies are extremely expensive in that case.
Unlike numpy, AnnData objects should always return a view when subset. If you'd like to get copies, you could add a .copy() to the end of your subsetting statement.
Hi Malte and Isaac, many thanks for this! Ah, yes that other issue was opened after I opened this one. I did search for the error message before I opened the ticket, but I didn't search again while the ticket was open.
The easiest workaround for me is simply to not use .raw anymore, for a pipeline, it's not really needed anyways.
Yes, I can see why it's important for file backed data, I just cannot see a use case for file backed mode either. Any useful operations on file backed data will be too slow anyways for practical use, and anyone can get a high-RAM machine these days on Amazon for a few hours, so I've always wondered why file backed mode exists. (Sidenote: file backed data is again a feature that sounds rather complicated to implement. As a user I love libraries that are small, stable and don't change a lot, especially for very foundational things like anndata. I guess it's a matter of development philosophy here.) Also, yes, it's because I don't use scanpy interactively that I don't see the use case for views.
anyhow, thanks again, also for all your work on Scanpy!
suddenly I have this problem, maybe related to an anndata upgrade. pip says all requirements are satisfied:
scanpy==1.4.3 anndata==0.6.22rc1 umap==0.3.9 numpy==1.16.4 scipy==1.2.1 pandas==0.24.2 scikit-learn==0.21.2 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1
I don't understand why anndata thinks that
IndexError: Key "n_counts" is not valid observation/variable name/index.
even though it's clearly in adata.obs... any suggestions what to do? add print statements to the various functions?