scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

n_counts not found? #728

Open maximilianh opened 5 years ago

maximilianh commented 5 years ago

Suddenly I have this problem, maybe related to an anndata upgrade. pip says all requirements are satisfied:

scanpy==1.4.3 anndata==0.6.22rc1 umap==0.3.9 numpy==1.16.4 scipy==1.2.1 pandas==0.24.2 scikit-learn==0.21.2 statsmodels==0.10.0 python-igraph==0.7.1 louvain==0.6.1


adata                                                                                                          
AnnData object with n_obs × n_vars = 466 × 28685 
    obs: 'GEO_Sample_age', 'age', 'age_unit', 'biosample_source_life_stage', 'biosample_source_gender', 'sample_category', 'biosample_cell_type', 'n_genes', 'n_counts', 'percent_mito'
    var: 'n_cells'

fig1 = sc.pl.scatter(adata, x='n_counts', y='n_genes', save="_gene_count")

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in _get_obs_array(self, k, use_raw, layer)
   1527         obs.keys and then var.index."""
   1528         if use_raw:
-> 1529             return self.raw.obs_vector(k)
   1530         else:
   1531             return self.obs_vector(k=k, layer=layer)

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in obs_vector(self, k)
    408         as `.obs_names`.
    409         """
--> 410         a = self[:, k].X
    411         if issparse(a):
    412             a = a.toarray()

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in __getitem__(self, index)
    331 
    332     def __getitem__(self, index):
--> 333         oidx, vidx = self._normalize_indices(index)
    334         if self._adata is not None or not self._adata.isbacked: X = self._X[oidx, vidx]
    335         else: X = self._adata.file['raw.X'][oidx, vidx]

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in _normalize_indices(self, packed_index)
    360         obs, var = unpack_index(packed_index)
    361         obs = _normalize_index(obs, self._adata.obs_names)
--> 362         var = _normalize_index(var, self.var_names)
    363         return obs, var
    364 

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in _normalize_index(index, names)
    153         return slice(start, stop, step)
    154     elif isinstance(index, (np.integer, int, str)):
--> 155         return name_idx(index)
    156     elif isinstance(index, (Sequence, np.ndarray, pd.Index)):
    157         # here, we replaced the implementation based on name_idx with this

~/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py in name_idx(i)
    140                 raise IndexError(
    141                     'Key "{}" is not valid observation/variable name/index.'
--> 142                     .format(i))
    143             i = i_found[0]
    144         return i

I don't understand why anndata thinks that

IndexError: Key "n_counts" is not valid observation/variable name/index.

even though it's clearly in adata.obs. Any suggestions on what to do? Should I add print statements to the various functions?

maximilianh commented 5 years ago

The weirdest thing is that if I write this adata object to an h5ad file with adata.write("temp.h5ad"), load it from there and run the same command, it works.

I wonder if this indicates some issue with the .obs object or some version issue...

maximilianh commented 5 years ago

Even something simple doesn't work anymore, without going through h5ad:

adata = adata[adata.obs['n_genes'] < up_thrsh_genes, :]
Traceback (most recent call last):
  File "/cluster/home/max/projects/czi/cellBrowser/src/cbScanpy", line 11, in <module>
    cellbrowser.cbScanpyCli()
  File "/cluster/home/max/projects/czi/cellBrowser/src/cbPyLib/cellbrowser/cellbrowser.py", line 4655, in cbScanpyCli
    adata, params = cbScanpy(matrixFname, metaFname, inCluster, confFname, figDir, logFname)
  File "/cluster/home/max/projects/czi/cellBrowser/src/cbPyLib/cellbrowser/cellbrowser.py", line 4353, in cbScanpy
    adata = adata[adata.obs['n_genes'] < up_thrsh_genes, :]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1224, in __getitem__
    return self._getitem_view(index)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1228, in _getitem_view
    return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 557, in __init__
    self._init_as_view(X, oidx, vidx)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 629, in _init_as_view
    self._raw = adata_ref.raw[oidx]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 333, in __getitem__
    oidx, vidx = self._normalize_indices(index)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 361, in _normalize_indices
    obs = _normalize_index(obs, self._adata.obs_names)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 160, in _normalize_index
    positions = positions[index]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 911, in __getitem__
    return self._get_with(key)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 946, in _get_with
    return self._get_values(key)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 980, in _get_values
    return self._values[indexer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 290

maximilianh commented 5 years ago

I wonder if this has to do with the view discussed in #699. The weird thing is these are very basic operations and I imagine this has come up before for someone else...

Anyhow, I'm closing this. #699 gave me the idea that this is just a very recent problem: it works fine with scanpy 1.4.1. I guess this is already on your radar via #699.

ivirshup commented 5 years ago

This is separate from that. What's happening is that _get_obs_array had a change in behaviour during a bug fix.

What we should do is

  1. Allow use_raw to be passed while referring to a column of obs in the deprecated method
  2. Finish removing all usage of the deprecated method from scanpy

maximilianh commented 5 years ago

Even with scanpy 1.4.1, my very simple script (copied from the tutorial) doesn't work. I'm getting the well-known "TypeError: Categorical is not ordered for operation max you can use .as_ordered() to change the Categorical to an ordered one". So I downgraded anndata, which led to another new error. I guess I'd also have to downgrade pandas now. This makes me wonder if there is some testing with a standard pipeline done before a release.

LuckyMD commented 5 years ago

The max categorical error was one that I thought was addressed by anndata 0.6.18. I assume this is still on 0.6.22rc1? There was previously a switch from defaulting to ordered categoricals to unordered instead.
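For reference, the TypeError quoted above can be reproduced in plain pandas, independent of anndata. The following is a minimal sketch; the label values are made up for illustration, and it only shows the pandas behaviour behind the message, not where scanpy or anndata trigger it.

```python
import pandas as pd

# Hypothetical cluster labels stored as an unordered categorical.
labels = pd.Series(["0", "1", "2", "1"], dtype="category")

try:
    labels.max()  # unordered categoricals have no defined maximum
    raised = False
except TypeError:
    raised = True

# Marking the categories as ordered makes min/max well defined.
ordered = labels.cat.as_ordered()
largest = ordered.max()

print(raised, largest)
```

Whether a column ends up ordered or unordered by default depends on the anndata version, as noted above, so this is a workaround to try rather than a confirmed fix.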

There are quite a few unit tests... but clearly not perfect coverage. Others will be able to say more about the coverage than me.

ivirshup commented 5 years ago

The original bug you hit was with the sc.pl.scatter which has few tests.

I'd recommend trying out the master branches of AnnData and scanpy until new releases can be made in cases like these.

LuckyMD commented 4 years ago

I just got the same error with a similar situation.

I get umap coordinates from a collaborator, which I store in adata.obs. Before the last update this worked: sc.pl.scatter(adata, x='UMAP1', y='UMAP2', color='cell_type_class'). Now, this produces an IndexError: Key "UMAP1" is not valid observation/variable name/index. error.

Now I need to run this for the same plot: sc.pl.scatter(adata, x='UMAP1', y='UMAP2', color='cell_type_class', use_raw=False)

These covariates are all in adata.obs.keys(). It seems that use_raw is taking precedence over x and y being from adata.obs.
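The lookup order described here can be pictured with a small toy model. This is only a sketch of the reported behaviour and of the change proposed earlier in the thread (letting use_raw be passed while referring to an obs column). The function names and gene names are hypothetical, and this is not anndata's actual code.

```python
def lookup_reported(key, obs_columns, var_names, raw_var_names, use_raw):
    """Toy model of the reported behaviour: use_raw skips the obs columns."""
    if use_raw:
        # Only raw variable names are consulted, so obs keys are never found.
        if key in raw_var_names:
            return ("raw", key)
        raise IndexError(f'Key "{key}" is not valid observation/variable name/index.')
    if key in obs_columns:
        return ("obs", key)
    if key in var_names:
        return ("var", key)
    raise IndexError(f'Key "{key}" is not valid observation/variable name/index.')


def lookup_obs_first(key, obs_columns, var_names, raw_var_names, use_raw):
    """Toy model of the proposed behaviour: obs columns win regardless of use_raw."""
    if key in obs_columns:
        return ("obs", key)
    names, source = (raw_var_names, "raw") if use_raw else (var_names, "var")
    if key in names:
        return (source, key)
    raise IndexError(f'Key "{key}" is not valid observation/variable name/index.')


obs_columns = {"UMAP1", "UMAP2", "cell_type_class"}
genes = {"GeneA", "GeneB"}  # hypothetical variable names

try:
    lookup_reported("UMAP1", obs_columns, genes, genes, use_raw=True)
    reported_failed = False
except IndexError:
    reported_failed = True

fixed = lookup_obs_first("UMAP1", obs_columns, genes, genes, use_raw=True)
print(reported_failed, fixed)
```

In this model, passing use_raw=False to the first function behaves like the workaround above: the obs column is found before any variable names are checked.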

LuckyMD commented 4 years ago

Oh, I also get DeprecationWarning and FutureWarning about layer='X' being removed in future and obs_vector being used, while I assume these are just used in sc.pl.scatter in the background. I guess this is in the process of being fixed though.

ivirshup commented 4 years ago

This should be fixed in v1.4.4. Could you try that out and see if this is fixed?

maximilianh commented 4 years ago

Hi Isaac, I've updated to v1.4.4 but I'm still getting this problem. I've finally produced a minimal test case:

import scanpy as sc
sc.logging.print_versions()
#adata = sc.datasets.pbmc3k()
adata = sc.read("orig/transpose_rsem_cell_by_gene.tsv.gz")
print(adata)
adata = adata.T
print(adata)
adata.raw = adata
print(adata)
sc.pp.filter_cells(adata, min_genes=200)
print(adata)
adata = adata[adata.obs['n_genes'] < 5000, :]
print(adata)
adata = adata[adata.obs['n_genes'] > 100, :]
print(adata)

output is:


scanpy==1.4.4.post1 anndata==0.6.22.post1 umap==0.3.9 numpy==1.16.4 scipy==1.3.0 pandas==0.24.2 scikit-learn==0.21.2 statsmodels==0.10.0 python-igraph==0.7.1 
Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
AnnData object with n_obs × n_vars = 60498 × 466 
AnnData object with n_obs × n_vars = 466 × 60498 
AnnData object with n_obs × n_vars = 466 × 60498 
AnnData object with n_obs × n_vars = 466 × 60498 
    obs: 'n_genes'
View of AnnData object with n_obs × n_vars = 311 × 60498 
    obs: 'n_genes'
Traceback (most recent call last):
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 977, in _get_values
    return self._constructor(self._data.get_slice(indexer),
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1510, in get_slice
    return self.__class__(self._block._slice(slobj),
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 268, in _slice
    return self.values[slicer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 311

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 14, in <module>
    adata = adata[adata.obs['n_genes'] > 100, :]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1230, in __getitem__
    return self._getitem_view(index)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 1234, in _getitem_view
    return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 561, in __init__
    self._init_as_view(X, oidx, vidx)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 633, in _init_as_view
    self._raw = adata_ref.raw[oidx]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 335, in __getitem__
    oidx, vidx = self._normalize_indices(index)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 363, in _normalize_indices
    obs = _normalize_index(obs, self._adata.obs_names)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/anndata/core/anndata.py", line 160, in _normalize_index
    positions = positions[index]
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 911, in __getitem__
    return self._get_with(key)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 946, in _get_with
    return self._get_values(key)
  File "/cluster/home/max/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/core/series.py", line 980, in _get_values
    return self._values[indexer]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 466 but corresponding boolean dimension is 311

The problem does not appear with the pbmc3k data. It does appear with any other expression matrix, as long as it is in text format.

I noticed that the adata object, when read from a text file, does not have real .var content; .var is a dataframe with just an index. But I have no idea if this is related to the problem.
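The traceback ends in self._raw = adata_ref.raw[oidx] with a 311-long boolean mask applied to 466 rows. One possible reading is that the second mask is computed on the view but applied to the unfiltered raw data. The NumPy-only sketch below reproduces that kind of mismatch. The gene counts are made up, and the composition step at the end is just one way to resolve such a mask, not necessarily how anndata fixed it.

```python
import numpy as np

n_genes = np.arange(466) * 20   # hypothetical per-cell gene counts for 466 cells
raw_rows = np.arange(466)       # stands in for the rows of the unfiltered .raw

first_mask = n_genes < 5000     # boolean mask over the 466 parent rows
view_n_genes = n_genes[first_mask]
second_mask = view_n_genes > 100  # as long as the view, not the parent

try:
    raw_rows[second_mask]       # mask length differs from 466
    mismatch = False
except IndexError:
    mismatch = True

# Translate the second mask back to parent row positions before indexing.
parent_rows = np.flatnonzero(first_mask)[second_mask]
subset = raw_rows[parent_rows]

print(mismatch, subset.shape)
```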

maximilianh commented 4 years ago

One more thing: the exception does not happen if I comment out the line:

adata.raw = adata

maximilianh commented 4 years ago

Hmm... I must admit I don't understand why a "view" exists. Views are often tricky to get right, especially in a complex data structure like anndata. They also slow down processing, especially if users may not be aware that the object they have is a view of something else. I don't see a good use case for views in my pipeline at least. Is there a way to switch off all views in anndata and just return a copy when slicing?

LuckyMD commented 4 years ago

I wonder if it works if you use adata.raw = adata.copy() instead. Maybe the issue is a View in adata.raw?

ivirshup commented 4 years ago

I've just spent a while trying to replicate, before realizing I've seen this issue before over on AnnData (https://github.com/theislab/anndata/issues/182). I've got some good and bad news about this. It's fixed on master, but that fix is slated to be released in v0.7, which has intentionally breaking changes.

I find views very useful when dealing with large datasets interactively. They're also important for file backed data, since copies are extremely expensive in that case.

Unlike numpy, AnnData objects should always return a view when subset. If you'd like to get copies, you could add a .copy() to the end of your subsetting statement.
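For comparison, here is a small sketch of the NumPy behaviour being contrasted: basic slicing gives a view, while boolean indexing gives a copy. The array is arbitrary. The AnnData line in the final comment is the pattern suggested above and is not executed here.

```python
import numpy as np

x = np.arange(10)

sliced = x[2:5]     # basic slicing: a view onto x
masked = x[x > 6]   # boolean indexing: a new array, independent of x

shares_slice = np.shares_memory(x, sliced)
shares_mask = np.shares_memory(x, masked)

sliced[0] = -1      # writes through to x[2]
masked[0] = -7      # leaves x untouched

explicit = x[2:5].copy()  # an explicit copy never aliases x

print(shares_slice, shares_mask, x[2], x[7])

# AnnData counterpart from the comment above (subsetting returns a view,
# so ask for the copy explicitly):
#   adata = adata[adata.obs['n_genes'] < 5000, :].copy()
```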

maximilianh commented 4 years ago

Hi Malte and Isaac, many thanks for this! Ah, yes that other issue was opened after I opened this one. I did search for the error message before I opened the ticket, but I didn't search again while the ticket was open.

The easiest workaround for me is simply to not use .raw anymore; for a pipeline, it's not really needed anyway.

Yes, I can see why it's important for file backed data, I just cannot see a use case for file backed mode either. Any useful operations on file backed data will be too slow for practical use anyway, and anyone can get a high-RAM machine these days on Amazon for a few hours, so I've always wondered why file backed mode exists. (Side note: file backed data is again a feature that sounds rather complicated to implement. As a user I love libraries that are small, stable and don't change a lot, especially for very foundational things like anndata. I guess it's a matter of development philosophy here.) Also, yes, it's because I don't use scanpy interactively that I don't see the use case for views.

Anyhow, thanks again, also for all your work on Scanpy!
